What is a Decision Tree?

At a high level, a decision tree is nothing more than a series of if-else conditions that ultimately leads to a decision. A simple practical example: someone deciding whether to go to the beach or the movies on a given day, based on the outside temperature and whether it is sunny or cloudy. If it is sunny and the temperature is above 70 degrees, the person might opt for the beach. If it is sunny and 70 or below, or if it is cloudy at any temperature, they might choose the movies instead. That decision logic can be represented structurally in the following diagram of a decision tree. The circles represent decision nodes: variables whose values determine which branch is taken at the next level of the tree. The squares, or leaves, represent the final destinations reached by following a particular sequence of decision nodes.

Fig 1: Example of a Decision Tree
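
The same logic can be written directly as nested if-else statements. Here is a minimal sketch (the function name and argument names are chosen for illustration; the 70-degree threshold and the sunny/cloudy split come from the example above):

```python
def choose_activity(weather: str, temperature: float) -> str:
    """Toy decision tree for the beach-or-movies example."""
    if weather == "sunny":       # root decision node
        if temperature > 70:     # second-level decision node
            return "beach"       # leaf
        return "movies"          # leaf
    return "movies"              # cloudy at any temperature -> leaf

print(choose_activity("sunny", 75))   # beach
print(choose_activity("sunny", 65))   # movies
print(choose_activity("cloudy", 80))  # movies
```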

Decision trees can be used in an analogous way to build predictive models for real-world supervised learning problems. Given a data set with a target variable and several candidate predictors, a decision tree is constructed by first identifying the predictor, and the corresponding splitting criterion, that is most predictive of the target. This first split forms the root node of the tree, which in the example above is the weather. The algorithm then continues creating additional splits until every observation has been allocated to a leaf node. In a classification problem, a tree grown to full depth can place every observation in the training data into its appropriate destination node perfectly. This points to one of the biggest drawbacks of single decision trees: they are notorious for overfitting the training data, meaning they do not generalize well to unseen observations. In modern machine learning, decision trees are therefore most often used as the individual components of ensemble algorithms such as Random Forest or Gradient Boosting, which build many decision trees and thereby address the overfitting that arises when using only one.
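
As an illustrative sketch of that overfitting behavior (the use of scikit-learn and a synthetic data set are assumptions here, not part of the text), an unconstrained single tree typically scores perfectly on its own training data while trailing an ensemble on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data stands in for a real-world data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single tree grown to full depth memorizes the training set...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("tree   train:", tree.score(X_train, y_train))   # typically 1.0
print("tree   test: ", tree.score(X_test, y_test))     # noticeably lower

# ...while an ensemble of many trees generalizes better to unseen data.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("forest test: ", forest.score(X_test, y_test))   # usually higher
```

The exact scores depend on the data, but the gap between the single tree's training and test accuracy is the overfitting the paragraph describes, and the forest's higher test accuracy shows how averaging many trees mitigates it.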