Decision Tree Algorithm Demystified: An Easy-to-Follow Introduction

Utsav Desai
Apr 28, 2023

What is a Decision Tree?

The decision tree algorithm is a popular supervised learning method in machine learning. It is a type of model that uses a tree-like structure to make predictions based on input features. In a decision tree, each internal node represents a test on a feature or attribute, each branch represents a possible value or outcome of that test, and each leaf node represents a predicted class.

The algorithm works by recursively splitting the data into subsets based on the values of the features. The splitting process continues until a stopping criterion is met, such as a maximum tree depth or a minimum number of instances in a leaf node. At each node, the algorithm chooses the feature that best separates the data into different classes or minimizes some other objective function, such as information gain or Gini impurity.

Once the tree is built, it can be used to predict the class of new instances by traversing the tree from the root to a leaf node that corresponds to the predicted class. The prediction is based on the majority class of the training instances that reach that leaf node.
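
To make this concrete, here is a minimal sketch of fitting and querying a decision tree with scikit-learn. The library choice, the 1/0 encoding of the features, and the toy values are my own assumptions for illustration; the article itself does not prescribe a particular tool.

# A minimal sketch, assuming scikit-learn is installed; feature encoding and
# toy values below are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Toy data: [has_fur, lays_eggs] encoded as 1/0; label 1 = mammal, 0 = non-mammal
X = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0, 0]

# Fit a tree; the criterion can be "gini" (the default) or "entropy" (information gain)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

# Predict the class of a new instance by traversing the tree from root to leaf
print(clf.predict([[1, 0]]))  # [1] -> mammal

Setting criterion="entropy" corresponds to the information-gain measure discussed later in this article; the Gini index is the library's default.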

Why use a Decision Tree?

There are several reasons why one might choose to use a decision tree algorithm in machine learning:

Advantages:

  1. Easy to interpret and explain: The decision tree model is intuitive and can be easily understood by non-technical people. The tree-like structure makes it easy to visualize the decision-making process.
  2. Handles both categorical and numerical features: Decision trees can handle a mix of categorical and numerical features, unlike some other machine learning models that only work with one or the other.
  3. Can handle missing values: Decision trees can handle missing values in the data without requiring imputation or removal of incomplete records.
  4. Non-parametric: Decision trees are non-parametric, which means they can fit complex relationships without assuming a specific distribution or form of the data.
  5. Robust to outliers: Decision trees can handle outliers in the data without being heavily influenced by them.

Disadvantages:

  1. Overfitting: Decision trees can easily overfit the data, especially when the tree is deep or the data is noisy or complex. This can lead to poor generalization performance on new, unseen data.
  2. Instability: Decision trees can be sensitive to small changes in the data, leading to different trees and potentially different predictions.
  3. Bias: Decision trees can be biased towards features with more levels or more splits, leading to less important features being overlooked.
  4. Greedy approach: Decision trees use a greedy approach to splitting, which means they may not always find the globally optimal solution.
  5. Lack of expressiveness: Decision trees may not be expressive enough to capture certain complex relationships in the data, such as interactions between features.

Overall, decision trees can be a powerful and interpretable tool for machine learning tasks, but they require careful tuning and validation to avoid overfitting and instability. Ensembling techniques, such as random forests or gradient boosting, can improve performance and mitigate some of the limitations of decision trees.
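
As a hedged illustration of that last point, the sketch below (again assuming scikit-learn, and using its built-in iris dataset purely as an example) compares a depth-limited single tree with a random forest under cross-validation:

# A sketch of depth limiting and ensembling, assuming scikit-learn and its
# bundled iris dataset; numbers will vary with data and settings.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# A depth-limited tree is less prone to overfitting than an unpruned one
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())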

How Does a Decision Tree Work?

  1. The feature for the root node is selected based on the results of an Attribute Selection Measure (ASM).
  2. The data is split on that feature, and the ASM is applied recursively to each sub-node until a node can no longer be split, at which point it becomes a leaf (terminal) node. A simplified code sketch of this recursive procedure is shown right after this list.
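
Here is that simplified, illustrative sketch of the recursive procedure in Python. This is not the exact implementation used by any particular library; the function names, the (features, label) data layout, and the use of Gini impurity (described in the next section) as the ASM are my own choices for this sketch.

from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels: 1 - sum(Pi^2)
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, feature_indices):
    # Pick the feature whose split gives the lowest weighted Gini impurity (the ASM step)
    labels = [label for _, label in rows]
    best_feature, best_score = None, gini(labels)
    for f in feature_indices:
        groups = {}
        for features, label in rows:
            groups.setdefault(features[f], []).append(label)
        weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
        if weighted < best_score:
            best_feature, best_score = f, weighted
    return best_feature

def build_tree(rows, feature_indices, depth=0, max_depth=3):
    # rows: list of (feature_tuple, label) pairs with categorical feature values
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, no features left, or maximum depth reached
    if len(set(labels)) == 1 or not feature_indices or depth >= max_depth:
        return majority
    f = best_split(rows, feature_indices)
    if f is None:  # no split improves purity -> majority-class leaf
        return majority
    remaining = [i for i in feature_indices if i != f]
    children = {}
    for value in set(features[f] for features, _ in rows):
        subset = [row for row in rows if row[0][f] == value]
        children[value] = build_tree(subset, remaining, depth + 1, max_depth)
    return (f, children)

# Tiny usage example with made-up data: features = (has_fur, lays_eggs)
data = [(("Yes", "No"), "Mammal"), (("Yes", "No"), "Mammal"),
        (("No", "Yes"), "Non-Mammal"), (("No", "Yes"), "Non-Mammal")]
print(build_tree(data, [0, 1]))

The tuple returned for an internal node records which feature was split on and the subtree for each of its values; a leaf is simply the majority class label of the rows that reach it.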

What is an Attribute Selection Measure (ASM)?

An Attribute Selection Measure (ASM) is a heuristic used while growing the tree to choose, at each node, the attribute that best separates the records by the target variable, so that the resulting sub-nodes are as pure as possible.

The two main ASM techniques are:

  1. Gini index
  2. Information Gain (ID3)

Gini index

The Gini index (also called Gini impurity) measures the probability that a randomly chosen instance would be wrongly classified if it were labeled at random according to the distribution of classes at a node. A Gini index of 0 means the node is pure (all instances belong to one class); the value is highest when the classes are evenly distributed.

Mathematical Formula:

Gini = 1 - Σ (Pi)²

where Pi = probability of an object being classified into a particular class.

When the Gini index is used as the criterion for selecting the splitting feature at a node, the feature whose split produces the lowest (weighted) Gini index is selected.
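
As a quick sanity check of the formula, here is a tiny Python sketch of the Gini impurity computation (the function name is my own):

def gini_impurity(probabilities):
    # Gini = 1 - sum(Pi^2), where Pi is the probability of class i
    return 1.0 - sum(p ** 2 for p in probabilities)

print(gini_impurity([0.5, 0.5]))   # 0.5  (maximum impurity for two classes)
print(gini_impurity([1.0, 0.0]))   # 0.0  (a pure node)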

Information Gain (ID3)

This criterion is built around entropy. Information gain measures how much knowing a feature or attribute reduces the entropy of the class variable, and the ID3 algorithm selects, at each node, the attribute with the maximum information gain. By using this method, we reduce the level of entropy from the root node to the leaf nodes.

Mathematical Formula:

Entropy(S) = -Σ pi*log2(pi)

Information Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv)

where pi denotes the probability of class i in the set S, Entropy(S) denotes the entropy of S, and Sv is the subset of S for which attribute A takes value v. The feature or attribute with the highest information gain is used as the root for the splitting.
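
The same formulas can be written as a short Python sketch (function names are mine; subsets holds one (weight, class-probabilities) pair per branch of a candidate split):

import math

def entropy(probabilities):
    # Entropy(S) = -sum(pi * log2(pi)), taking 0*log2(0) = 0
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_gain(parent_probs, subsets):
    # subsets: list of (weight, class probabilities) pairs, one per branch,
    # where weight = |Sv| / |S|
    weighted = sum(w * entropy(probs) for w, probs in subsets)
    return entropy(parent_probs) - weighted

# Two equally likely classes give the maximum entropy of 1 bit
print(entropy([0.5, 0.5]))  # 1.0
print(information_gain([0.5, 0.5], [(0.5, [1.0, 0.0]), (0.5, [0.25, 0.75])]))  # ~0.59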

Example To Understand Decision Tree

Let's walk through building a decision tree that classifies animals as either “mammal” or “non-mammal” based on two predictors: “has fur” and “lays eggs”.

Suppose we have a small dataset of 10 animals: 5 mammals and 5 non-mammals.

The target variable is “Mammal?”, which indicates whether the animal is a mammal or not. The predictor variables are “Has Fur” and “Lays Eggs”, which describe the characteristics of the animal.

To build a decision tree, we start with the root node and choose the feature that best splits the data into two subsets with the highest information gain. Information gain measures how much the entropy of the target variable is reduced by splitting the data based on a particular feature.

In our example, we choose “Has Fur” as the root node because it has the highest information gain among all the features. We split the data into two subsets based on the values of “Has Fur”: “Yes” and “No”.

For the “Yes” subset, we see that all instances have the same label “Yes”, so we assign the leaf node “Yes” for this subset.

For the “No” subset, we need to consider the next feature, “Lays Eggs”. We split the data into two subsets based on the values of “Lays Eggs”: “Yes” and “No”.

For the “Yes” subset, we see that all instances have the same label “No”, so we assign the leaf node “No” for this subset.

For the “No” subset, we see that all instances also have the same label “No”, so we assign the leaf node “No” for this subset. (Since both branches end in the same label, the whole “Has Fur = No” subset could equally well be left as a single leaf, which is what the simpler tree at the end of the article does.)

The resulting decision tree looks like this:

Has Fur?
|
+--- Yes ---> Mammal
|
+--- No ----> Lays Eggs?
              |
              +--- Yes ---> Non-Mammal
              |
              +--- No ----> Non-Mammal

To make a prediction for a new instance, we start at the root node and follow the path down the tree based on the values of the features. For example, if an animal has fur, we follow the branch “Has Fur: Yes” and immediately predict “Mammal”; if it has no fur and lays eggs, we follow “Has Fur: No” -> “Lays Eggs: Yes” and predict “Non-Mammal”.
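
The traversal described above can also be written as a few hard-coded if/else checks that mirror this particular tree (purely illustrative):

def predict(has_fur, lays_eggs):
    # Hard-coded traversal of the tree above
    if has_fur == "Yes":
        return "Mammal"
    else:
        # Both "Lays Eggs" branches happen to give the same label in this example
        if lays_eggs == "Yes":
            return "Non-Mammal"
        else:
            return "Non-Mammal"

print(predict("Yes", "No"))   # Mammal
print(predict("No", "Yes"))   # Non-Mammal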

The decision tree is easy to interpret and can handle both categorical and binary features. However, it may not perform well when dealing with complex relationships between features or noisy data. In addition, small changes in the data can lead to different trees, which may not be robust.

Calculation of Information Gain for “Has Fur”

We first calculate the entropy of the target variable:

Entropy(S) = -p(Mammal)*log2(p(Mammal)) - p(Non-Mammal)*log2(p(Non-Mammal))
= -0.5*log2(0.5) - 0.5*log2(0.5)
= 1.0

where p(Mammal) = p(Non-Mammal) = 0.5 since there are 5 mammals and 5 non-mammals in the dataset.

Next, we calculate the entropy of the subsets when the data is split based on the “Has Fur” feature:

Entropy(Has Fur=Yes) = -p(Mammal)*log2(p(Mammal)) - p(Non-Mammal)*log2(p(Non-Mammal))
= -1*log2(1) - 0*log2(0)
= 0

Entropy(Has Fur=No) = -p(Mammal)*log2(p(Mammal)) - p(Non-Mammal)*log2(p(Non-Mammal))
= -0.25*log2(0.25) - 0.75*log2(0.75)
= 0.81

where p(Mammal) = 1.0 and p(Non-Mammal) = 0.0 for the subset with "Has Fur"=Yes, and p(Mammal) = 0.25 and p(Non-Mammal) = 0.75 for the subset with "Has Fur"=No.

Finally, we calculate the information gain:

Information Gain(Has Fur) = Entropy(S) - [p(Has Fur=Yes)*Entropy(Has Fur=Yes) + p(Has Fur=No)*Entropy(Has Fur=No)]
= 1.0 - [0.5*0 + 0.5*0.81]
= 1.0 - 0.405
= 0.595

Calculation of Information Gain for “Lays Eggs”

We first calculate the entropy of the target variable:

Entropy(S) = -p(Mammal)*log2(p(Mammal)) - p(Non-Mammal)*log2(p(Non-Mammal))
= -0.5*log2(0.5) - 0.5*log2(0.5)
= 1.0

where p(Mammal) = p(Non-Mammal) = 0.5 since there are 5 mammals and 5 non-mammals in the dataset.

Next, we calculate the entropy of the subsets when the data is split based on the “Lays Eggs” feature:

Entropy(Lays Eggs=Yes) = -p(Mammal)*log2(p(Mammal)) - p(Non-Mammal)*log2(p(Non-Mammal))
= -0.6*log2(0.6) - 0.4*log2(0.4)
= 0.971

Entropy(Lays Eggs=No) = -p(Mammal)*log2(p(Mammal)) - p(Non-Mammal)*log2(p(Non-Mammal))
= -1*log2(1) - 0*log2(0)
= 0

where p(Mammal) = 0.6 and p(Non-Mammal) = 0.4 for the subset with "Lays Eggs"=Yes, and p(Mammal) = 1.0 and p(Non-Mammal) = 0.0 for the subset with "Lays Eggs"=No.

Finally, we calculate the information gain:

Information Gain(Lays Eggs) = Entropy(S) - [p(Lays Eggs=Yes)*Entropy(Lays Eggs=Yes) + p(Lays Eggs=No)*Entropy(Lays Eggs=No)]
= 1.0 - [0.6*0.971 + 0.4*0]
= 1.0 - 0.583
= 0.417

Based on this calculation, we can see that the “Has Fur” feature has a higher information gain (0.595) than the “Lays Eggs” feature (0.417), so it would be chosen as the best split for building the decision tree.
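
These hand calculations can be double-checked with a few lines of Python, using the class proportions stated above (small differences from the values in the text are just rounding):

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Class proportions as stated in the text
S = entropy([0.5, 0.5])  # 1.0

ig_has_fur = S - (0.5 * entropy([1.0, 0.0]) + 0.5 * entropy([0.25, 0.75]))
ig_lays_eggs = S - (0.6 * entropy([0.6, 0.4]) + 0.4 * entropy([1.0, 0.0]))

print(round(ig_has_fur, 3))    # ~0.59
print(round(ig_lays_eggs, 3))  # ~0.42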

The resulting decision tree would look like this:

Has Fur?
|
+--- Yes ---> Mammal
|
+--- No ---> Non-Mammal

At the root node, we split the data based on the “Has Fur” feature. If an animal has fur, we classify it as a mammal, and if it does not have fur, we classify it as a non-mammal.
