Last modified on 01 Oct 2021.

What’s the idea of Decision Tree Classifier?

The basic intuition behind a decision tree is to map out all possible decision paths in the form of a tree. It can be used for classification and regression (note). In this post, let’s try to understand the classifier.

Suppose that we have a dataset $S$ like in the figure below [ref, Table 1.2],

An example of dataset $S$.

A decision tree we want.

There are many algorithms which can help us build a tree like the one above. In Machine Learning, we usually use:

  • ID3 (Iterative Dichotomiser): uses information gain / entropy.
  • CART (Classification And Regression Tree): uses Gini impurity.

Some basic concepts

Concepts with a tree.

  • Splitting: It is a process of dividing a node into two or more sub-nodes.
  • Pruning: When we remove sub-nodes of a decision node, this process is called pruning.
  • Parent node and Child node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

ID3 algorithm

  1. To check the disorder at the current node (let’s say $S$, the parent node), we calculate its entropy with,

    $$H(S) = -\sum_{i=1}^{2} p_{S,i} \log_2 p_{S,i},$$

    where $i$ runs over the classes and $p_{S,i}$ is the probability of class $i$ in $S$.

  2. If the entropy at this node shows that it is pure (there is only 1 class, or one class is the clear majority) or it meets the stopping conditions, we stop splitting at this node. Otherwise, go to the next step.

  3. Calculate the information gain (IG) after splitting node $S$ on each attribute (for example, consider attribute $O$). The attribute with the biggest IG will be chosen!

    $$\underbrace{IG(S,O)}_{\text{information gain}} = \underbrace{H(S)}_{\text{entropy before split}} - \underbrace{\sum_j P(O_j \mid S) \times H(S,O_j)}_{\text{weighted entropy after split}}$$

    where $j$ runs over the different properties of $O$ and $P(O_j \mid S)$ is the probability of property $O_j$ in $S$.

  4. After splitting, we have new child nodes. Each of them becomes a new parent node in the next step. Go back to step 1 (see the code sketch right after this list).
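To make these steps concrete, here is a minimal Python sketch of the ID3 loop (not from the original post; the function names and the list-of-dicts data layout are just illustrative assumptions):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i), with the convention 0 * log2(0) = 0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # IG(S, O) = H(S) - sum_j P(O_j) * H(S, O_j)
    gain = entropy([r[target] for r in rows])
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    # step 2: stop when the node is pure or there is nothing left to split on
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf = majority class
    # step 3: pick the attribute with the biggest information gain
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # step 4: each value of the chosen attribute becomes a new child node
    return {best: {v: id3([r for r in rows if r[best] == v], remaining, target)
                   for v in set(r[best] for r in rows)}}

Calling something like id3(rows, ['Outlook', 'Temperature', 'Humidity', 'Windy'], 'Play') on a table shaped like the dataset above would return a nested dict representing the tree.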

How do we know that we should split the dataset $S$ based on the Outlook attribute instead of the others (Temperature, Humidity, Windy)? $\Rightarrow$ We calculate the information gain after splitting $S$ on each attribute. It tells us how much certainty we gain after the split. The attribute with the highest information gain will be chosen (after this section, you will see that the Outlook attribute has the highest information gain).

In order to calculate the information gain, we need “entropy” which is the amount of information disorder or the amount of randomness in the data.

$$\text{information gain} = \text{entropy before split} - \text{weighted entropy after split}$$

At the beginning, the entropy before split ($H(S)$) shows us the disorder of the whole dataset $S$. If $S$ contains only Yes, $S$ has no disorder, i.e. it’s pure ($H(S)=0$). If the amounts of Yes and No in $S$ are equal, $S$ has the highest disorder ($H(S)=1$).

An illustration of entropy with different proportions of Yes/No in $S$.

At each node, we need to recalculate its entropy (corresponding to the numbers of Yes and No at this node). We prefer the lowest entropy, of course! How can we calculate the entropy of each node? More specifically, how do we calculate $H(S)$?

$$H(S) = -\sum_{i=1}^{2} p_{S,i} \log_2 p_{S,i},$$

where $i$ runs over the classes (node $S$ has 2 classes, Yes and No) and $p_{S,i}$ is the probability of class $i$ in $S$.

Graph of $H(p)$ in the case of 2 classes. The maximum is 1.

In this case, we use $\log_2$ (binary logarithm) to obtain the maximum $H(S)=1$, and we also use the convention $0\times\log_2(0)=0$. Other documents use $\log$ (natural logarithm) instead.

On node $S$, we have,

$$\begin{aligned} H(S) &= H([9,5]) \\ &= -\frac{9}{14} \times \log_2\left(\frac{9}{14}\right) - \frac{5}{14}\times\log_2\left(\frac{5}{14}\right)\\ &= 0.94. \end{aligned}$$

We see that $S$ is not pure, but it’s also not totally disordered.

The frequency of classes in $S$.
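As a quick sanity check (working directly from the class counts 9 Yes / 5 No; the helper name is just illustrative), we can reproduce this value in Python:

import math

def entropy(counts):
    # H = -sum_i p_i * log2(p_i), computed from class counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 2))   # 0.94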

We are considering splitting $S$ on $O$ (Outlook), and $O$ has 3 different properties: Sunny, Overcast and Rainy. Corresponding to these properties, we have different numbers of Yes/No (different child nodes contain different amounts of data, but their total equals the size of $S$, their “parent” node). That’s why we need to calculate the weighted entropy (the weighted entropy after split),

$$\sum_{j=1}^3 P(O_j) \times H(S,O_j),$$

where $j$ runs over the different properties of $O$ and $P(O_j)$ is the probability of property $O_j$ in $O$. Therefore, the information gain if we split $S$ on $O$ is,

$$IG(S,O) = H(S) - \sum_{j=1}^3 P(O_j) \times H(S,O_j).$$

If we split $S$ on Outlook ($O$), there will be 3 branches.

For example, consider branch $O_1$ (Sunny); it has $P(O_1)=\frac{5}{14}$ and the entropy at this node, $H(S,O_1)$, is calculated as

$$\begin{aligned} H(S,O_1) &= H([2,3]) \\ &= -\frac{2}{5} \times \log_2\left(\frac{2}{5}\right) - \frac{3}{5}\times\log_2\left(\frac{3}{5}\right)\\ &= 0.97. \end{aligned}$$

Only consider branch Sunny ($O_1$).

Thus, the information gain after splitting $S$ on $O$ is,

$$\begin{aligned} IG(S,O) &= H(S) - \sum_{j=1}^3 P(O_j) \times H(S,O_j) \\ &= 0.94 - \left(\frac{5}{14}\times 0.971 + \frac{4}{14}\times 0 + \frac{5}{14}\times 0.971\right) \\ &= 0.94 - 0.693 = 0.247. \end{aligned}$$
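Using only the class counts from the figures above (9/5 overall; 2/3 for Sunny, 4/0 for Overcast, 3/2 for Rainy), a short script (the names are illustrative) reproduces this value:

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    # IG = H(parent) - sum_j P(child_j) * H(child_j)
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# split S on Outlook: Sunny [2 Yes, 3 No], Overcast [4, 0], Rainy [3, 2]
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247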

With the same method, we can calculate the information gain after splitting $S$ on the other attributes (Temperature, Windy, Humidity) and get,

The dataset is split in different ways.

We can see that the winner is Outlook, with the highest information gain. We split $S$ on that attribute first!

The dataset is split on Outlook.

How about the 3 remaining attributes (Temperature, Humidity, Windy), which one should be chosen next? We only have to decide this on branches Sunny and Rainy, because on branch Overcast the node is pure (all are Yes) and we don’t need to split any more.

Temperature, Humidity and Windy remain. Which attribute will be chosen next?

We repeat the steps again. For example, on the branch $O_1$ (Sunny), we calculate the IG after splitting $O_1$ on each attribute Temperature ($T$), Humidity ($H$) or Windy ($W$). In other words, we need to calculate $IG(O_1,T)$, $IG(O_1,H)$ and $IG(O_1,W)$ and then compare them to find the best one. Let’s consider $H$ (Humidity) as an example,

$$\begin{aligned} IG(O_1,H) &= H(O_1) - \sum_{j=1}^2 P(H_j \mid O_1) \times H(O_1,H_j) \\ &= H(S,O_1) - \frac{3}{5}\times 0 - \frac{2}{5} \times 0 \\ &= 0.971. \end{aligned}$$

Nodes $H_1$ and $H_2$ are pure, which is why their entropies are $0$.

Consider branch $O_1$ and attribute Humidity ($H$).

Similarly, we calculate $IG(O_1,T)$ and $IG(O_1,W)$, and we see that $IG(O_1,H)$ is the biggest one! So we choose $H$ (Humidity) to split at node $O_1$. On the branch $O_3$ (Rainy), the biggest information gain after splitting is on $W$ (Windy).
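The same check works here too; with the counts used above (the Sunny branch has 2 Yes / 3 No, and splitting it on Humidity produces two pure children of sizes 3 and 2), a couple of lines reproduce the 0.971:

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# H(O_1) minus the (zero) entropies of its two pure children
ig_o1_h = entropy([2, 3]) - (3/5 * entropy([3, 0]) + 2/5 * entropy([0, 2]))
print(round(ig_o1_h, 3))   # 0.971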

The final decision tree.

From now on, if we have a new input which contains information about Outlook, Temperature, Humidity and Windy, we go from the top of the tree and follow the appropriate branches to get the decision Yes or No.

CART algorithm

The difference between the two algorithms is the difference between $H(S)$ and $I_G(S)$.

  1. To check the disorder at the current node (let’s say $S$, the parent node), we calculate its Gini impurity with,

    $$I_G(S) = \sum_{i=1}^{2} p_{S,i}(1-p_{S,i}),$$

    where $i$ runs over the classes in $S$ and $p_{S,i}$ is the probability of class $i$ in $S$.

  2. If this node is pure (there is only 1 class, or one class is the clear majority) or it meets the stopping conditions, we stop splitting at this node. Otherwise, go to the next step.

  3. Calculate the Gini gain (GG) after splitting node $S$ on each attribute (for example, consider attribute $O$). The attribute with the biggest GG will be chosen!

    $$\underbrace{GG(S,O)}_{\text{gini gain}} = \underbrace{I_G(S)}_{\text{gini impurity before split}} - \underbrace{\sum_j P(O_j \mid S) \times I_G(S,O_j)}_{\text{weighted gini impurity after split}}$$

    where $j$ runs over the different properties of $O$ and $P(O_j \mid S)$ is the probability of property $O_j$ in $S$.

  4. After splitting, we have new child nodes. Each of them becomes a new parent node in the next step. Go back to step 1.

It’s almost the same as the ID3 algorithm, except that it’s based on the Gini impurity instead of the entropy. Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.

At every non-leaf node (which isn’t pure), we have to answer the question “Which attribute should we choose to split that node?” We calculate the Gini gain of each candidate split, based on the attribute we are going to use. This Gini gain plays the same role as the information gain. The highest one will be chosen.

$$\text{gini gain} = \text{gini impurity before split} - \text{weighted gini impurity after split}$$

The Gini impurity at node $S$ is calculated as,

$$I_G(S) = \sum_{i=1}^2 p_{S,i}(1-p_{S,i}),$$

where $i$ runs over the classes in $S$ and $p_{S,i}$ is the probability of class $i$ in $S$. $I_G=0$ is the best!

On node $S$, we have,

$$\begin{aligned} I_G(S) &= I_G([9,5]) \\ &= \frac{9}{14} \times \frac{5}{14} + \frac{5}{14} \times \frac{9}{14}\\ &= 0.459. \end{aligned}$$

The frequency of classes in $S$.
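A quick check of this value from the same class counts (the helper name is just illustrative):

def gini(counts):
    # I_G = sum_i p_i * (1 - p_i), computed from class counts
    total = sum(counts)
    return sum(c / total * (1 - c / total) for c in counts)

print(round(gini([9, 5]), 3))   # 0.459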

Similarly to the information gain, we can calculate the Gini gain ($GG$) after splitting $S$ on the attribute $O$ with,

$$GG(S,O) = I_G(S) - \sum_{j=1}^3 P(O_j) \times I_G(S,O_j),$$

where $j$ runs over the different properties of $O$ and $P(O_j)$ is the probability of property $O_j$ in $O$.

If we split $S$ on Outlook ($O$), there will be 3 branches.

Applying the above equation, we calculate the GG of splitting $S$ on each attribute and get,

$$\begin{aligned} GG(S,O) &= I_G([9,5]) - \left( \frac{5}{14}\times I_G([2,3]) + \frac{4}{14}\times I_G([4,0]) + \frac{5}{14}\times I_G([3,2])\right) \\ &= \ldots \end{aligned}$$

The same goes for $GG(S,H)$ (Humidity), $GG(S,T)$ (Temperature) and $GG(S,W)$ (Windy). Following the same arguments as in the ID3 section, we will get the final tree. Again, the difference between the two algorithms is the difference between $H(S)$ and $I_G(S)$.
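Filling in the dots with code (the class counts are the same ones used earlier; the resulting number is my own computation, not a value from the post):

def gini(counts):
    total = sum(counts)
    return sum(c / total * (1 - c / total) for c in counts)

def gini_gain(parent_counts, children_counts):
    # GG = I_G(parent) - sum_j P(child_j) * I_G(child_j)
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini(child) for child in children_counts)
    return gini(parent_counts) - weighted

# split S on Outlook: Sunny [2, 3], Overcast [4, 0], Rainy [3, 2]
print(round(gini_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # approximately 0.116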

Gini Impurity or Entropy? [ref]

  • Most of the time, they lead to similar trees.[ref]
  • Gini impurity is slightly faster.[ref]
  • Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.

Good / Bad of Decision Tree? [ref]

Some highlight advantages of Decision Tree Classifier:

  1. Can be used for regression or classification.
  2. Can be displayed graphically.
  3. Highly interpretable.
  4. Can be specified as a series of rules, and more closely approximate human decision-making than other models.
  5. Prediction is fast.
  6. Features don’t need scaling.
  7. Automatically learns feature interactions.
  8. Tends to ignore irrelevant features.
  9. Non-parametric (will outperform linear models if relationship between features and response is highly non-linear).

Its disadvantages:

  1. Performance is (generally) not competitive with the best supervised learning methods.
  2. Can easily overfit the training data (tuning is required).
  3. Small variations in the data can result in a completely different tree (high variance).
  4. Recursive binary splitting makes “locally optimal” decisions that may not result in a globally optimal tree.
  5. Doesn’t tend to work well if the classes are highly unbalanced.
  6. Doesn’t tend to work well with very small datasets.

When to stop?

If the number of features is too large, we’ll have a very large tree! It also easily leads to an overfitting problem. How can we avoid that?

  1. Pruning: removing the branches that make use of features having low importance.
  2. Set a minimum number of training samples required to split a node (min_samples_split in scikit-learn) or to keep a leaf (min_samples_leaf). If a split doesn’t satisfy it, it isn’t made.
  3. Set the maximum depth of the tree. In scikit-learn, use max_depth (see the sketch right after this list).
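For example, a sketch of how these stopping conditions appear in scikit-learn (the numbers are arbitrary and only illustrate the parameters; recent versions also expose ccp_alpha for cost-complexity pruning):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion='gini',       # or 'entropy'
    max_depth=3,            # stop growing below depth 3
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
)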

When do we need to use a Decision Tree?

  • When explainability is prioritised over accuracy. Otherwise, we tend to use Random Forest.
  • When the data is more non-parametric in nature.
  • When we want a simple model.
  • When the entire dataset and all features can be used.
  • When we have limited computational power.
  • When we are not worried about accuracy on future datasets.

Using Decision Tree Classifier with Scikit-learn

Load and create

Load the library,

from sklearn.tree import DecisionTreeClassifier

Create a decision tree (other parameters):

# The Gini impurity (default)
clf = DecisionTreeClassifier() # criterion='gini'
# The information gain (ID3) 
clf = DecisionTreeClassifier(criterion='entropy')

An example,

from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

# predict the class of a new sample
clf.predict([[2., 2.]])        # array([1])
# probability of each class
clf.predict_proba([[2., 2.]])  # array([[0., 1.]])

Plot and Save plots

Plot the tree (you may need to install Graphviz first; don’t forget to add its installation folder to your $PATH),

from IPython.display import Image
import pydotplus
dot_data = tree.export_graphviz(clf, out_file=None, 
                                rounded=True, 
                                filled=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

Iris tree.
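If you’d rather not install Graphviz, recent scikit-learn versions (0.21+) also ship a built-in plotter; a minimal sketch using the same clf as above:

from sklearn import tree
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
tree.plot_tree(clf, filled=True, rounded=True, ax=ax)
plt.show()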

Save the tree (this follows the code in “Plot the tree”),

graph.write_pdf("tree.pdf")   # to pdf
graph.write_png("thi.png")    # to png

References