Entropy, Information Gain, Gini Index, Reducing Impurity?
- The nodes of a decision tree are split on different attributes. There are a few algorithms to find the optimal partition.
1. ID3 (Iterative Dichotomiser 3):
This algorithm uses Entropy and Information Gain as metrics to build the decision tree.
The attribute with the highest information gain becomes the root node, and the same procedure is applied recursively to each branch.
Entropy is the measure that characterizes the impurity of an arbitrary collection of examples.
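- For a two-class problem, entropy varies from 0 to 1: it is 0 if all the data belong to a single class and 1 if the two classes are equally represented. With k classes the maximum is log2(k), which is greater than 1 when k > 2. In this way, entropy gives a measure of impurity in the dataset.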
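The entropy measure described above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation; the function name `entropy` and the example labels are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Each class contributes p * log2(1/p), where p is its share of the data.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

pure = entropy(["yes", "yes", "yes", "yes"])       # single class
balanced = entropy(["yes", "yes", "no", "no"])     # two classes, equal split
three_way = entropy(["a", "b", "c"])               # three classes, equal split
```

With these inputs, `pure` is 0 and `balanced` is 1, while `three_way` equals log2(3), which shows that the 0-to-1 range only holds for two classes.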
Steps to decide which attribute to split:
Compute the entropy for the dataset.
For every attribute:
Calculate the entropy of the subset for each categorical value of the attribute.
Take the weighted average of these entropies, weighting each subset by its share of the data.
Calculate the gain for the current attribute.
Pick the attribute with the highest information gain.
Repeat on each resulting subset until we get the desired tree.
- A node becomes a leaf when its entropy is zero, that is, when all of its examples belong to one class.
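- Information Gain(S, A) = Entropy(S) - ∑ (|Sb|/|S|) * Entropy(Sb)
- S - the entire dataset at the node, Sb - the subset of S for one value b of attribute A. The first term equals 1 only when S has two equally represented classes.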
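The attribute-selection steps can be sketched as follows. The sketch uses the standard definition of information gain, the dataset entropy minus the size-weighted entropy of the subsets. The function names and the tiny four-row dataset (`outlook`, `windy`, `play`) are hypothetical and exist only to make the example runnable.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def information_gain(rows, attribute, target):
    """Entropy(S) minus the size-weighted entropy of the subsets Sb."""
    base = entropy([row[target] for row in rows])
    subsets = defaultdict(list)
    for row in rows:
        # Group target labels by the value this row has for the attribute.
        subsets[row[attribute]].append(row[target])
    weighted = sum(
        (len(labels) / len(rows)) * entropy(labels)
        for labels in subsets.values()
    )
    return base - weighted

def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
chosen = best_attribute(rows, ["outlook", "windy"], "play")
```

A full ID3 implementation would call `best_attribute` recursively on each subset and stop when a subset's entropy reaches zero; that recursion is left out here to keep the sketch short.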
2. CART (Classification and Regression Trees):
In CART, we use the Gini index as a metric. The Gini index is used as a cost function to evaluate splits in a dataset.
Steps to calculate Gini for a split:
- Calculate Gini for each sub-node as the sum of the squared probabilities of success and failure (p^2 + q^2). This score is 1 minus the usual Gini impurity, so a higher value means a purer node.
- Calculate Gini for the split as the weighted average of the sub-node scores, weighted by sub-node size.
Choose the split with the higher weighted Gini score. For example:
- Split on Gender:
Gini for sub-node Female = (0.2)(0.2)+(0.8)(0.8) = 0.68
Gini for sub-node Male = (0.65)(0.65)+(0.35)(0.35) = 0.55
Weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59
- Similarly, split on Class:
Gini for sub-node Class IX = (0.43)(0.43)+(0.57)(0.57) = 0.51
Gini for sub-node Class X = (0.56)(0.56)+(0.44)(0.44) = 0.51
Weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51
- Here the weighted Gini score is higher for Gender (0.59 vs 0.51), so we split on Gender.
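The arithmetic of this worked example can be checked with a short Python sketch. It follows the convention used in these notes, where the score is p^2 + q^2 and higher means purer; the usual Gini impurity is 1 minus this value and would be minimized instead. The function names are assumptions; the sub-node sizes and probabilities are taken from the example.

```python
def gini_score(p):
    """Purity score p^2 + q^2 for a two-class node, where q = 1 - p."""
    q = 1 - p
    return p * p + q * q

def weighted_gini(nodes):
    """Size-weighted average score for a split.

    nodes is a list of (size, p) pairs, one per sub-node.
    """
    total = sum(size for size, _ in nodes)
    return sum((size / total) * gini_score(p) for size, p in nodes)

# Split on Gender: Female (10 of 30, p = 0.2), Male (20 of 30, p = 0.65)
gender = weighted_gini([(10, 0.2), (20, 0.65)])
# Split on Class: Class IX (14 of 30, p = 0.43), Class X (16 of 30, p = 0.56)
school_class = weighted_gini([(14, 0.43), (16, 0.56)])

# With this convention the higher score marks the better split.
best = "Gender" if gender > school_class else "Class"
```

Rounded to two decimals, `gender` is 0.59 and `school_class` is 0.51, matching the hand calculation, so `best` is `"Gender"`.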