#11 Machine Learning & Data Science Challenge 11

Entropy, Information Gain, Gini Index, Reducing Impurity?

  • A decision tree splits each node on an attribute, and different attributes produce different splits. There are a few algorithms for choosing the best split.

1. ID3 (Iterative Dichotomiser 3):

  • This algorithm uses entropy and information gain as metrics to build a better decision tree.

  • The attribute with the highest information gain is used as the root node, and the same approach is then applied to each branch.

  • Entropy is the measure that characterizes the impurity of an arbitrary collection of examples.

  • For a two-class problem, entropy varies from 0 to 1: it is 0 if all the data belong to a single class and 1 if the two classes are equally represented. With k classes, the maximum is log2(k). In this way, entropy gives a measure of impurity in the dataset.
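As a minimal sketch of the entropy measure described above, assuming base-2 logarithms (which give the 0-to-1 range for two classes). The function name `entropy` and the sample labels are illustrative, not from the original article.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of -p * log2(p) over the classes that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

pure = ["yes", "yes", "yes", "yes"]      # a single class
balanced = ["yes", "yes", "no", "no"]    # two classes, equal share

print(entropy(pure))      # 0.0
print(entropy(balanced))  # 1.0
```

A pure collection scores 0 and an evenly split two-class collection scores 1, matching the range stated above.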

Steps to decide which attribute to split:

  1. Compute the entropy for the dataset.

  2. For every attribute:

    • Calculate the entropy of the subset for each categorical value of the attribute.

    • Take the weighted average of those entropies, weighting each subset by its share of the data.

    • Calculate the gain for the current attribute.

  3. Pick the attribute with the highest information gain.

  4. Repeat on each branch until we get the desired tree.

    • A node becomes a leaf when its entropy is zero.
    • Information Gain = Entropy(S) - ∑ (Sb/S)*Entropy(Sb)
    • Sb - subset of the data for one value of the attribute, S - entire data at the node
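The attribute-selection steps above can be sketched as follows. This is an illustration under stated assumptions, not a full ID3 implementation: rows are plain dictionaries, and the toy dataset and the helper names `information_gain` and `best_attribute` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a non-empty sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    """Entropy(S) minus the weighted average entropy of the subsets Sb."""
    subsets = defaultdict(list)
    for row in rows:
        # Group target labels by the value this row has for the attribute.
        subsets[row[attribute]].append(row[target])
    all_labels = [row[target] for row in rows]
    weighted = sum(len(sb) / len(rows) * entropy(sb) for sb in subsets.values())
    return entropy(all_labels) - weighted

def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

print(best_attribute(rows, ["outlook", "windy"], "play"))  # outlook
```

Here splitting on `outlook` leaves two pure subsets (gain 1), while splitting on `windy` leaves both subsets evenly mixed (gain 0), so `outlook` would become the root.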

2. CART Algorithm (Classification and Regression Trees):

  • In CART, we use the Gini index as a metric. The Gini index is used as a cost function to evaluate a split in a dataset.

  • Steps to calculate Gini for a split:

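    • Calculate Gini for each sub-node as the sum of the squared probabilities of success and failure (p² + q²). In this form a higher score means a purer node; the Gini impurity often quoted elsewhere is 1 - (p² + q²).
    • Calculate Gini for the split as the weighted average of the sub-node scores, weighting each sub-node by its share of the samples.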

Choose the split with the higher weighted Gini score (in the sum-of-squares form used here, higher means purer sub-nodes):

  • Split on Gender:

Gini for sub-node Female = (0.2)(0.2)+(0.8)(0.8) = 0.68

Gini for sub-node Male = (0.65)(0.65)+(0.35)(0.35) = 0.55

Weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59

  • Similarly, split on Class:

Gini for sub-node Class IX = (0.43)(0.43)+(0.57)(0.57) = 0.51

Gini for sub-node Class X = (0.56)(0.56)+(0.44)(0.44) = 0.51

Weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51

  • Here the weighted Gini is higher for Gender (0.59) than for Class (0.51), so we split on Gender.
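The worked example can be checked with a short sketch. It assumes the sum-of-squares form of the Gini score used in this article (higher is purer), and the function names `gini_score` and `weighted_gini` are illustrative. The sub-node sizes and success proportions are the ones from the example.

```python
def gini_score(p):
    """p^2 + q^2 for a binary node, where p is the success proportion and q = 1 - p."""
    q = 1.0 - p
    return p * p + q * q

def weighted_gini(nodes):
    """nodes: list of (size, success_proportion) pairs, one pair per sub-node."""
    total = sum(size for size, _ in nodes)
    return sum(size / total * gini_score(p) for size, p in nodes)

# Split on Gender: 10 students at p = 0.2, 20 students at p = 0.65.
gender = weighted_gini([(10, 0.2), (20, 0.65)])

# Split on Class: 14 students at p = 0.43, 16 students at p = 0.56.
school_class = weighted_gini([(14, 0.43), (16, 0.56)])

print(round(gender, 2), round(school_class, 2))  # 0.59 0.51
```

If the impurity form 1 - (p² + q²) were used instead, the ranking would flip direction: the split with the lower weighted value would be preferred, and Gender would still be chosen.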
