
Gini Impurity and Entropy

What will you learn in this section
  • What we mean by node purity in a decision tree
  • How to calculate node purity for a Classification tree using
    • Entropy
    • Gini Impurity
  • How to calculate node purity for a Regression tree using
    • Variance

The Decision Tree algorithm employs recursive partitioning to divide the data into distinct regions, splitting it so that homogeneity increases with each split. The purer the data after a split, the more effective that split is. As we observed in our example, each split results in increasingly homogeneous regions. However, an important question arises: how does the decision tree measure the homogeneity, or purity, of a region?

In this section, we will explore the fundamental concepts necessary to understand the detailed workings of decision trees.
Decision trees are used for both classification and regression problems, and the methods for calculating node purity differ for each type.

Node Purity in a Classification Tree

Classification problems have a discrete dependent variable. The algorithm splits the data so that, after the split, at least one of the newly created regions contains a majority of data points from the same class. Each split in the decision tree aims to separate positive and negative class data points. There are several ways to measure the homogeneity or purity of a region in classification problems; below are some of the most commonly used.

  • Entropy

    Entropy measures the randomness of a node. If a region has an almost equal number of positive and negative class examples, it is said to have high entropy (less pure). On the other hand, if one class is in the majority in a region, the region is said to have low entropy (more pure). The mathematical equation for entropy is shown below:

    \begin{align} \text{entropy} = - \sum_{i=1}^k p_i * \log(p_i) \end{align} \( p_i \) is the predicted probability of class \( i \)
    \( k \) denotes the number of classes.
    For a binary (two-class) problem, the entropy equation reduces to
    \begin{align} \text{entropy} = - p_0*\log(p_0) - p_1*\log(p_1) \end{align} \( p_0 \) is the predicted probability of class 0
    \( p_1 \) is the predicted probability of class 1

    How do we calculate the predicted probability of a class in a region?
    We count the number of data points belonging to each class and divide that count by the total number of data points in the region. [see this for more details]

    Entropy Calculation Example
    Consider a binary class problem with the following cases.

    • Case 1
      Positive class count = 10
      Negative class count = 10
      \( p_0 \) (predicted probability of class 0) = 10/20 = 0.5
      \( p_1 \) (predicted probability of class 1) = 10/20 = 0.5
      \begin{align} \text{entropy} &= -0.5*\log(0.5)-0.5*\log(0.5) \\ \text{entropy} &= 0.693 \end{align}
      We have an equal number of positive and negative class data points, which means this region is not homogeneous and its entropy is high. This is the maximum entropy we can get in a binary class problem (using the natural logarithm, \( \log(2) \approx 0.693 \)).
    • Case 2
      Positive class count = 1
      Negative class count = 19
      \( p_0 \) (predicted probability of class 0) = 19/20
      \( p_1 \) (predicted probability of class 1) = 1/20
      \begin{align} \text{entropy} &= -\frac{19}{20}*\log(\frac{19}{20}) -\frac{1}{20}*\log(\frac{1}{20}) \\ \text{entropy} &= 0.198 \end{align}
      Now the negative class is in the majority and the entropy is low. Lower entropy indicates more homogeneous (purer) data.
    • Case 3
      Positive class count = 19
      Negative class count = 1
      \( p_0 \) (predicted probability of class 0) = 1/20
      \( p_1 \) (predicted probability of class 1) = 19/20
      \begin{align} \text{entropy} &= -\frac{1}{20}*\log(\frac{1}{20}) -\frac{19}{20}*\log(\frac{19}{20}) \\ \text{entropy} &= 0.198 \end{align}
      This case is just the opposite of Case 2, and we get the same entropy value.

    Entropy does not depend on which particular class dominates. If a node contains a majority of data points from a single class (either positive or negative), the entropy value will be low. Cases 2 and 3 are mirror images of each other, yet they yield the same entropy value; the code sketch after this list reproduces these numbers.

    Below is the entropy vs. probability plot for the positive class. The plot is symmetric around the black line (probability = 0.5). In the left (green) region the probability of class 1 is low, which means the probability of class 0 is high; in the right (red) region the probability of class 1 is high, so the probability of class 0 is low.

  • Gini Impurity

    Gini impurity is another method for measuring node purity. Its mathematical formula is given below:

    \begin{align} \text{Gini} &= 1-\sum_{i=1}^k p_i^2 \end{align} \( p_i \) represents the predicted probability of class \( i \)
    \( k \) denotes the number of classes
    For a binary classification problem, Gini impurity ranges over [0, 0.5]. A lower Gini value indicates higher homogeneity, whereas a higher Gini value signifies greater randomness. The plot below illustrates how Gini impurity varies with the probability of the positive class; like the entropy plot, it is symmetric. A short code sketch that evaluates both entropy and Gini impurity on the cases above follows this list.
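
To make the two measures concrete, below is a minimal Python sketch (plain NumPy; the helper names entropy and gini are illustrative, not taken from any particular library) that computes both quantities from the per-class counts of a region and reproduces the three cases worked out above. With the natural logarithm the entropy values are 0.693 and 0.198, and the corresponding Gini values are 0.5 and 0.095.

    import numpy as np

    def entropy(counts):
        """Entropy of a node, computed from per-class counts (natural log)."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()               # class probabilities in the region
        p = p[p > 0]                  # treat 0 * log(0) as 0
        return -np.sum(p * np.log(p))

    def gini(counts):
        """Gini impurity of a node, computed from per-class counts."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        return 1.0 - np.sum(p ** 2)

    # The three cases from the entropy example above
    print(entropy([10, 10]))   # ~0.693, maximum entropy for two classes
    print(entropy([19, 1]))    # ~0.198
    print(entropy([1, 19]))    # ~0.198, same value with the classes swapped

    print(gini([10, 10]))      # 0.5,    maximum Gini for two classes
    print(gini([19, 1]))       # ~0.095
    print(gini([1, 19]))       # ~0.095

Because both functions depend only on the class proportions, swapping the two counts leaves the result unchanged, which is exactly the symmetry visible in the two plots.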

Node Purity in a Regression Tree

In a regression tree, node purity is determined by the squared error between the actual and predicted values. Since the predicted value for a region is the average of all data points within that region, node purity is essentially equivalent to the variance of the data points.
\begin{align} \text{Node Purity} &= \frac{1}{n}\sum_{i=1}^{n} (y_i - y_{predicted})^2 \\ \\ y_{predicted} &= \bar{y} \ \text{(the average value in the region)} \\ \\ \text{Node Purity} &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \text{variance} \end{align} \( y_i \) is the actual value of data point \( i \)
\( n \) denotes the number of data points in the region
A lower variance in a region indicates that the data points are more similar to each other.
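
As a small illustration, here is a sketch in the same style as the classification example above (plain NumPy with made-up target values) that computes this node purity and confirms it equals the variance of the targets in the region.

    import numpy as np

    def node_purity(y):
        """Node purity of a regression node: mean squared deviation from the
        node's own mean prediction, i.e. the variance of y."""
        y = np.asarray(y, dtype=float)
        y_pred = y.mean()                   # the region predicts its average value
        return np.mean((y - y_pred) ** 2)   # same as np.var(y)

    # Hypothetical target values for two candidate regions
    tight_region = [10.0, 10.5, 9.8, 10.2]
    spread_region = [2.0, 15.0, 7.5, 30.0]

    print(node_purity(tight_region))    # small value -> purer node
    print(node_purity(spread_region))   # large value -> less pure node
    print(np.isclose(node_purity(tight_region), np.var(tight_region)))   # True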