What will you learn in this section?
- What is tree pruning?
- Types of tree pruning
- Cost complexity pruning
Decision trees are prone to overfitting. If left uncontrolled, a fully grown tree will keep partitioning the data until every training example is predicted perfectly. While this may seem ideal, it usually means the tree has memorized the training set rather than learned a pattern that generalizes. To prevent this, careful tuning of hyperparameters is required. Some important hyperparameters include:
- Depth of the tree
- min_samples_split
- min_samples_leaf
These hyperparameters regulate the tree's growth: whenever a further split would violate one of these conditions, the algorithm stops splitting that node. Controlling tree growth this way is known as pre-pruning.
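The sketch below shows how these pre-pruning hyperparameters might be set in practice. It assumes scikit-learn and uses a small synthetic regression dataset purely for illustration; the specific values (max_depth=4, and so on) are arbitrary choices, not recommendations.

```python
# Minimal pre-pruning sketch, assuming scikit-learn; data and values are illustrative.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Growth stops whenever a candidate split would violate any of these conditions.
tree = DecisionTreeRegressor(
    max_depth=4,            # limit the depth of the tree
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=10,    # every leaf must keep at least 10 samples
    random_state=42,
)
tree.fit(X_train, y_train)
print("Test R^2:", tree.score(X_test, y_test))
```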
Another approach to controlling tree growth is called post-pruning. In this method, the tree is first allowed to grow fully and is then pruned by removing its weakest links, i.e. splits that do not significantly improve the tree's performance.
This process is known as cost complexity pruning.
Cost Complexity Pruning via Tree Score Computation
For a regression tree, the tree score is calculated using the equation below:
\[
\text{Tree Score} = \text{SSR} + \alpha \times |T|
\]
Where:
- \(\text{SSR}\) = Sum of squared residuals
- \(\alpha\) = Cost complexity parameter (ccp)
- \(|T|\) = Size of the tree (number of leaf nodes)
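As a quick illustration with made-up numbers: suppose a pruned subtree has \(\text{SSR} = 12\) with six leaves, while the larger unpruned tree has \(\text{SSR} = 11.5\) with eight leaves, and \(\alpha = 0.5\). Then

\[
\text{Tree Score}_{\text{pruned}} = 12 + 0.5 \times 6 = 15,
\qquad
\text{Tree Score}_{\text{full}} = 11.5 + 0.5 \times 8 = 15.5,
\]

so the smaller tree wins: the two extra splits do not reduce SSR enough to justify the added complexity.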
As the depth of the tree increases, the total SSR on training data keeps decreasing. However, in some cases, new partitions may not contribute significantly to reducing SSR. At this point, while the decrease in SSR is minimal, the \(\alpha \times |T|\) term continues to grow, eventually causing the overall tree score to increase. All splits occurring beyond this point are pruned to optimize the tree's performance.
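The sketch below shows one way cost complexity pruning might be applied with scikit-learn's cost_complexity_pruning_path and the ccp_alpha parameter. It reuses the same synthetic data as the pre-pruning example above and simply picks the alpha that scores best on held-out data; in practice cross-validation would be a more robust choice.

```python
# Minimal cost complexity (post-)pruning sketch, assuming scikit-learn; data is illustrative.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Compute the effective alphas encountered while pruning a fully grown tree.
full_tree = DecisionTreeRegressor(random_state=42)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per candidate alpha and keep the one that scores best on held-out data.
best_alpha, best_score = 0.0, float("-inf")
for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best ccp_alpha: {best_alpha:.4f}, test R^2: {best_score:.3f}")
```

Larger values of ccp_alpha penalize the number of leaves more heavily and therefore produce smaller trees; ccp_alpha=0 leaves the fully grown tree untouched.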