How to control leaf height and Pruning?
To control the leaf size, we can set the parameters:
Maximum Depth:
Maximum tree depth is a limit to stop the further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.
NEVER use maximum depth to limit the further splitting of nodes. In other words: use the largest possible value.
Minimum Split Size:
The minimum split size is a limit to stop the further splitting of nodes when the number of observations in the node is lower than the minimum split size.
This is a good way to limit the growth of the tree. When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).
Minimum Leaf Size:
The minimum leaf size is a limit to split a node when the number of observations in one of the child nodes is lower than the minimum leaf size.
Pruning is mostly done to reduce the chances of overfitting the tree to the training data and reduce the overall complexity of the tree.
There are two types of Pruning:
Pre-Pruning:
- Pre-pruning is also known as the early stopping criteria. As the name suggests, the criteria are set as parameter values while building the model. The tree stops growing when it meets any of these pre-pruning criteria, or it discovers the pure classes.
Post-Pruning:
- In Post-pruning, the idea is to allow the decision tree to grow fully and observe the CP value. Next, we prune/cut the tree with the optimal CP(Complexity Parameter) value as the parameter.