An ensemble learning method combining boosting and decision trees.
Boosting: A technique that iteratively creates weak models (weak learners), with each subsequent learner correcting the errors of the previous one, thereby improving performance.
Shallow decision trees are created, each of which performs well only on a portion of the data. Boosting improves their overall performance.
Although sensitive to parameter tuning, XGBoost can outperform Random Forest when configured correctly.
Although the weak learners are called "regression trees," the method can be used for both regression and classification tasks.
The less frequent (rarer) an event, the greater its "information content":
$i(x) = -\log_2 p(x)$
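As a quick illustration, here is a minimal Python sketch of this formula (the function name and example probabilities are just for illustration), showing that rarer events carry more information:

```python
import math

def information_content(p, base=2):
    """Self-information i(x) of an event with probability p."""
    return -math.log(p, base)

print(information_content(0.5))   # 1.0 bit   (a fair coin flip)
print(information_content(0.01))  # ~6.64 bits (a rare event carries more information)
```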
Impurity
Entropy (Average Information Content)
A measure of randomness or disorder: the expected value of the information content, weighted by the classes' occurrence probabilities:
$I_H(t) = -\sum_{i=1}^{c} p(i \mid t) \log_2 p(i \mid t)$
($p(i \mid t) = n_i / N$: probability of class $i$ at node $t$)
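A minimal NumPy sketch of this formula, computing a node's entropy from its per-class counts (function name and example counts are illustrative only):

```python
import numpy as np

def entropy(class_counts):
    """Entropy I_H(t) of a node, computed from its per-class counts n_i."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()      # p(i|t) = n_i / N
    p = p[p > 0]                   # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([5, 5]))  # 1.0   -> maximally mixed two-class node
print(entropy([9, 1]))  # ~0.47 -> nearly pure node, lower entropy
```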
Gini Impurity
Borrowed from economics, where the Gini coefficient measures inequality:
$I_G(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2$
A higher value indicates more mixed classes, i.e., higher impurity → poor classification. Other metrics like misclassification rate can also be used.
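The same idea as a small NumPy sketch for Gini impurity (again with illustrative counts only):

```python
import numpy as np

def gini(class_counts):
    """Gini impurity I_G(t) of a node, computed from its per-class counts n_i."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()      # p(i|t)
    return 1.0 - np.sum(p ** 2)

print(gini([5, 5]))  # 0.5   -> maximally mixed two-class node
print(gini([9, 1]))  # ~0.18 -> nearly pure node
```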
Gain
The difference in impurity between nodes before and after splitting. Higher gain indicates a greater reduction in impurity.
Decision Tree Learning Using Gain
For each feature, consider the midpoints of adjacent data points as threshold candidates.
Compute the impurity after splitting at each candidate threshold.
Split at the threshold that reduces impurity the most.
Repeat recursively.
Stop when a node contains too few data points or further splitting is not possible.
$\Delta I_H(t) = I_H(t_B) - \sum_{i=1}^{b} w_i \, I_H(t_{A_i})$
($I_H(t_B)$: impurity before branching; $\sum_{i=1}^{b} w_i \, I_H(t_{A_i})$: weighted average impurity after branching, where $w_i$ is the fraction of samples falling into branch $A_i$)
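As an illustration of this procedure on a single feature, here is a minimal sketch (assuming NumPy and entropy as the impurity measure; names such as `best_split` and the toy data are just for this example) that scans midpoint thresholds and returns the one with the highest gain $\Delta I_H(t)$:

```python
import numpy as np

def entropy(y):
    """Entropy of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(x, y):
    """Find the threshold on a single feature x that maximizes the entropy gain.
    Candidate thresholds are the midpoints of adjacent sorted feature values."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    parent = entropy(y)                      # impurity before branching
    best_gain, best_thr = 0.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        thr = (x[i] + x[i - 1]) / 2          # midpoint candidate
        left, right = y[:i], y[i:]
        # weighted average impurity after branching (w_i = n_i / N)
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - child                # gain ΔI_H(t)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # threshold 6.5 separates the classes perfectly (gain 1.0)
```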
Ensemble Methods combine multiple machine learning models to achieve better predictive performance than any single model.
By integrating predictions from different models, ensemble methods compensate for individual weaknesses, improving overall predictive accuracy and generalization performance.
Generalization Performance
Indicates a model's ability to make accurate predictions on unseen data, not just the training data.
Overfitting
A phenomenon where the model performs well on training data but poorly on test data.
Occurs when the model "memorizes" the training data, becoming unable to generalize to new data.
Analogous to scoring high on regular tests but poorly on mock or entrance exams.
Weak Learners
Simple models with low predictive accuracy on their own, only slightly better than random guessing, such as shallow decision trees.
For example, if a model is trained to identify cats as animals with pointy ears, it might fail to recognize cats with rounded ears.
Strong Learners
Models with higher predictive accuracy than weak learners. Boosting transforms weak learners into a single strong learner system.
For instance, to identify cats, a weak learner that predicts based on pointy ears can be combined with another that identifies eye shape. By refining predictions iteratively, the overall system improves accuracy.
Bagging
An ensemble technique designed to address overfitting. Multiple weak learners are created, and their predictions are aggregated by majority voting (for classification) or averaging (for regression) to enhance generalization performance.
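A minimal bagging sketch with scikit-learn, using its bundled breast-cancer dataset purely for illustration (the `estimator` parameter name assumes scikit-learn 1.2 or later):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Many shallow trees, each fit to a bootstrap sample of the training data;
# their predictions are combined by majority vote.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # weak learner
    n_estimators=100,
    random_state=0,
)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))  # accuracy on held-out data
```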
AdaBoost (Adaptive Boosting)
A representative boosting method. It repeatedly creates weak learners, emphasizing misclassified points by increasing their weights so that later learners focus on them. The final prediction is a weighted majority vote of all weak learners.
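A minimal AdaBoost sketch with scikit-learn on the same illustrative dataset; by default its weak learners are depth-1 decision stumps:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round reweights the training samples so that previously misclassified
# points count more; the final prediction is a weighted vote over all rounds.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))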
GBDT (Gradient Boosting Decision Tree)
Iteratively minimizes the error between the predicted values and the true labels: each new tree is fit to the residual errors (the gradient of the loss) of the ensemble built so far, gradually forming a strong learner. The final model's output is the combined prediction of all learners.
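A minimal GBDT sketch with scikit-learn's GradientBoostingClassifier, again on an illustrative dataset; the hyperparameter values shown are arbitrary, not tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new shallow tree is fit to the residual errors (gradient of the loss)
# of the ensemble built so far; its scaled output is added to the prediction.
gbdt = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbdt.fit(X_train, y_train)
print(gbdt.score(X_test, y_test))
```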
XGBoost (eXtreme Gradient Boosting)
An enhanced version of GBDT with features like regularization and parallel computation for improved performance.
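A minimal XGBoost sketch using the xgboost package's scikit-learn-style API; the regularization and parallelism settings shown are illustrative, not tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient boosting with explicit regularization on the trees and
# parallelized tree construction.
xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    reg_lambda=1.0,   # L2 regularization on leaf weights
    n_jobs=-1,        # use all cores for tree construction
    random_state=0,
)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))
```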
LightGBM is a faster, simpler alternative to XGBoost, offering high generalization performance. While not available in Exploratory, it is widely supported in Python and often serves as the default choice.
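For comparison, a minimal LightGBM sketch with the same scikit-learn-style API (illustrative settings only):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Histogram-based, leaf-wise tree growth keeps training fast on large datasets.
lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=0)
lgbm.fit(X_train, y_train)
print(lgbm.score(X_test, y_test))
```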