A descendant of AdaBoost and classical learning theory, gradient boosting is used to make accurate predictions on instances that a single model handles poorly. Based on the assumption that weak learners can be combined and corrected to get better results, it typically uses decision trees to build its prediction models.
In layman's terms, it is the process of combining many simple models into a single, more accurate one.
Before we get into details, here’s a quick recap of the basic definitions:
Boosting = building models iteratively from weak learners, each correcting its predecessors.
Ensemble = a combination of separate models.

How Gradient Boosting Works:
In gradient boosting, models are added to the ensemble in stages. At every stage, a new weak learner is added to compensate for the shortcomings of the learners already in the ensemble. Those shortcomings are identified with gradients of the loss function.
Gradient boosting is used for:
- Regression
- Classification
- Ranking
The difficulty level increases from regression to classification to ranking.
Let’s try to understand this with the help of an example:
We want to predict a person's age based on their interest in playing PUBG, visiting family, or dancing. The objective of this exercise is to minimize the squared error. We have these 9 training samples:
| S.no | Age | Playing PUBG | Visiting family | Dancing |
| --- | --- | --- | --- | --- |
Intuition tells us:
Young people like to play PUBG.
Middle-aged like to visit their family.
And anyone could like dancing.
A rough estimate of the interest sets will look like this:
And this is what its regression tree will look like:
Measurement of training errors:
| Age | Tree 1 prediction | Tree 1 residual (age − prediction) |
| --- | --- | --- |
We can design tree 2 based on the residuals from tree 1:
| Age | Tree 1 prediction | Tree 1 residual (age − prediction) | Tree 2 prediction | Combined prediction (Tree 1 + Tree 2) | Final residual |
| --- | --- | --- | --- | --- | --- |
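The two-tree correction can be worked through numerically. The ages and splits below are made-up illustrative values (the original table data is not shown), but the arithmetic follows the column headings above: tree 1 predicts a leaf mean, tree 2 is fit to tree 1's residuals, and the combined prediction is their sum.

```python
# Hypothetical ages standing in for the 9 training samples.
ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]

def leaf_mean(group):
    return sum(group) / len(group)

# Tree 1: one split, young (first 3) vs. the rest; each leaf predicts its mean age.
young, rest = ages[:3], ages[3:]
tree1 = [leaf_mean(young)] * 3 + [leaf_mean(rest)] * 6

# Tree 1 residual = age - prediction.
residuals1 = [a - p for a, p in zip(ages, tree1)]

# Tree 2 is fit to those residuals: suppose it further splits the "rest"
# group into middle-aged (next 3) and seniors (last 3).
groups = [residuals1[:3], residuals1[3:6], residuals1[6:]]
tree2 = [leaf_mean(g) for g in groups for _ in g]

# Combined prediction (Tree 1 + Tree 2) and the final residual.
combined = [p1 + p2 for p1, p2 in zip(ages, tree1) and zip(tree1, tree2)]
combined = [p1 + p2 for p1, p2 in zip(tree1, tree2)]
final_residuals = [a - c for a, c in zip(ages, combined)]
```

The squared error of the combined model is far smaller than that of tree 1 alone, which is exactly what adding a residual-fitting tree is supposed to achieve.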
Gradient boosting can be approached in a number of ways. The simplest route takes these 5 steps:
Step 1: Add an additional model h to the existing model F, so the combined model becomes:
F(x) + h(x)
Step 2: To improve the model, require the combined model to match the target at every training point:
F(x1) + h(x1) = y1
F(x2) + h(x2) = y2
...
F(xn) + h(xn) = yn
Step 3: Reframe the equations as:
h(x1) = y1 − F(x1)
h(x2) = y2 − F(x2)
...
h(xn) = yn − F(xn)
Step 4: Fit a regression tree h to the data:
(x1, y1 − F(x1)), (x2, y2 − F(x2)), ..., (xn, yn − F(xn))
Step 5: Keep adding regression trees until the desired result is achieved.
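The five steps above can be sketched from scratch. As a minimal illustration, the weak learner h is a depth-1 "stump" on a single feature, and the data (hypothetical hours of PUBG per week vs. age) is made up for this example:

```python
def fit_stump(xs, residuals):
    """Step 4: fit a tiny regression tree (a single split) to the residuals.
    The stump picks the threshold that minimizes squared error, predicting
    the mean residual on each side of the split."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def predict(base, stumps, x):
    # Step 1: the model is the base prediction plus every added h.
    return base + sum(h(x) for h in stumps)

def gradient_boost(xs, ys, rounds=10):
    base = sum(ys) / len(ys)  # initial model F0: predict the mean age
    stumps = []
    for _ in range(rounds):   # Step 5: keep adding trees
        preds = [predict(base, stumps, x) for x in xs]
        residuals = [y - p for y, p in zip(ys, preds)]  # Step 3: h(x) = y - F(x)
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

# Hypothetical feature (hours of PUBG per week) and target (age):
xs = [30, 25, 20, 2, 1, 0, 0, 1, 3]
ys = [13, 14, 15, 35, 40, 45, 70, 65, 60]
base, stumps = gradient_boost(xs, ys)
```

Each added stump fits the current residuals, so the training squared error can only shrink from round to round.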
The above equation can be refined by noticing that the residual y − F(x) is the negative gradient of the squared error loss L(y, F(x)) = (y − F(x))² / 2 with respect to F(x). So each new tree is really being fit to negative gradients, and swapping in the gradients of a different loss function is what generalizes the method beyond squared error.
Gradient boosting is highly effective in practice, and its most popular implementation is XGBoost. Gradient boosting can further be enhanced by using:
- Tree constraints
- Random sampling
- Penalized learning
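These enhancements map directly onto estimator parameters in common libraries. A sketch using scikit-learn's `GradientBoostingRegressor` (XGBoost exposes analogous knobs such as `max_depth`, `subsample`, and `reg_lambda` for penalized learning); the feature values and ages below are made up for illustration:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical data: hours of PUBG per week vs. age.
X = [[30], [25], [20], [2], [1], [0], [0], [1], [3]]
y = [13, 14, 15, 35, 40, 45, 70, 65, 60]

model = GradientBoostingRegressor(
    n_estimators=50,    # number of boosting stages (trees)
    learning_rate=0.1,  # shrinkage: scale down each tree's contribution
    max_depth=2,        # tree constraint: cap the depth of each weak learner
    subsample=0.8,      # random sampling: fit each tree on 80% of the rows
    random_state=0,
)
model.fit(X, y)
```

Constraining the trees and subsampling the rows both act as regularizers, trading a little training accuracy for better generalization.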