Maths behind XGBoost | XGBoost Algorithm Explained with Data, Step by Step
Summary
TLDR: In this data science tutorial, Aman explains the mathematical foundation of the XGBoost algorithm, an advanced machine learning technique. He begins with an overview of boosting algorithms, emphasizing their sequential learning approach compared to bagging. Aman then delves into XGBoost's unique features like regularization, auto-pruning, and the importance of parameters like lambda, gamma, and eta. Using a simple dataset, he illustrates how XGBoost models are trained iteratively to minimize prediction errors and handle outliers. The video promises a follow-up on implementing XGBoost in Python and exploring its parameters' impact.
Takeaways
- XGBoost is a boosting algorithm that builds models sequentially, with each new model attempting to correct the errors of the previous ones.
- Boosting differs from bagging in that it is a sequential ensemble method, training models one after another, whereas bagging is parallel.
- XGBoost extends the Gradient Boosting algorithm, which uses decision trees and focuses on reducing the residuals of predictions.
- The script uses a simple dataset with age as the independent variable and IQ as the dependent variable to illustrate how XGBoost works.
- Lambda is a regularization parameter in XGBoost that helps control overfitting by adjusting the impact of residuals on the model.
- Gamma is a threshold parameter in XGBoost that controls auto-pruning of the trees, thus preventing overfitting.
- Eta, or the learning rate, determines how quickly the boosting models converge to the final prediction.
- XGBoost creates a base model first, often starting with a simple average, and then fits additional models on the residuals of the previous model's predictions.
- The residuals are used to build subsequent trees in the boosting process, with each tree attempting to minimize the error of the previous model (a minimal sketch of this loop follows the list).
- The script explains how parameters like lambda, gamma, and eta influence the model's ability to handle outliers and prevent overfitting.
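To make the sequential idea concrete, here is a minimal Python sketch of the boosting loop described above. It is not XGBoost itself: the tree is replaced by a single stump on the video's age > 10 split, the ages are hypothetical (the video only states the residuals), and lambda and gamma are omitted since they are covered separately below.

```python
import numpy as np

# Assumed toy data in the spirit of the video: ages are hypothetical;
# IQ values are chosen so the residuals around a base of 30 are
# roughly the video's -10, 4, 8 (the video rounds the mean, 30.67, to 30).
X = np.array([7.0, 12.0, 14.0])   # age (assumed values)
y = np.array([20.0, 34.0, 38.0])  # IQ

eta = 0.3                          # learning rate
pred = np.full_like(y, y.mean())   # model 0: predict the average

for m in range(3):                 # a few boosting rounds
    resid = y - pred               # errors of the current ensemble
    # Stand-in "tree": one stump that outputs the mean residual on
    # each side of the video's split, age > 10.
    leaf = np.where(X > 10, resid[X > 10].mean(), resid[X <= 10].mean())
    pred = pred + eta * leaf       # shrink each tree's contribution by eta
    print(f"round {m + 1}: residuals -> {np.round(y - pred, 2)}")
```

Each printed round shows the residuals shrinking, which is the whole point of sequential boosting.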
Q & A
What is XGBoost?
-XGBoost stands for eXtreme Gradient Boosting. It is an optimized, distributed gradient boosting library designed to be highly efficient, flexible, and portable, and it implements machine learning algorithms under the Gradient Boosting framework.
What is the difference between a boosting algorithm and a bagging algorithm?
-A boosting algorithm is a sequential ensemble technique where models are trained one after another, with each new model trying to correct the errors of the previous ones. In contrast, a bagging algorithm is a parallel ensemble technique where multiple models are trained independently and then combined.
How is XGBoost different from Gradient Boosting?
-XGBoost is an extension of Gradient Boosting. It adds features like regularization, auto-pruning, and a more flexible definition of the objective function, making it more effective at preventing overfitting and handling large-scale data.
What is the role of lambda in XGBoost?
-Lambda in XGBoost is a regularization parameter that controls the complexity of the model. It appears in the denominator of both the similarity score and the leaf output, so increasing it lowers similarity scores and gains (making pruning more likely) and shrinks leaf predictions, which helps control overfitting.
What does gamma represent in the context of XGBoost?
-Gamma in XGBoost is a threshold for minimum loss reduction required to make a further partition on a leaf node of the tree. It helps in controlling overfitting by preventing the model from learning noise from the training data.
What is the significance of the eta parameter in XGBoost?
-Eta, also known as the learning rate in XGBoost, controls how fast the model learns. It scales the contribution of each new tree to the ensemble, which helps prevent overfitting. A smaller eta means a slower rate of convergence.
How does XGBoost handle outliers?
-XGBoost dampens outliers through the lambda parameter. Because lambda sits in the denominator of the leaf output (sum of residuals divided by the number of residuals plus lambda), increasing its value shrinks predictions that would otherwise be dominated by extreme data points.
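A quick sketch of that effect, using the leaf-output formula stated in the video and the residuals 4 and 8 that share a leaf in the example:

```python
# Leaf output as described in the video: sum of residuals / (count + lambda).
residuals = [4.0, 8.0]  # the two residuals sharing a leaf in the example

for lam in [0.0, 1.0, 10.0]:
    output = sum(residuals) / (len(residuals) + lam)
    print(f"lambda = {lam:>4}: leaf output = {output:.2f}")
# lambda=0 -> 6.00 (plain average); lambda=1 -> 4.00; lambda=10 -> 1.00
```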
What is the base prediction model in the given example?
-In the provided example, the base prediction model is the average of the target values: it predicts the same IQ, the average IQ of the students, for every record regardless of age.
How does the residual value change after the first model in XGBoost?
-After the first model, the residual values change based on the difference between the actual values and the predictions made by the model. These new residual values are then used to train the next model in the sequence.
What is the concept of similarity score in XGBoost?
-The similarity score in XGBoost is a measure used to determine the homogeneity of residuals at a node. It is calculated as the square of the sum of the residuals divided by the number of residuals plus lambda. It plays a crucial role in deciding whether to split a node or not.
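As a sketch, the formula can be written as a small helper; the numbers reproduce the root node of the video's example (residuals -10, 4, 8):

```python
def similarity_score(residuals, lam=0.0):
    """(sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

# Root node from the video's example:
print(similarity_score([-10, 4, 8]))          # 2**2 / 3  ~= 1.33
print(similarity_score([-10, 4, 8], lam=1))   # 2**2 / 4   = 1.00
```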
How does XGBoost reduce the residual error?
-XGBoost reduces the residual error by sequentially fitting new models on the residuals of the previous model. Each new model aims to correct the errors of the previous ones, thus gradually reducing the overall residual error.
Outlines
Introduction to XGBoost Mathematics
Aman, a data scientist, introduces the concept of XGBoost's mathematical algorithm. He explains that XGBoost is a boosting algorithm, which is an ensemble technique of sequential learning. Unlike bagging, boosting trains models sequentially, where each model tries to correct the errors of the previous ones. Aman mentions that XGBoost is an extension of the Gradient Boosting algorithm and plans to explain it using sample student data. He highlights the special features of XGBoost, such as regularization, auto-pruning, and the learning rate (eta), which contribute to its effectiveness in reducing errors and improving predictions.
Understanding XGBoost Tree Construction
Aman demonstrates how XGBoost constructs trees by fitting models on residuals. He introduces the concept of similarity score, which includes a regularization parameter lambda, to control overfitting. Aman explains that the tree splitting criteria are based on gain, which is calculated by comparing the similarity score before and after a split. The gamma parameter is used to determine whether a split should occur based on the gain value. He also discusses how lambda affects the similarity score and, consequently, the tree's growth and pruning. Aman emphasizes the importance of lambda in controlling overfitting and managing the impact of outliers on predictions.
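A minimal sketch of that split decision, assuming the video's numbers (residuals -10, 4, 8, the split age > 10, lambda = 0, gamma = 130):

```python
def similarity(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

root, left, right = [-10, 4, 8], [-10], [4, 8]  # split: age > 10
lam, gamma = 0.0, 130.0

# Gain = children's similarity after the split minus the parent's before it.
gain = similarity(left, lam) + similarity(right, lam) - similarity(root, lam)
print(f"gain = {gain:.2f}")                   # 100 + 72 - 1.33 ~= 170.67
print("split" if gain > gamma else "prune")   # gain exceeds gamma: split
```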
Deep Dive into Lambda's Role in XGBoost
Aman further explores the role of lambda as a regularization parameter in XGBoost. He explains that increasing lambda leads to a more aggressive pruning of the tree to control overfitting. Additionally, lambda is used in predictions to neutralize the effect of outliers. Aman illustrates how lambda affects the prediction outcome by adjusting the impact of residuals, thus generalizing the model's predictions. He also touches on the use of the learning rate (eta) in updating predictions, which controls how quickly the model converges to the correct values.
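A short sketch of that update, assuming the video's numbers (base prediction 30, leaf output 6 with lambda = 0, eta = 0.3, and an actual IQ of 34):

```python
base_pred = 30.0   # model-0 prediction (the video rounds the mean to 30)
leaf_out = 6.0     # model-1 leaf output for this record, with lambda = 0
eta = 0.3          # learning rate

new_pred = base_pred + eta * leaf_out
print(new_pred)        # 31.8
print(34 - new_pred)   # new residual: 2.2, down from 4
```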
Wrapping Up and Looking Forward to Implementation
Aman concludes the theoretical discussion on XGBoost's mathematics and invites viewers to ask questions or comment if they need further clarification. He assures that he will respond to comments to help clarify any doubts. Aman also previews the next video, where he will implement XGBoost in Python, demonstrating how different parameters affect the algorithm's performance and speed.
Keywords
XGBoost
Boosting
Ensemble Technique
Regularization
Auto Pruning
Lambda
Gamma
Eta
Residuals
Similarity Score
Highlights
XGBoost is a boosting algorithm and an extension of gradient boosting.
Boosting algorithms train models sequentially to improve results.
XGBoost introduces regularization, auto pruning, and a learning rate for enhanced performance.
The base model in XGBoost starts with the average of the target values.
Residuals are used to fit the next model in the boosting process.
Lambda is a regularization parameter that controls overfitting in XGBoost.
Gamma is a threshold for auto pruning and controlling the tree's growth.
Eta is the learning rate that controls how quickly the ensemble's predictions converge toward the targets.
The similarity score is the squared sum of a node's residuals divided by the number of residuals plus lambda, and it measures the quality of splits.
Gain is the children's combined similarity score minus the parent's; a split is kept only if the gain exceeds gamma.
Increasing lambda can help in reducing the impact of outliers on predictions.
The prediction at a leaf in XGBoost is the sum of its residuals divided by the number of residuals plus lambda.
XGBoost builds trees by iteratively fitting models on the residuals of the previous model.
The learning rate (eta) influences the step size in the direction of the gradient.
XGBoost's ensemble model aims to minimize the residuals and closely approximate the original observations.
The video will continue with an implementation of XGBoost in Python and an exploration of parameter effects (a worked numeric example tying the steps above together follows this list).
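For readers who want the whole walkthrough in one place, here is a compact sketch chaining the video's numbers end to end. The IQ values 20, 34, and 38 are assumptions consistent with the stated residuals, the mean is rounded to 30 as in the video, and lambda = 0, gamma = 130, eta = 0.3:

```python
def similarity(res, lam=0.0):
    return sum(res) ** 2 / (len(res) + lam)

y = [20.0, 34.0, 38.0]           # assumed IQs; base 30 gives residuals -10, 4, 8
base = 30.0                      # base prediction (video rounds the mean)
res = [v - base for v in y]

lam, gamma, eta = 0.0, 130.0, 0.3
left, right = [res[0]], res[1:]  # split on age > 10

gain = similarity(left, lam) + similarity(right, lam) - similarity(res, lam)
assert gain > gamma              # gain ~170.7 beats gamma 130: keep the split

leaf_left = sum(left) / (len(left) + lam)     # -10.0
leaf_right = sum(right) / (len(right) + lam)  # 6.0
preds = [base + eta * leaf_left] + [base + eta * leaf_right] * 2
print([round(v - p, 1) for v, p in zip(y, preds)])  # [-7.0, 2.2, 6.2]
```

The second residual dropping from 4 to 2.2 matches the update worked out in the video; subsequent rounds would shrink the remaining residuals further.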
Transcripts
Welcome to Unfold Data Science, friends. This is Aman here, and I am a data scientist. In this video we will understand the concept of XGBoost mathematics: what is the mathematical algorithm, or the mathematical approach, behind the XGBoost algorithm? This is one topic which was requested by many of you, so I thought of taking a sample dataset and explaining it with the data. Let's start the discussion.
Now, what kind of algorithm is XGBoost, guys? XGBoost is a boosting algorithm. What is a boosting algorithm? By definition, a boosting algorithm is nothing but an ensemble technique of sequential learning. What do I mean by sequential learning? The difference between bagging and boosting is that bagging is a parallel ensemble and boosting is a sequential ensemble, which means in boosting, different models get trained one after another. So the first model gets trained, then the second model, then the third model, and then many models combine to give you a better result.

Now, XGBoost is nothing but an extension of gradient boosting, we can say. For how gradient boosting works, there is a link in the description, and you can see it on the card as well; I have given a detailed description of how gradient boosting works. If you have not watched that video, I advise you to watch it as well. XGBoost is nothing but an extension of gradient boosting, so let us try to understand how XGBoost will work on this data.
I have taken a very simple dataset here, guys. Let us say this is students' data. Here is the age column, which is my x feature; when I say x feature, I mean independent feature. The dependent feature, or target, is IQ: the IQ of a student based on age. I have taken just three data points for simplicity.

How XGBoost is more effective, or how XGBoost gives you better results, is based on certain things that it does. I explained in my last video what the special things about XGBoost are, for example regularization and auto-pruning, the things which make XGBoost a special algorithm. We will try to understand with this example, so I have plotted these three points here.
Now, not only XGBoost but a normal gradient boosting model will also work like this: it will try to create a base model first. What is my job here? My job is to predict IQ based on age, so the first step is creating a base model. What is the base model? Let us take the very simple assumption that the prediction is the average of these three numbers. The average of these three numbers is 30.6; for simplicity I am putting it as 30 here. So let us say this is the average line; this is my model 0, the base model.

My base model is saying that the prediction for all three cases is 30. But my base model will have some errors, right? What are these errors? Obviously, these distances: this is the error for this point, this is the error for this point, and this is the error for this point. So we have certain errors: e1, e2, and e3. The next model, m1, will be fitted on these errors to minimize them. The next model will be fitted on these errors to minimize these errors; this is very important to understand. So when I talk of m1: for m1, the input data will be my independent features, and my target feature will be these errors. Simple. That is exactly what happens in gradient boosting as well. But how is XGBoost different?
In XGBoost you have to understand three things mainly. The first is known as lambda; lambda is nothing but a regularization parameter. The second thing you have to understand in the XGBoost model is known as gamma. Gamma is nothing but a threshold that defines the auto-pruning of the tree, or that controls your overfitting; how, I will tell you. And the third thing you should remember is something known as eta. Eta is XGBoost's learning rate: it tells you how fast you want to converge to the next value. I will give you examples to make you understand what these things are.
Let us see how the models will be fitted on these errors first. That was our model 0; after model 0, which was an average model, we have these residuals. Now, as I told you, our job is to fit a model on these residuals using the independent features. Let us try to fit a normal decision tree first. What I am trying to do here is create something called an XGBoost tree on the residuals. Let me write the residual values here: minus 10, 4, and 8.

Now there is a concept called the similarity score of the residuals at a node. What is the similarity score? I am writing it here: similarity score = (sum of residuals)² divided by (number of residuals + lambda). This lambda is nothing but the regularization parameter which I spoke about some time back; we will try to understand its use.

If we plug in the values here, what is the sum of residuals? 8 plus 4 is 12, minus 10 gives 2, and 2 squared is 4. Divide by the number of residuals, which is 3, plus lambda, and let us put lambda equal to 0 for now; I will tell you what happens if you increase lambda. So it is the squared sum of residuals, 4, divided by 3 plus lambda, starting with lambda at 0. That gives something more than 1; let us call it 1.3. This is the similarity score of the residuals at this particular node.
Now we will define a tree-splitting criterion. For example, let us say I define a splitting criterion saying age greater than 10, just a simple criterion. How many records come to this side? One record. How many records go to that side? Two records. Residual-wise, the one record that goes left is minus 10, and the two records that go right are 4 and 8. Similarly, using the same formula, the similarity score will be calculated at these nodes as well.

What is the similarity score here? The squared sum of residuals divided by the count: for this node it is 100 divided by 1, so 100. What is the similarity score there? The squared sum of residuals, 12 squared = 144, divided by 2, so 72. This is the similarity score here and that is the similarity score there.

What we have to understand is: when the residuals are of opposite signs, the similarity score is lower, because the residuals cancel each other. When the residuals have the same sign, they do not cancel each other, and the similarity score is higher. As you can see, the similarity score is higher in both child nodes.
Now there is a term defined known as gain. How do you define gain? Gain is the similarity score of the branches after the split minus the similarity score of the branch before the split. In this case: 100 plus 72, minus 1.3. So 172 minus 1.3, whatever you get, becomes your gain.

Now comes the use of the parameter I was talking about, known as gamma. When you call the XGBoost algorithm, you supply a value of gamma. Let us say your gamma in this case is 130, just a simple number. Whenever your gamma value is less than the gain value, the split will happen; otherwise the split will not happen. And that is how auto-pruning happens; I was discussing auto-pruning in my last video. So how long this tree grows depends on how much gain it gets by splitting: if the gain is more than the gamma we supply, the split happens; if this is not satisfied, the split does not happen. That is your gamma parameter. A higher gamma means you want to prune the tree with a more aggressive approach; a lower gamma means you want to prune the tree with a less aggressive approach.
Now, what is the use of this lambda here? Two things to understand. We proceeded with lambda equal to 0 as a baseline. What happens if we keep lambda equal to 1? Then the similarity score comes down: if I replace the 0 by 1, the similarity score of 1.3 becomes 1, and similarly at the other nodes the similarity score comes down. Which means: to keep your tree from overfitting, the overall similarity scores come down, which means the gain comes down. And if the gain comes down, your tree gets pruned, right? Here the gain number is big, hence the split is allowed. If the gain number were, instead of roughly 170, just 70, and we are supplying gamma as 130, then this split would not be allowed, which means your tree is pruned.

I am repeating it again, guys: lambda is a very important parameter in XGBoost, known as the regularization parameter. What is the use of the regularization parameter? One use is: if you increase your regularization parameter, you are taking a more aggressive approach to prune your tree, or to control the overfitting of the tree. That is number one. There is one more use of lambda. What is that use?
How will the prediction happen here? Let us say this is the tree we built. How does the prediction happen? It is simply the sum of all the residuals divided by (the number of residuals plus lambda), again. I am talking about: if tomorrow new data comes, how will the prediction happen? Let us say a new kid comes whose age is 11. What will this model, model m1, predict for that kid? It will predict the sum of residuals of whichever branch the record goes to. For example, it comes to this branch: the sum of residuals is 12, the number of residuals is 2, plus lambda. Here also we can play with lambda. For one moment, let us put lambda equal to 0: the prediction in this branch is 6. Next, put lambda equal to 1: what is the prediction in this branch? 12 divided by 2 plus 1, which is 3, so 4. If we increase lambda further, it comes further down.

What is the use of lambda in this case, if you think carefully? The effect of outliers. Another advantage of XGBoost is that it takes care of outliers to some extent. If you increase the lambda value, the impact of outliers on the prediction comes down significantly. These are the uses of your regularization parameter: one is how aggressively you want to control overfitting, how aggressively you want to prune your tree (a similar kind of use as the gamma parameter). And the other use of lambda is how much you want to control the effect of an outlier in your data, how much you want to generalize. If you keep lambda equal to 0, this is nothing but an average: lambda equal to 0 gives just the average as the prediction. If you increase lambda, it is not just the average; it is trying to neutralize the effect of extreme data points. That is gamma and lambda. Now the third parameter, which I am going to tell you about.
New prediction = previous prediction (which is nothing but my model 0 prediction, the base prediction) + learning rate × output. Now let us take the example of the previous tree we created. I am considering the second data point. The second data point has age greater than 10, so it goes to that side, where the output was 6; you can go back and see, the output was 6. What is the previous prediction for this record? It is nothing but 30.6; I am considering 30 for simplicity, the average of the three values.

So: 30 plus learning rate times output. What was the output for that branch? For lambda equal to 0 it was 6, so 6. Now this learning rate is nothing but what I told you in the beginning, the parameter called eta. In XGBoost we typically take eta as 0.3; you can tweak the value between 0 and 1 as well. If I put 0.3, it is 30 plus 0.3 × 6, which is 30 plus 1.8, equal to 31.8. That is the new prediction for this particular record.

When the new prediction comes, the residual value changes automatically. What will the new residual value be? The new residual value is nothing but 34 minus 31.8, which is 2.2. So the new residual here is 2.2. Now the previous residual goes out of the picture, the new residual comes in, and model 2 is trained. This is how, one after another, model 1, model 2, model 3, model 4 will get trained, and in the end we will have reduced residuals. If you see here, previously the residual was 4; now it has reduced to 2.2, and in the next model it will reduce further. This is how XGBoost tries to reduce the residuals and give you a final ensemble model which is very close to the original observations.
So this was all about the mathematics behind XGBoost with simple data. It is a little complex, so I am sure you might have some doubts; write to me in the comments and I will definitely respond. I have tried to explain in a simple way; still, if you feel at some step you wanted to ask something, don't hesitate to write to me. I will see you all in the next video with the implementation of this in Python, where I will also show you how different parameters change things, how this algorithm runs faster, and many other things. I'll see you all in the next video. Till then, all of you stay safe and take care.