XGBoost Regression In-Depth Intuition Explained - Machine Learning Algorithms
Summary
TLDR: In this YouTube tutorial, the host Krishna dives into the workings of the XGBoost Regressor, an ensemble machine learning technique. He explains how decision trees are constructed within XGBoost, detailing the process from creating a base model to calculating residuals and constructing sequential binary trees. Krishna also covers the calculation of similarity weights and information gain, crucial for choosing splits in the tree. The video is educational, aiming to provide in-depth intuition into XGBoost's regression capabilities.
Takeaways
- XGBoost is an ensemble technique that uses boosting methods, specifically extreme gradient boosting.
- The script explains how decision trees are constructed in XGBoost, starting with a base model that predicts the average of the target variable.
- Residuals are calculated by subtracting the base model's output from the actual values; the decision trees are then trained on these residuals.
- The script discusses the calculation of similarity weights in XGBoost, a key component in determining how to split nodes in the trees.
- Lambda is introduced as a hyperparameter that shrinks the similarity weight as it grows, thus controlling the complexity of the model.
- The process of calculating information gain for different splits is detailed, which determines the best splits for the decision trees.
- The script walks through an example of constructing a decision tree using the 'experience' feature and comparing it with other potential splits.
- The concept of gain, computed from the similarity weights of a split's child and parent nodes, is used to decide which features and split points to use.
- The script mentions the use of multiple decision trees in XGBoost, each contributing to the final prediction with a learning rate (alpha) applied.
- The role of hyperparameters like gamma in post-pruning to prevent overfitting is discussed, highlighting the importance of tuning these parameters for optimal model performance.
Q & A
What is the main topic of the video?
-The main topic of the video is an in-depth discussion on the XGBoost Regressor, focusing on how decision trees are created in XGBoost and the mathematical formulas involved.
What is XGBoost and what does it stand for?
-XGBoost stands for eXtreme Gradient Boosting. It is an ensemble machine learning algorithm that uses a boosting technique to build and combine multiple decision trees.
What is the role of the base model in XGBoost?
-The base model in XGBoost is used to calculate the average output, which serves as the initial prediction before any decision trees are applied. It helps in computing residuals that are used to train the subsequent decision trees.
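A minimal sketch of this step in Python, using the salary figures from the video's worked example (the walkthrough rounds the mean to 51k):

```python
import numpy as np

# Salaries (in $1000s) from the video's example, after the 41k -> 42k change.
salary = np.array([40.0, 42.0, 52.0, 60.0, 62.0])

# The base model predicts the mean of the target for every record.
base_prediction = salary.mean()   # 51.2; the video rounds this to 51

# Residuals (errors) are what the first decision tree is trained on.
residuals = salary - 51.0         # [-11., -9., 1., 9., 11.]
print(base_prediction, residuals)
```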
How does the script define 'residual' in the context of XGBoost?
-In the context of XGBoost, 'residual' refers to the difference between the actual value and the predicted value by the base model. It represents the error that the model is trying to correct.
What is the significance of the lambda parameter in XGBoost?
-The lambda parameter in XGBoost is a regularization hyperparameter that controls the complexity of the model. It sits in the denominator of the similarity weight, so increasing lambda shrinks the similarity weights and, in turn, the gains that determine how nodes in the decision trees are split.
What is the purpose of calculating similarity weight in XGBoost?
-The similarity weight in XGBoost measures how homogeneous the residuals in a node are. It is computed for a candidate split's left child, right child, and parent node, and the split whose children's combined similarity weights exceed the parent's by the largest margin (the gain) is chosen.
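In formula terms, as the video defines them (lambda is the regularization hyperparameter, r_i the residuals in a node, and n their count):

```latex
\text{Similarity weight} = \frac{\left(\sum_{i} r_i\right)^2}{n + \lambda},
\qquad
\text{Gain} = \text{SW}_{\text{left}} + \text{SW}_{\text{right}} - \text{SW}_{\text{root}}
```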
How does the script describe the process of creating a decision tree in XGBoost?
-The script describes the process of creating a decision tree in XGBoost by first creating a base model, then calculating residuals, and using these residuals to determine the best splits for the decision tree nodes based on similarity weight and gain.
What is the role of the learning rate (alpha) in the XGBoost algorithm?
-The learning rate (alpha) in the XGBoost algorithm determines the contribution of each decision tree to the final prediction. It is used to control the impact of each tree, allowing the model to update predictions incrementally.
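A minimal sketch of that update, using the numbers from the video (base output 51, leaf output -10 for the first record, learning rate 0.5):

```python
base_output = 51.0    # base model prediction (average salary, rounded)
leaf_output = -10.0   # average residual of the leaf this record lands in
alpha = 0.5           # learning rate assumed in the video

# Each tree nudges the prediction by alpha times its leaf output.
updated = base_output + alpha * leaf_output   # 51 + 0.5 * (-10) = 46.0
print(updated)
```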
What is the purpose of the gamma parameter mentioned in the script?
-The gamma parameter in the script is used for post-pruning in the decision trees. If the information gain after a split is less than the gamma value, the split is pruned, which helps in preventing overfitting.
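A minimal sketch of that pruning rule, with the threshold and gain from the video's example:

```python
gamma = 150.5        # pruning threshold (hyperparameter) from the video
split_gain = 140.0   # information gain of the candidate split

# If gain minus gamma is negative, the split is cut (post-pruning).
if split_gain - gamma < 0:
    print("prune this split")   # 140 - 150.5 = -10.5, so the branch is removed
else:
    print("keep this split")
```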
How does the script explain the process of calculating the output for a decision tree in XGBoost?
-The script explains that the output for a decision tree in XGBoost is calculated by taking the average of the residuals that fall into a particular node, and then using this output to update the predictions for the corresponding records.
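A sketch of that leaf-output rule as stated in the video. Note this is a simplification: for squared-error loss, the standard XGBoost leaf weight is the residual sum divided by the residual count plus lambda, not a plain average.

```python
def leaf_output(residuals):
    # The video's rule: plain average of the residuals in the leaf.
    return sum(residuals) / len(residuals)

print(leaf_output([-11, -9]))   # -10.0, the (corrected) output of the left leaf
```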
What is the final output of the XGBoost model as described in the script?
-The final output of the XGBoost model is the sum of the base model output and the outputs from each decision tree, each multiplied by their respective learning rates (alphas).
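Written out, with F0 the base-model output, T_i(x) the i-th tree's output for a record x, and alpha_i its learning rate:

```latex
\hat{y}(x) = F_0 + \alpha_1 T_1(x) + \alpha_2 T_2(x) + \dots + \alpha_n T_n(x)
           = F_0 + \sum_{i=1}^{n} \alpha_i T_i(x)
```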
Outlines
Introduction to XGBoost Regressor
The speaker, Krishna, introduces the topic of the video, which is an in-depth exploration of the XGBoost Regressor. He explains that the video will cover the creation of decision trees within XGBoost and the mathematical formulas involved. Krishna references his previous video on the XGBoost classifier and encourages viewers to watch it for context. The video then works through a problem statement with three columns: experience, gap, and salary (the target). The aim is to demonstrate how decision trees are constructed in XGBoost, starting with a base model that predicts the average salary.
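Reconstructed from the transcript, the worked dataset appears to be the following (the 'gap' values in the last three rows are inferred from the walkthrough, and the second salary is changed from 41k to 42k mid-video to simplify the arithmetic):

| Experience (years) | Gap | Salary |
|--------------------|-----|--------|
| 2.0 | Yes | 40k |
| 2.5 | Yes | 42k |
| 3.0 | No | 52k |
| 4.0 | No | 60k |
| 4.5 | No | 62k |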
Constructing Decision Trees in XGBoost
Krishna explains the process of creating sequential decision trees in XGBoost, beginning with the calculation of residuals against the average salary. He then discusses how the first decision tree is constructed using the features 'experience' and 'gap', with the residuals as the training target. The video illustrates how to split the data at the root node on the 'experience' feature, creating two branches: one for values less than or equal to 2 and another for values greater than 2. Krishna also introduces the concept of similarity weight, calculated as the square of the summed residuals divided by the number of residuals plus a hyperparameter called lambda. The goal is to find the split that maximizes the gain, which is the sum of the similarity weights of the left and right nodes minus the similarity weight of the root node.
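A hedged sketch of that split comparison in Python, with lambda = 1. The totals here are recomputed exactly, so the first gain (~89.13) differs from the video's hand total of 93.84, which rests on a small slip (121/2 is 60.5, not 65.5); either way, the split at 2.5 wins.

```python
def similarity_weight(residuals, lam=1.0):
    # Video's definition: (sum of residuals)^2 / (count + lambda).
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam=1.0):
    # Gain = SW(left) + SW(right) - SW(root).
    return (similarity_weight(left, lam) + similarity_weight(right, lam)
            - similarity_weight(left + right, lam))

# Candidate split 1: experience <= 2 vs > 2
print(gain([-11.0], [-9.0, 1.0, 9.0, 11.0]))   # ~89.13

# Candidate split 2: experience <= 2.5 vs > 2.5 -- the higher gain wins
print(gain([-11.0, -9.0], [1.0, 9.0, 11.0]))   # ~243.42
```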
Calculating Information Gain and Decision Tree Output
The speaker continues by calculating the information gain for different splits in the decision tree, comparing the gains to determine the most effective split. He emphasizes choosing the split that yields the highest gain. Krishna then explains how to calculate the output of the decision tree nodes, using the average of the residuals in each node. The video demonstrates how the base model output is combined with the decision tree output, scaled by a learning rate, to produce updated predictions. The process is repeated for each record, and Krishna corrects a sign mistake made along the way (the left leaf's output is -10, not 20), keeping the demonstration accurate.
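After the first tree, predictions are updated and residuals recomputed before the second tree is trained. A sketch using the updated predictions the video arrives at (the last two figures, 62 and 63, are the approximate values mentioned in the walkthrough):

```python
actual  = [40.0, 42.0, 52.0, 60.0, 62.0]
updated = [46.0, 46.0, 53.5, 62.0, 63.0]   # base 51 + 0.5 * leaf output

# These residuals become the training target for the second tree.
residual_2 = [a - p for a, p in zip(actual, updated)]
print(residual_2)   # [-6.0, -4.0, -1.5, -2.0, -1.0] -- shrinking toward zero
```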
Advanced XGBoost Techniques and Conclusion
In the final part of the video, Krishna discusses advanced techniques in XGBoost, such as the use of the gamma parameter for post-pruning to prevent overfitting. He explains that if the information gain after a split is less than the gamma value, the split is pruned. Krishna summarizes the process of constructing an XGBoost model, which involves creating multiple decision trees and combining their outputs with the base model output, each multiplied by a learning rate. He concludes by encouraging viewers to subscribe to the channel for more content and thanks them for watching.
Keywords
XGBoost Regressor
Boosting Technique
Decision Tree
Residuals
Similarity Weight
Lambda
Information Gain
Hyperparameter Tuning
Learning Rate
Pruning
Highlights
Introduction to XGBoost Regressor and its ensemble technique using boosting.
Explanation of how decision trees are created in XGBoost.
Walkthrough of the mathematical formulas involved in XGBoost regression.
Overview of the base model creation process in XGBoost.
Calculation of residuals based on the base model output.
Introduction to the concept of similarity weight in XGBoost Regressor.
Calculation of similarity weight for different splits in the decision tree.
Explanation of information gain in the context of XGBoost Regressor.
Demonstration of how to choose the best split point in a decision tree.
Construction of a binary tree with experience as a continuous feature.
Use of lambda as a hyperparameter to adjust similarity weight.
Calculation of the output for each node in the decision tree.
Description of how to pass a record through the decision tree to get the output.
Introduction to the concept of learning rate in the context of XGBoost.
Explanation of how residuals are recalculated after each iteration.
Discussion on the role of gamma as a hyperparameter for post-pruning in XGBoost.
Final output calculation using the XGBoost formula with multiple decision trees.
Encouragement for viewers to subscribe for more in-depth tutorials on XGBoost.
Transcripts
hello all my name is krishna welcome to
my youtube channel so guys today in this
particular video we are going to discuss
about xg boost regressor we'll try to
understand how the decision tree
is actually created in xgboost and apart
from that we'll also try to see
the math formulas that are actually
involved in xg boost now this will be an
xg boost regression in-depth intuition
my previous video i had already covered
with respect to the xgboost classifier
if you have not seen that guys please go
ahead and see that again the link of
that particular video will be given in
the description
okay now to begin with what i'm going to
do is that again if you don't know about
xgboost it is an ensemble technique it
uses boosting technique
and there are many other algorithms in
boosting technique like
adaboost gradient boosting extreme gradient
boosting i've covered all these three
and again there are some more boosting
techniques which i'm going to cover in
the upcoming sessions
so to begin with i'm going to actually
take this simple problem statement
this paper i have actually done all the
computation and
suppose if i miss out any calculation
i'll just verify it from here okay
so in this particular problem statement
i have three features
like experience gap and salary
so this basically says that a person if
he's having an experience of two years
and suppose if he has gap
gap is basically a categorical feature
if he has a gap the salary will be
somewhere around 40k
okay it may be in dollars then 2.5 yes
41k if he does not have gap usually
based on experience the salary will
become more
okay so this is the complete data set i
have okay
if you remember if you have gone through
my xg boost classifier
you know there i've actually shown you
how to construct the decision tree
similarly i'll try to show you how you
can construct a decision tree in xgboost
regressor
so let's begin now over here always
remember
as in xgboost what we do is that we
create sequential decision trees
so first of all we'll try to create a
base model this base model
will actually give you the output what
kind of output it will give
so based on this average salary suppose
uh by default you know the first base
model will take the average of all this
salary
40k 41k 52k 60k and 62k
if i calculate the average of this the
average salary will be somewhere on 51k
okay then based on this 51k i will try
to find out the residual
okay this will be my residual one
because i'm finding out for the first
residual
and if i whenever i talk about boosting
techniques guys the decision tree gets
trained
based on this particular residual values
so i'll try to subtract 40 k
minus 51k so this will be nothing but 9k
sorry minus 9k then if i sorry it should
not be 9
minus 9 but instead it should be minus
11k
right if i subtract 41 with 51 it should
be minus 10
right if i subtract 52 with 51 it should
be 1.
if i subtract 60 with 51 it should be 9
and if i subtract 62 uh with 51
then at that particular point of time it
should be actually
11 because 62 minus 51 is 11
okay now these are all my values with
respect to residual ones
let me consider guys one thing let me
just make a small change over here
instead of writing 41k
i will try to write 42k okay
just a small change this just to make my
calculation easier
but again it will not matter if you also
write 41k
now if i try to find out the difference
this will be somewhere around
minus 9k just to make my calculation easier i
have just written it as 42k
now what we will do in the xgboost
regressor we have created the base model
the base model output is 51k based on
that we have got the residuals residual
basically means error
right now in the first decision tree
that is basically getting created
we are going to take this independent
feature experience gap and salary
sorry not salary experience gap and
residual
so first of all we will try to take the
root node we will try to see in which
node we will try to divide
okay so suppose if i take experience
now in experience okay i know all my
output values what are my output values
it is minus 11k then it is minus 9
then it is 1 then 9 and 11
right i have this and remember
experience is a continuous feature right
it is not a categorical feature it is a
continuous feature two two point five
three four four point five
so in a continuous feature how a
decision tree is basically constructed
suppose
every time remember in xgboost we
create binary trees only
so here i will try to create a binary
tree
the first tree over here i'll say that
the condition will be less than or equal
to the first record that is 2
okay then in my next record i will say
this will be greater than
2. okay so i have actually taken this it
will be less than or equal to 2 it will
be greater than 2
that basically means the first record
will go over here remaining all the
records will go in this side
right now if i try to get all the values
over here
right what will be the values if i have
less than or equal to
2 that is only this one residual so here
i will be getting minus 11
and for the remaining i will be getting minus
9 1
9 comma 11. so all these values will be
coming over here which is greater than 2
pretty much simple pretty much easy okay
this is one way
you can also take like this first of all
you can take first two records you can
calculate the average
then less than or equal to average all
the records you put it on the left hand
side
otherwise you put it on the right hand
side but here what i've done is that
i've taken the first record i've written
less than or equal to two
made it as one branch the other branch
will be this now this is the first step
now in the second step we basically
calculate
something called as similarity weight
now if you remember in xgboost
classifier
the similarity weight was defined as
summation of residual square
in xgboost classifier again i'm telling
you guys in the classifier it was
summation of residual square divided by
summation of probability
multiplied by 1 minus probability and this
probability used to be 0.5
now in case of xgboost regressor this
formula will change little bit
okay here we'll basically say that it is
summation of residuals whole square
divided by number of residuals
plus a hyper parameter which
is called as lambda
okay this lambda value will definitely
if we increase the lambda value this
will decrease the
similarity weight in short okay so this
can be treated as a hyper parameter
so let me go ahead and let me start
computing the similarity weight
in xgboost regressor also we create we
calculate the similarity weight
now suppose for this we'll try to
calculate the similarity weight
because i need to basically understand
if i am taking experience as my root
node how do i have to split
right now i've just taken the first
record in the left hand side and
remaining greater than 2 in the right
hand side
right whether this record is the best
record to split
right for that we'll calculate the
similarity weight after that we'll
calculate the information gain
or we'll simply call it gain okay so here
we'll try to calculate the similarity
weight okay so here some residual square
i have only one residual so this will
become 121
because minus 11 whole square is
nothing but 121
number of residual is again 1 plus in
this particular case let me consider
that i am going to take lambda as 1
so if i try to write 121 divided by 2
this would be nothing but
65.5 so my
similarity weight in this case is 65.5
so i've calculated
it i'll write it over here so my
similarity weight
is 65.5 remember my lambda is equal to
1.
now similarly i will try to calculate
the similarity weight for
this how will i calculate i'll take all
the residual
summation of this minus 9 plus 1
plus 9 plus 11 whole square
divided by how many numbers are there it
is basically 4
4 plus 1 right now if i try to compute
this
right and this 9 and this 9 will get
cancelled 12 whole square is nothing but 144
divided by 5 right so if i try to do
this
ok if i do this division
i will be getting
28.5
so i will get this similarity
weight as 28.5
if there is little bit mistake don't
worry okay just calculate the
similarity weight it will be 28.5
if it is getting coming some other value
you can let me know in the description
so i'm going to just going to write 28.5
so here i'm going to write 28.5
now the third thing that i'm going to do
is basically
i also have to compute the similarity
weight of the root
right this root so here i will be taking
again
minus 11 minus 9 9 11 this will get
cancelled
only 1 will be remaining so this will be
1 square divided by how many numbers i
have 5
5 plus 1 so this is nothing but 1 by 6
1 by 6 is nothing but if i say that it
is somewhere around
0.16
okay so i'm going to take this 0.16 so
i will write over here my similarity
weight
is nothing but 0.16
now if i really want to calculate the
information gain or again i'll say
then we have to take this left
similarity weight
add it with right similarity weight
subtract with the root similarity weight
okay so here if i try to calculate this
it is 65.5 plus 28.5 which is 94
okay and here if i try to subtract 94
minus
0.16 you know so this will be somewhere
around
93.84 so the
total gain
that i am getting from this split is
basically 93.84
so here i'm just going to note down this
one and i'll say that the total gain
that i'm going to get is
93.84 okay so this is my
gain now similarly what i'll do
i have done this now i'll go with the
next split in the next split i'll say that
okay
see guys with one
split we got the gain 93.84 so let me
just write it down somewhere here
okay so for the first record split i
got somewhere around 93.84
okay so i hope i'm writing it right okay
now
in my second split what i'm going to do
is that i'm going to just drop this
and now i'll go to my second record
i'll go to my second record now second
record is nothing but 2.5 so here i'll
say that
less than or equal to 2.5 and greater
than
2.5 now when i do this less than or
equal to 2.5 how many records are there
one 2 so here i am going to write minus
11 minus 9
okay here again there will be 3 records
1 9 11
okay 1 9 11 now when we try to calculate
the similarity weight again over here it
will be minus 11 minus 9 whole square
divided by 2 which is nothing but 400
sorry 2 plus 1 right because the lambda value
is also there
so 400 divided by 3 this will be
somewhere around
133.33 so
this similarity weight over here
will be 133.33
okay then for this similarity weight
again i will try to
add up all these things 1 plus 9 is 10
10 plus 11 is 21
21 whole square will be 441 441 divided
by 3 plus 1 which is nothing but
4 okay so if i write 441 divided by 4
it will be nothing but
110.25 so i'm going to take
this similarity weight as
110.25
okay now i'll try to i know my root
similarity
now if i really want to find out the
gain it will be 133.33
plus 110.25 minus
0.16 okay so once i do this
probably how much it will come so let me
add it for you
133.33 110.25
okay so this will be 243.58
and if i try to subtract with 0.16
this will be nothing but 243.42
so here the total gain that i will be
getting is
243.42 and obviously this gain is
better than this gain so i'll definitely
not do the split from here
i'll do the split from here and
similarly i'll try to compare with all
the splits
and suppose i find out that i have to do
the split from this record
i will take this and i will try to
create it now once i decided okay this
is the split that i have to use
after this what i'll do is that i will
go with the next category feature
now suppose over here i go with gap as
yes and no
in gap if it is yes suppose this is yes
right and this is no in yes because both
these values are in yes so this becomes
the root node so i will not go with this
splitting
right then next with this splitting i'll
go with gap
again i'll write yes or no so how many
in this records which is greater than
2.5 there are two nodes
two nodes over here right two nodes and
these are nothing but one and nine
so i'll write it as one and nine here i
will write it as
um remaining one is eleven then again
i'll try to calculate the similarity
score
i'll try to calculate the gain okay now
suppose if i go with this split right
now
okay and then i'll try to compare like
whichever will be having the highest
gain i'll be taking that
now suppose this is the overall tree
structure that has been created for the
first tree
how will i calculate the output now
suppose in this particular path if it is
going the output over here
will be the average of this value so
minus 11 minus 9 is nothing but 20
20 divided by 2 will be nothing but 10
so this output of this node
will be actually 20. so here i'm going
to write it as
output as 20 okay
in this particular output since this is
a single node it will be 11
in this it will be the 5 because this
average is nothing but 5
right when i have done this when i have
done this now suppose
i take this record and i try to
calculate what will be the output now
right i know that first
after this base model then will
concatenate basically will
then pass through this tree only after
the base model
so as soon as i pass this record suppose
i am passing this first record
first of all it will go to the base
model the base model will be giving me
the value as 51
because this is 51k then suppose i
consider my learning rate parameter
that is alpha is 0.5 okay now this 0.5
multiplied by the output of this tree
now suppose experience is 2 that basically means
it will come here
right this path will be taken here and
once we find out the output is 20
so here i will write it as 20 right
this is with respect to one decision
tree like this i can create any number
of decision trees like i can create
with the help of gap as my root node
with the help of experience on some
other records
right i can create multiple decision
tree so this will be my
alpha t one then again i can have alpha
this can be alpha one t one this can be
alpha two t
2 alpha 3 t 3 like this i can have any
number of trees
now in this particular case i have since
i have constructed only one decision
tree
i am just going to use this so what will
be the value 51 plus
0.5 into 20 is nothing but 10.
sorry the output uh sorry this output
will be
i'm extremely sorry guys this output
will be minus
minus 20 by 2 minus 10 okay
i made one mistake the average output
for this will be minus 10
okay minus 10 so here i will write it as
0.5
multiplied by minus 10 right
when i subtract it it will be nothing
but minus 0.5
sorry i mean minus 5 okay
this will be minus 5 so here my total
value will be 46 that basically means
once i pass this particular output my
real output now
my real output now will be 46
then again if i try to pass with respect
to 42k it will again pass in this
path
again this will also become 46 right
plus 0.5
minus multiplied by minus 10 similarly
we have to pass all these particular
records
calculate our value suppose this value
is somewhere around for this value let
me compute okay what will be this value
okay see 3 is basically passing
through this root
node right now the gap is no so no is
basically getting passed over here
the output is 5 so i will write 51
k plus 0.5 which is my learning rate
multiplied by 5
okay and here i can write 51 plus
0.25 right sorry it should
not be 0.25 it should be
2.5 so here i will be having
53.5 so this will be my output over here
right and similarly i will be trying to
write all these values i'm just going to
rub this guys
i'm just going to rub this okay
similarly i'm going to find out like
this so it'll be here will be 62
and here probably will be 63. then again
i'll try to calculate my residual 2
some value will be again coming now in
my next iteration i'll create one more
tree
where i'll be taking this two input
features and this will be my dependent
feature and again i'll be trying to
create a decision tree
this will be my second decision tree my
first decision tree is basically getting
created here
i have this particular output like this
over here and after that this will be my
second decision tree after this i will
try to add it again my formula will be
like what
my base model output right plus
alpha 1 t 1 plus alpha 2
t 2 like this plus alpha and
t n and like this will be my complete
xgboost output that will be coming
up you know
and that is how it works now there
is also one more parameter which is
called as gamma
gamma basically says that suppose if i
say gamma is 140
or let me just write it as okay gamma is
somewhere around 150.5
and suppose after this split the
information gain i got it as 140.
if i try to subtract 150 with 140 uh if
i
try to subtract this 140 with 150.5
if it is negative value then we can
post prune this
we can prune basically we can cut that
particular tree if it is a positive
value
it will be no we should not prune it and
again this is the kind of hyper
parameter
that we can set for doing the post
pruning technique only when it is trying
to get over fitted
but most of the scenarios some value of
alpha or gamma will be actually set
the default value you can again see in
the sklearn wrapper also
in that particular library some default
value will be getting set
so i hope you understood how xgboost
regressor works
please do subscribe the channel if you
have not already subscribe i'll see you
in the next video
have a great day ahead thank you bye bye
See More Related Videos
Tutorial 43-Random Forest Classifier and Regressor
Maths behind XGBoost|XGBoost algorithm explained with Data Step by Step
CatBoost Part 2: Building and Using Trees
XGBoost's Most Important Hyperparameters
Konsep memahami Algoritma C4.5
Insurance Fraud Detection using Machine Learning | 11 ML Algorithms Used to Identify Insurance Fraud