ADABOOST: GEOMETRIC INTUITION LEC # 425
Summary
TL;DR: In this lecture on applied data science, the instructor discusses AdaBoost, a popular boosting algorithm, and provides a geometric interpretation of how it works. AdaBoost, commonly used in computer vision and image processing, increases the weights of misclassified data points in each iteration, thus adapting to errors. The lecture compares AdaBoost with Gradient Boosting, highlighting their differences and similarities. AdaBoost is explained step by step using a toy example, emphasizing its effectiveness in face detection, while noting that Gradient Boosting is more widely used in general-purpose machine learning tasks.
Takeaways
- 📚 AdaBoost is a popular boosting algorithm, also known as Adaptive Boosting, often compared with Gradient Boosting but with key differences.
- 🔍 AdaBoost is widely used in computer vision and image processing applications, especially in face detection, although it can be used in other fields.
- 🧠 The core idea of AdaBoost is to adaptively give more weight to misclassified points during each iteration of the model training process.
- 🌟 The process starts with a decision tree, typically a decision stump (a shallow tree with depth = 1), to classify the data into different regions.
- 📈 After each round of training, the misclassified points are given higher weight, which influences the next model to focus on correcting these errors.
- 📊 The final AdaBoost model is a combination of several weak classifiers (decision stumps), each weighted by how well they perform on the data.
- 🎯 The key difference between AdaBoost and Gradient Boosting is that AdaBoost assigns higher weight to misclassified points, while Gradient Boosting uses the negative gradient of the loss function.
- 💡 In AdaBoost, weights are updated exponentially for the misclassified points to emphasize their importance in the subsequent rounds of training.
- 🚀 AdaBoost’s effectiveness is demonstrated in tasks like face detection, but for general-purpose machine learning, Gradient Boosted Decision Trees (GBDT) are often more widely used.
- 🛠️ AdaBoost is available in libraries like Scikit-learn, and its variations can be found in other boosting algorithms (a minimal usage sketch follows this list).
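As the last point notes, AdaBoost ships with scikit-learn. Below is a minimal sketch of fitting it with decision stumps; the synthetic dataset and hyperparameter values are illustrative choices, not from the lecture:

```python
# Minimal AdaBoost usage sketch with scikit-learn.
# The dataset and hyperparameters are illustrative, not from the lecture.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision stumps (max_depth=1) are the classic AdaBoost weak learner.
# Note: this parameter is named base_estimator in scikit-learn < 1.2.
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```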
Q & A
What is AdaBoost, and how does it differ from gradient boosting?
-AdaBoost, also known as Adaptive Boosting, is a popular boosting algorithm similar to gradient boosting, but with key differences. AdaBoost focuses on adapting to errors by increasing the weights of misclassified points, whereas gradient boosting uses pseudo-residuals computed from the negative gradient of the loss function.
What is a decision stump in the context of AdaBoost?
-A decision stump is a weak learner in AdaBoost that is essentially a decision tree with a depth of one. It creates a simple model, usually represented as a hyperplane parallel to the x or y axis, which separates data into two classes.
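A minimal sketch of a stump, assuming scikit-learn's DecisionTreeClassifier with max_depth=1 (the four toy points are made up for illustration):

```python
# A decision stump: a depth-1 decision tree, i.e. a single axis-parallel split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])  # toy points
y = np.array([0, 0, 1, 1])

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
# The single root split acts as a hyperplane parallel to one of the axes:
print("split on feature", stump.tree_.feature[0],
      "at threshold", stump.tree_.threshold[0])
```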
How are misclassified points handled in AdaBoost?
-In AdaBoost, misclassified points are given more weight in the next round of training. This is done by either up-sampling the misclassified points or explicitly assigning higher weights to them. The goal is to make the next model focus more on correcting these errors.
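One way to realize "assign higher weights" is sketched below with scikit-learn's sample_weight. The doubling factor is a placeholder to show the mechanics (real AdaBoost scales weights exponentially by alpha, as discussed later), and the data is synthetic:

```python
# Upweighting misclassified points between rounds (mechanics sketch only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.full(len(X), 1.0 / len(X))            # start with uniform weights
stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)

miss = stump.predict(X) != y                 # which points did we get wrong?
w[miss] *= 2.0                               # placeholder; AdaBoost uses exp(alpha)
w /= w.sum()                                 # renormalize to a distribution

# The next stump trains on the reweighted data and so focuses on the errors.
stump2 = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
```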
What happens at each stage of AdaBoost when a new model is trained?
-At each stage of AdaBoost, a new weak learner (like a decision stump) is trained on the weighted dataset. The weights of misclassified points from the previous round are increased, and the new model attempts to classify them correctly.
What role does the weight (Alpha) play in AdaBoost?
-The weight (Alpha) in AdaBoost determines the influence of each weak learner on the final model. The weight is calculated based on how well the model performs on the training data, with lower error rates resulting in higher Alpha values.
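The lecture quotes alpha values (0.42, 0.65) without deriving them. In the standard AdaBoost formulation, the weight of the round-$t$ learner with weighted error $\varepsilon_t$ is:

```latex
\alpha_t = \frac{1}{2}\,\ln\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)
```

Assuming the toy dataset has 10 equally weighted points (an assumption taken from the original Penn example) and H1 misclassifies 3 of them, then $\varepsilon_1 = 0.3$ and $\alpha_1 = \tfrac{1}{2}\ln(0.7/0.3) \approx 0.42$, matching the value quoted in the lecture. Lower error yields a larger alpha, as the answer says.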
How does AdaBoost combine multiple weak learners?
-AdaBoost combines multiple weak learners by weighting their predictions using their respective Alpha values. The final model is a weighted sum of the individual models' predictions, allowing it to make more accurate classifications.
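In symbols, with $h_t$ the round-$t$ weak learner and $\alpha_t$ its weight (standard notation, not written out explicitly in the lecture), the final classifier is:

```latex
F(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)
```

In the lecture's toy example, $T = 3$ and the sum is $\alpha_1 H_1 + \alpha_2 H_2 + \alpha_3 H_3$.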
What are the main applications of AdaBoost?
-AdaBoost is commonly used in image processing applications, especially for tasks like face detection. However, it can also be applied to non-image processing tasks. It is particularly effective when combined with other techniques in tasks that involve identifying patterns or features in images.
Why are weights increased exponentially in AdaBoost?
-In AdaBoost, the weights of misclassified points are increased exponentially to ensure that subsequent models focus on correcting these errors. This adaptive mechanism helps the algorithm hone in on the most challenging points to classify.
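Concretely, in the standard formulation with labels $y_i \in \{-1, +1\}$ (an assumption; the lecture only says "exponentially"), round $t$ updates each weight as:

```latex
w_i \leftarrow \frac{w_i \, \exp\!\bigl(-\alpha_t \, y_i \, h_t(x_i)\bigr)}{Z_t}
```

For a misclassified point, $y_i h_t(x_i) = -1$, so its weight is multiplied by $e^{\alpha_t} > 1$; correctly classified points are multiplied by $e^{-\alpha_t} < 1$; $Z_t$ renormalizes the weights to sum to one.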
How does AdaBoost compare to Gradient Boosted Decision Trees (GBDT) in terms of usage?
-While AdaBoost is effective in certain areas like face detection, Gradient Boosted Decision Trees (GBDT) are more commonly used in general-purpose machine learning, particularly in internet companies. GBDT tends to be preferred due to its flexibility and performance across various tasks.
What is the final model in AdaBoost composed of?
-The final model in AdaBoost is a weighted sum of the weak learners' predictions. Each weak learner contributes according to its Alpha value, which represents its accuracy in the training process. The final model combines these weighted predictions to make a more accurate classification.
Outlines
📚 Introduction to AdaBoost in Data Science
This lecture introduces AdaBoost, a popular boosting algorithm also known as adaptive boosting. It's compared to Gradient Boosting, highlighting similarities and key differences. The lecturer explains that AdaBoost is widely used in image processing, particularly in face detection, though it also has non-image processing applications. A toy example from the University of Pennsylvania is used to illustrate the concepts. The example begins with a dataset (D1), where a decision stump (a simple decision tree with depth 1) is used to create the first model (H1), resulting in some errors.
🔄 Weighted Models and Adjusting for Errors
The concept of weighted models is introduced. Misclassified points are given more weight in AdaBoost by upsampling or adjusting their weights. The lecturer demonstrates this by increasing the size of misclassified points (positive points in this case), which leads to a new dataset (D2) for training a second model (H2). The process of exponentially increasing weights for misclassified points is explained, and H2 attempts to classify the increased-weight points correctly, thus improving upon the previous model.
📈 Iterative Boosting Process with AdaBoost
In this paragraph, the iterative process of AdaBoost is further explored. After H2, the model still misclassifies some points, so the process continues with a third model (H3). The weights are adjusted again for the remaining misclassified points, and H3 creates a new hyperplane to classify them. The lecturer emphasizes that errors from earlier models are corrected in subsequent rounds. Finally, a combined model is constructed, which is the weighted sum of H1, H2, and H3, showcasing the core mechanism of AdaBoost.
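The three rounds described above condense into a short loop. Below is a from-scratch sketch of the standard discrete-AdaBoost recipe; the diagonal-boundary data, round count, and random seed are illustrative choices, not the lecture's toy set:

```python
# From-scratch sketch of discrete AdaBoost with decision stumps.
# Labels live in {-1, +1}; the data here is illustrative, not the lecture's.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # no single axis-parallel stump fits this

T = 3                                        # three rounds, mirroring H1, H2, H3
w = np.full(len(X), 1.0 / len(X))            # uniform initial weights
stumps, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error this round
    alpha = 0.5 * np.log((1 - eps) / eps)    # this stump's vote in the final sum
    w *= np.exp(-alpha * y * pred)           # exponential up/down-weighting of points
    w /= w.sum()                             # renormalize to a distribution
    stumps.append(stump)
    alphas.append(alpha)

# Final model: sign of the alpha-weighted sum of the stumps' votes.
F = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (F == y).mean())
```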
🤖 Comparison with Gradient Boosting and Final Thoughts
The final part compares AdaBoost with Gradient Boosting, noting that while AdaBoost adapts by focusing on misclassified points, Gradient Boosting uses pseudo-residuals from the negative gradient of the loss function. AdaBoost is called 'adaptive' because it adjusts to errors at each stage. Although AdaBoost has been successfully applied in face detection and other areas, the lecturer personally observes that Gradient Boosting (GBDT) is more commonly used in general-purpose machine learning, particularly in internet companies.
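To make this contrast concrete: in gradient boosting, round $t$ fits the new learner to pseudo-residuals instead of reweighting points. In standard notation (not written out in the lecture):

```latex
r_{i,t} = -\left.\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right|_{F = F_{t-1}}
```

For squared loss this reduces to the ordinary residual $r_{i,t} = y_i - F_{t-1}(x_i)$; AdaBoost instead keeps the original targets and changes the point weights.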
Keywords
💡AdaBoost
💡Gradient Boosting
💡Decision Tree
💡Hyperplane
💡Weights
💡Face Detection
💡Misclassified Points
💡Decision Stump
💡Alpha
💡Up-sampling
Highlights
Introduction to AdaBoost, an adaptive boosting algorithm used in computer vision and image processing.
AdaBoost is similar to Gradient Boosting but with key differences.
The lecture provides a geometric intuition of AdaBoost rather than a deep dive.
AdaBoost is most successful in image processing applications like face detection.
GBDT (Gradient Boosted Decision Trees) is a very good algorithm for general-purpose machine learning.
A toy example from the University of Pennsylvania's computer science department is used to explain AdaBoost.
Explanation of training a simple decision tree (stump) at stage 0 of AdaBoost.
How to compute the weight (Alpha) for a model based on its errors in AdaBoost.
The concept of up-sampling misclassified points to increase their weight in the next round.
The process of training the second model (H2) with increased weights on misclassified points.
How the weight of misclassified points is increased exponentially in AdaBoost.
The training of the third model (H3) focusing on the points misclassified by H1 and H2.
The final model is a combination of H1, H2, and H3 with their respective weights (Alphas).
Comparison of Alphas in AdaBoost to Gammas in GBDT.
The core idea of AdaBoost is adapting to errors made in previous stages.
AdaBoost is available in scikit-learn, alongside the gradient boosting classifier and regressor.
Personal observation that GBDT is more commonly used than AdaBoost in internet companies.
Recommendation to go through the lecture for further understanding and to leave comments about difficulties.
Transcripts
My dear friends, welcome to Rajasekhar Classes on applied data science with Python. This is lecture number 425. In this lecture we will try to understand the geometric intuition of AdaBoost. There is another popular boosting algorithm called AdaBoost; it is also called adaptive boosting. It is very similar to gradient boosting, but with a couple of key differences. So instead of going too deep into AdaBoost, I will give you a geometric intuition of what is happening inside it. AdaBoost is typically used in computer vision and image-processing-like applications. Of course, it is also used in non-image-processing applications, but the most successful applications of AdaBoost are in image processing, especially face detection, where it tries to detect where the face is in an image. For general-purpose stuff, GBDT is a very, very good algorithm. For this lecture I have taken a toy example from the University of Pennsylvania's computer science department. Somebody there has done a very nice example, and it would take me a lot of time to recreate it, so I am borrowing this example and content from them just for this particular class. Thanks a lot to the people who created this phenomenal content.
Now let's see. Let's assume this is my dataset: a plus is a positive data point and a minus is a negative data point. This is my dataset D1 at stage 0. From this I train a model; let's say I train a simple decision tree with depth equal to one. What does depth equal to one mean? It basically means a decision stump. And what is a decision stump? It basically gives me a hyperplane parallel to the y-axis or the x-axis. So when I train it, my first model H1 is basically a line like this. If you observe carefully, this is the first round: everything on the left side of the hyperplane is classified blue, everything on the right side is classified red. These three points become errors. Why? Because these pluses should be on the blue side, but they are on the red side. For this model, with these three pluses as errors, I compute a weight alpha 1; alpha 1 = 0.42. I compute this weight based on how many errors I got and how well the model performs on the original data. So I have my H1(x), which is nothing but a hyperplane, and a weight alpha 1, so from the first round I can calculate alpha 1 times H1(x). And I have these three points, these three plus symbols, which are misclassified.
Now, before I go into the second stage, the second round, I will increase the weight of these misclassified points, the pluses. You may remember that you can train a weighted decision tree or a weighted logistic regression; in general, a weighted model. What does a weighted model mean? One way to implement a weighted model is to up-sample the points that are erroneous. These pluses are erroneous in the first round because H1 misclassified them, and we want to give them more weight. In AdaBoost you give more weight to them by up-sampling them or by attaching a weight to each of these points. If you look at the figure, the size of the plus symbols has increased: the larger plus basically means these three misclassified points have been made bigger, or up-sampled, while the rest of the points, which were correctly classified, are kept as they are. This becomes my new dataset D2, and D2 becomes the dataset for my second model H2. When I go to H2, these points carry more weight because of the up-sampling. By how much do you increase the weight? Typically, you increase the weight exponentially; there is an exact method in AdaBoost for how to increase it. Since these points now have increased weights, H2 will prefer to classify them correctly. So if my H2 is like this, it has corrected all of my up-sampled points. And what is the dataset for this model? D2. Just remember that D2 is the dataset for H2; that is important.
So H2 has correctly classified all of the up-sampled points, but it has created these three new errors. Even for this model, based on these errors, I can calculate a weight; let me say alpha 2 = 0.65. So I have alpha 1 and H1 from round one, and alpha 2 and H2 from the second round. Because I now have these three errors, I will give more weight to these erroneous points, and I will train my third model. Just see this figure.
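[Editor's worked check, not part of the lecture audio: with the standard rule $\alpha_t = \tfrac{1}{2}\ln((1-\varepsilon_t)/\varepsilon_t)$ and assuming the 10-point toy set of the original Penn example, round 1 has $\varepsilon_1 = 3/10$, and after reweighting, the three points H2 gets wrong carry about $0.215$ of the total weight:]

```latex
\alpha_1 = \tfrac{1}{2}\ln\!\frac{0.7}{0.3} \approx 0.42,
\qquad
\alpha_2 = \tfrac{1}{2}\ln\!\frac{1 - 0.215}{0.215} \approx 0.65
```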
This is H1, this is H2, and this is H3, the third model. Because we are giving more weight to those three negative points, H3 creates a hyperplane like this, parallel to the x-axis, and it says the points above this hyperplane are positive and the points below it are negative. Now, above the hyperplane I have only one point that is an error: H3 says positive, but there is a negative point there. Below the hyperplane there are a couple of error points: H3 says negative, but there are two positive points. But those two positive points would have been taken care of by H1, and that particular negative point would have been taken care of by H2. So eventually, my final model will look like alpha 1 H1 + alpha 2 H2 + alpha 3 H3. This becomes my final model F(x), which is basically 0.42 times H1, that is alpha 1 H1, plus alpha 2 H2, plus alpha 3 times this hyperplane H3.
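[Editor's sketch, not part of the lecture audio: combining the three stumps numerically. The lecture gives alpha 1 = 0.42 and alpha 2 = 0.65 but never states alpha 3; 0.92 is the value from the original Penn/Schapire slides, used here as an assumption.]

```python
# Combined toy model: F(x) = sign(a1*H1(x) + a2*H2(x) + a3*H3(x)).
# alpha3 = 0.92 is assumed from the original Penn slides; the lecture omits it.
import numpy as np

alphas = np.array([0.42, 0.65, 0.92])

def combined_vote(h_votes):
    """h_votes: the three stumps' {-1, +1} predictions for one point."""
    return np.sign(alphas @ np.asarray(h_votes))

# A point H1 gets wrong but H2 and H3 get right is still classified correctly:
print(combined_vote([-1, +1, +1]))  # 0.42*(-1) + 0.65 + 0.92 = 1.15 -> 1.0
```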
Your alpha here is the equivalent of the gammas in GBDT; I can compare these alphas with the gammas. So what are you doing effectively? The key difference is this: in the case of GBDT, we used pseudo-residuals computed from the negative gradient of the loss function; here, in each round, all the points that are misclassified, the erroneous points, are given more weight. That is why it is called adaptive boosting: at every stage you are adapting to the errors that were made in the previous stage, and you are giving more importance to the points that were misclassified. This is the core idea of AdaBoost. And as you have seen, you can get AdaBoost in scikit-learn, just like the gradient boosting classifier and gradient boosting regressor.
Personally, I have not seen AdaBoost used extensively in internet companies, but of course I have seen AdaBoost and variations of it being used in things like face detection. Speaking from my own personal biases, I have seen GBDT being applied much, much more than AdaBoost, especially in internet companies; I may be wrong about other kinds of companies, but at internet companies, for general-purpose machine learning, GBDTs are more often used than AdaBoost. There are many articles on this; for example, there is an article related to this same AdaBoost idea, on face detection using boosting. I request all of you to go through this lecture, and if you have any difficulty, please leave a comment. Thank you very much.