ADABOOST: GEOMETRIC INTUITION LEC # 425

Rajasekhar Classes
26 Dec 2022 · 14:45

Summary

TL;DR: In this lecture on applied data science, the instructor discusses AdaBoost, a popular boosting algorithm, and gives a geometric interpretation of how it works. AdaBoost, commonly used in computer vision and image processing, adjusts the weights of misclassified data points in each iteration, thereby adapting to errors. The lecture compares AdaBoost with Gradient Boosting, highlighting their differences and similarities. AdaBoost is explained step by step with a toy example, emphasizing its effectiveness in face detection, while noting that Gradient Boosting is more widely used for general-purpose machine learning tasks.

Takeaways

  • 📚 AdaBoost is a popular boosting algorithm, also known as Adaptive Boosting, often compared with Gradient Boosting but with key differences.
  • 🔍 AdaBoost is widely used in computer vision and image processing applications, especially in face detection, although it can be used in other fields.
  • 🧠 The core idea of AdaBoost is to adaptively give more weight to misclassified points during each iteration of the model training process.
  • 🌟 The process starts with a decision tree, typically a decision stump (a shallow tree with depth = 1), to classify the data into different regions.
  • 📈 After each round of training, the misclassified points are given higher weight, which influences the next model to focus on correcting these errors.
  • 📊 The final AdaBoost model is a combination of several weak classifiers (decision stumps), each weighted by how well they perform on the data.
  • 🎯 The key difference between AdaBoost and Gradient Boosting is that AdaBoost assigns higher weight to misclassified points, while Gradient Boosting uses the negative gradient of the loss function.
  • 💡 In AdaBoost, weights are updated exponentially for the misclassified points to emphasize their importance in the subsequent rounds of training.
  • 🚀 AdaBoost’s effectiveness is demonstrated in tasks like face detection, but for general-purpose machine learning, Gradient Boosted Decision Trees (GBDT) are often more widely used.
  • 🛠️ AdaBoost is available in libraries like Scikit-learn (see the sketch below), and its variations can be found in other boosting algorithms.
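As a concrete reference point, here is a minimal scikit-learn sketch of the setup the takeaways describe: an AdaBoost classifier whose weak learners are decision stumps. The synthetic dataset and parameter values are illustrative, not from the lecture; note that the stump is passed as `estimator` in scikit-learn 1.2+ (older releases call it `base_estimator`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative 2-D binary classification data (not the lecture's toy example).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost over decision stumps (trees of depth 1), as in the lecture.
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # the weak learner
    n_estimators=50,
    random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```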

Q & A

  • What is AdaBoost, and how does it differ from gradient boosting?

    -AdaBoost, also known as Adaptive Boosting, is a popular boosting algorithm similar to gradient boosting, but with key differences. AdaBoost focuses on adapting to errors by increasing the weights of misclassified points, whereas gradient boosting uses pseudo-residuals computed from the negative gradient of the loss function.

  • What is a decision stump in the context of AdaBoost?

    -A decision stump is a weak learner in AdaBoost that is essentially a decision tree with a depth of one. It creates a simple model, usually represented as a hyperplane parallel to the x or y axis, which separates data into two classes.

  • How are misclassified points handled in AdaBoost?

    -In AdaBoost, misclassified points are given more weight in the next round of training. This is done by either up-sampling the misclassified points or explicitly assigning higher weights to them. The goal is to make the next model focus more on correcting these errors.

  • What happens at each stage of AdaBoost when a new model is trained?

    -At each stage of AdaBoost, a new weak learner (like a decision stump) is trained on the weighted dataset. The weights of misclassified points from the previous round are increased, and the new model attempts to classify them correctly.

  • What role does the weight (Alpha) play in AdaBoost?

    -The weight (Alpha) in AdaBoost determines the influence of each weak learner on the final model. The weight is calculated based on how well the model performs on the training data, with lower error rates resulting in higher Alpha values (see the formulas after this Q&A).

  • How does AdaBoost combine multiple weak learners?

    -AdaBoost combines multiple weak learners by weighting their predictions using their respective Alpha values. The final model is a weighted sum of the individual models' predictions, allowing it to make more accurate classifications.

  • What are the main applications of AdaBoost?

    -AdaBoost is commonly used in image processing applications, especially for tasks like face detection. However, it can also be applied to non-image processing tasks. It is particularly effective when combined with other techniques in tasks that involve identifying patterns or features in images.

  • Why are weights increased exponentially in AdaBoost?

    -In AdaBoost, the weights of misclassified points are increased exponentially to ensure that subsequent models focus on correcting these errors. This adaptive mechanism helps the algorithm hone in on the most challenging points to classify.

  • How does AdaBoost compare to Gradient Boosted Decision Trees (GBDT) in terms of usage?

    -While AdaBoost is effective in certain areas like face detection, Gradient Boosted Decision Trees (GBDT) are more commonly used in general-purpose machine learning, particularly in internet companies. GBDT tends to be preferred due to its flexibility and performance across various tasks.

  • What is the final model in AdaBoost composed of?

    -The final model in AdaBoost is a weighted sum of the weak learners' predictions. Each weak learner contributes according to its Alpha value, which represents its accuracy in the training process. The final model combines these weighted predictions to make a more accurate classification.
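The answers above quote specific Alpha values without giving the formulas. In the standard AdaBoost formulation (assumed here; the lecture itself stays geometric), the per-round weight and the final classifier are:

$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \qquad F(x) = \operatorname{sign}\Big(\sum_{t=1}^{T}\alpha_t\,h_t(x)\Big)$$

where $\epsilon_t$ is the weighted training error of weak learner $h_t$. A lower error gives a larger $\alpha_t$, exactly as stated; for example, $\epsilon_t = 0.30$ gives $\alpha_t = \frac{1}{2}\ln(0.70/0.30) \approx 0.42$, matching the lecture's Alpha One.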

Outlines

00:00

📚 Introduction to AdaBoost in Data Science

This lecture introduces AdaBoost, a popular boosting algorithm also known as adaptive boosting. It's compared to Gradient Boosting, highlighting similarities and key differences. The lecturer explains that AdaBoost is widely used in image processing, particularly in face detection, though it also has non-image processing applications. A toy example from the University of Pennsylvania is used to illustrate the concepts. The example begins with a dataset (D1), where a decision stump (a simple decision tree with depth 1) is used to create the first model (H1), resulting in some errors.

05:02

🔄 Weighted Models and Adjusting for Errors

The concept of weighted models is introduced. Misclassified points are given more weight in AdaBoost by upsampling or adjusting their weights. The lecturer demonstrates this by increasing the size of misclassified points (positive points in this case), which leads to a new dataset (D2) for training a second model (H2). The process of exponentially increasing weights for misclassified points is explained, and H2 attempts to classify the increased-weight points correctly, thus improving upon the previous model.

10:04

📈 Iterative Boosting Process with AdaBoost

In this paragraph, the iterative process of AdaBoost is further explored. After H2, the model still misclassifies some points, so the process continues with a third model (H3). The weights are adjusted again for the remaining misclassified points, and H3 creates a new hyperplane to classify them. The lecturer emphasizes that errors from earlier models are corrected in subsequent rounds. Finally, a combined model is constructed, which is the weighted sum of H1, H2, and H3, showcasing the core mechanism of AdaBoost.
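To make the three-round loop concrete, here is a compact sketch of the same procedure in Python. It assumes the standard AdaBoost weight update (the outline names the mechanism but not the exact formulas), labels in {-1, +1}, and a weighted training error strictly between 0 and 0.5; it is an illustration, not the lecture's own code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=3):
    """Sketch of AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # D1: uniform weights at stage 0
    stumps, alphas = [], []
    for t in range(n_rounds):
        # Train the round-t weak learner (H1, H2, H3, ...) on weighted data.
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()               # weighted training error
        alpha = 0.5 * np.log((1 - eps) / eps)  # e.g. eps=0.30 -> alpha~0.42
        # Exponentially up-weight misclassified points, then renormalize;
        # the reweighted data plays the role of D2, D3, ... in the outline.
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Final model: sign of the alpha-weighted sum of the weak learners.
    scores = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(scores)
```

Passing `sample_weight` to the stump is the in-place alternative to the up-sampling view used in the lecture; both make the heavy points count more in the next round's split.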

🤖 Comparison with Gradient Boosting and Final Thoughts

The final part compares AdaBoost with Gradient Boosting, noting that while AdaBoost adapts by focusing on misclassified points, Gradient Boosting uses pseudo-residuals from the negative gradient of the loss function. AdaBoost is called 'adaptive' because it adjusts to errors at each stage. Although AdaBoost has been successfully applied in face detection and other areas, the lecturer personally observes that Gradient Boosting (GBDT) is more commonly used in general-purpose machine learning, particularly in internet companies.

Keywords

💡AdaBoost

AdaBoost, short for Adaptive Boosting, is a popular boosting algorithm. It is used to improve the performance of weak learners by adapting to the errors made in previous stages. In the video, AdaBoost is highlighted as an important algorithm, especially for tasks like face detection, where it identifies and corrects misclassified points in iterative steps.

💡Gradient Boosting

Gradient Boosting is another boosting technique mentioned in the video, often compared to AdaBoost. It uses pseudo-residuals computed from the negative gradient of the loss function. The speaker mentions that while AdaBoost is frequently used in image processing, Gradient Boosting (GBDT) is more commonly employed in general-purpose machine learning tasks, particularly in internet companies.

💡Decision Tree

A Decision Tree is a type of predictive model used in both AdaBoost and Gradient Boosting. In the video, a 'decision stump' (a tree with depth one) is used as the weak learner for AdaBoost, forming the basis of the boosting process. The tree makes predictions by dividing data into different segments based on feature values.
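As an illustration of 'depth equal to one', a stump learned by scikit-learn is literally one feature index plus one threshold, i.e. a single axis-parallel split; the tiny dataset below is made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Four 1-D points: the best depth-1 split is a threshold between 2.0 and 3.0.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
# The root node stores the whole model: one feature and one cut point.
print(stump.tree_.feature[0], stump.tree_.threshold[0])  # prints: 0 2.5
```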

💡Hyperplane

A hyperplane is a decision boundary that separates different classes in a dataset. In the context of AdaBoost, the video describes the creation of hyperplanes by decision trees that attempt to classify data points into positive or negative categories. Misclassified points lead to changes in the hyperplane for the next iteration of the model.

💡Weights

Weights refer to the importance given to certain data points in the AdaBoost process. In each iteration, the algorithm increases the weight of misclassified points to focus on correcting them in the next stage. The video explains how these weights are crucial for adapting the model to handle previously misclassified points more effectively.
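In the usual formulation (an assumption here, since the video describes the update only geometrically), the per-point weights are updated multiplicatively after round $t$ and then renormalized:

$$w_i \leftarrow \frac{w_i\,e^{-\alpha_t\,y_i\,h_t(x_i)}}{Z_t}$$

where $Z_t$ normalizes the weights to sum to one. A misclassified point has $y_i\,h_t(x_i) = -1$, so its weight is multiplied by $e^{\alpha_t} > 1$, which is the exponential increase described above.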

💡Face Detection

Face detection is one of the most successful applications of AdaBoost, as mentioned in the video. The algorithm is commonly used to locate faces in images by focusing on areas where errors in classification (e.g., misidentifying a face) occur, and boosting the model's accuracy in recognizing them in future iterations.

💡Misclassified Points

Misclassified points are the data points that a model incorrectly predicts. In AdaBoost, these points are given more weight in subsequent iterations to improve the model's performance. The video describes how misclassified points become larger in size to indicate their increased weight, and the algorithm adjusts its focus to classify them correctly.

💡Decision Stump

A decision stump is a simple decision tree with only one level (or depth of one). In the video, a decision stump is used as the weak learner in the initial stage of AdaBoost, providing a simple classification that the algorithm builds upon by correcting errors in subsequent stages.

💡Alpha

Alpha represents the weight assigned to each model or hypothesis in AdaBoost, determining its contribution to the final prediction. The video explains how Alpha is calculated based on the model's performance in each iteration, with an example of Alpha values (e.g., 0.42, 0.65) being computed for different rounds.
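The quoted values are consistent with the standard formula $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$; the error rates below are back-calculated from the quoted Alphas, not stated in the video:

$$\epsilon_1 = 0.30 \;\Rightarrow\; \alpha_1 = \tfrac{1}{2}\ln\tfrac{0.70}{0.30} \approx 0.42, \qquad \epsilon_2 = 0.21 \;\Rightarrow\; \alpha_2 = \tfrac{1}{2}\ln\tfrac{0.79}{0.21} \approx 0.65$$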

💡Up-sampling

Up-sampling refers to the process of increasing the importance of certain data points, often by replicating or giving them more weight. In AdaBoost, this is done for misclassified points, ensuring that the model pays more attention to them in future iterations. The video describes how the size of misclassified points increases visually to reflect this process.
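A minimal sketch of what up-sampling by weight can look like in practice; the weight values are hypothetical (three misclassified points carrying triple weight), not taken from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical round-one weights: points 3-5 were misclassified and up-weighted.
w = np.array([1, 1, 1, 3, 3, 3, 1, 1, 1, 1], dtype=float)
p = w / w.sum()

# Up-sampling: resample so that heavy points appear proportionally more often,
# which is one way to realize a "weighted model" without sample_weight support.
idx = rng.choice(len(w), size=len(w), replace=True, p=p)
print(np.bincount(idx, minlength=len(w)))  # counts cluster around indices 3-5
```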

Highlights

Introduction to AdaBoost, an adaptive boosting algorithm used in computer vision and image processing.

AdaBoost is similar to Gradient Boosting but with key differences.

The lecture provides a geometric intuition of AdaBoost rather than a deep dive.

AdaBoost is most successful in image processing applications like face detection.

GBDT (Gradient Boosted Decision Trees) is a very good algorithm for general-purpose machine learning.

A toy example from the University of Pennsylvania's computer science department is used to explain AdaBoost.

Explanation of training a simple decision tree (stump) at stage 0 of AdaBoost.

How to compute the weight (Alpha) for a model based on its errors in AdaBoost.

The concept of up-sampling misclassified points to increase their weight in the next round.

The process of training the second model (H2) with increased weights on misclassified points.

How the weight of misclassified points is increased exponentially in AdaBoost.

The training of the third model (H3) focusing on the points misclassified by H1 and H2.

The final model is a combination of H1, H2, and H3 with their respective weights (Alphas).

Comparison of Alphas in AdaBoost to Gammas in GBDT.

The core idea of AdaBoost is adapting to the errors made in previous stages.

AdaBoost is available in scikit-learn, as are gradient boosting classifiers and regressors.

Personal observation that GBDT is more commonly used than AdaBoost in internet companies.

Recommendation to go through the lecture for further understanding and to leave comments about any difficulties.

Transcripts

play00:00

my dear friends welcome to rajasekhar

play00:03

classes on applied data science with

play00:06

python this is lecture number 425 in

play00:10

this lecture we will try to understand

play00:13

geometric intuition of AdaBoost there

play00:18

is an other popular boosting algorithm

play00:21

called Ada boost it is also called as

play00:26

adaptive boosting it is very similar to

play00:30

gradient boosting but with a couple of

play00:33

key differences so what I will do is

play00:38

instead of going into too deep in Ada

play00:43

boost I will give you a geometric

play00:45

intuition of what is happening in Ada

play00:49

boost AdaBoost is typically used in computer

play00:53

vision or image processing like

play00:57

applications of course it is also used

play01:00

in non-image processing applications but

play01:04

most successful applications of ADA

play01:07

boost are in image processing especially

play01:11

in the case of face detection where

play01:15

wherein it tries to detect where is the

play01:18

face in an image but for general

play01:21

purpose stuff

play01:23

gbdt is very very good algorithm super

play01:28

good algorithm so for this what I have

play01:32

done I have taken a toy example a toy

play01:36

example from University of Pennsylvania

play01:38

computer science department somebody has

play01:43

some somebody has done very very nice

play01:47

example and it would take me a lot of

play01:51

time to recreate this I am borrowing

play01:53

this example and content from them just

play01:57

just for this particular class thanks a

play01:59

lot to the people who created this

play02:02

phenomenal content by the way now let's

play02:05

see let's assume this is my data set my

play02:08

plus here is positive data point and

play02:13

minus is a negative data point let's

play02:16

assume this is my data set D1 at stage

play02:19

0. this is at stage 0. from this I train

play02:24

a model let's assume I tried a simple

play02:28

decision tree let's assume I tried a

play02:32

decision tree with a depth equal to one

play02:35

what is depth equal to one it means

play02:38

depth equal to one basically means is a

play02:42

decision stump and what is the decision

play02:45

stump it is basically parallel to Y axis

play02:49

or x axis I just basically get a

play02:54

hyperplane so here when I train it what

play02:58

I get my model H1 my first model is

play03:03

basically a line like this isn't it is

play03:07

it not H1 is a line which we If You

play03:11

observe carefully this is this is first

play03:13

round if you everything everything

play03:17

left side of hyperplane is is just see

play03:22

what is everything on left side of hyper

play03:24

plane that's blue everything right side

play03:27

of the hyper plane is let me say red

play03:29

these three points become error why

play03:32

because this plus must be in on blue

play03:36

side they are on red side now for this

play03:38

model I compute these three pluses are

play03:42

errors for model for this model I

play03:45

compute a weight Alpha One what is alpha

play03:48

1 0.42 I compute the weight based on how

play03:53

many errors here how many errors we have

play03:55

three pluses and I I compute the weight

play03:59

based on how many errors I got how well

play04:02

this model performs on the original data

play04:05

and things like that so I get a weight

play04:09

Alpha One equal to 0.42 I have my H1 of

play04:14

X which is nothing but hyperplane isn't

play04:18

it and Alpha One I have Alpha One even I

play04:22

have H1 of X in first round I can

play04:25

calculate Alpha 1 into H one of X I have

play04:30

these three points they are these three

play04:32

points are misclassified which which

play04:35

points are misclassified these three

play04:37

plus symbols are misclassified now what

play04:41

I will do here before I am going into

play04:44

second stage or second round I will

play04:47

increase the weight on these points I

play04:49

will increase the weight of this

play04:52

misclassified points that are plus isn't

play04:54

it you may you may remember you can do a

play04:58

weighted decision tree you can do

play05:01

weighted logistic regression or you can

play05:04

say weighted model in general in general

play05:07

I can say it is a weighted model isn't

play05:10

it so what do what what do weighted

play05:13

model means one way to implement

play05:16

weighted model is up sample the points

play05:20

because these points erroneous this plus

play05:23

these points are erroneous you see

play05:27

these points these points are erroneous

play05:31

erroneous at stage one what is a round

play05:34

first round isn't it because H1

play05:37

misclassified this this X One what did

play05:40

this H1 misclassified these three points

play05:43

is isn't it we want to give more weights

play05:46

to them what is in AdaBoost what you

play05:48

have to do you have to give more weight

play05:50

to them by up sampling them or by

play05:54

creating a weight

play05:56

for each of these points if you see the

play05:59

size of plus is increased just say the

play06:02

size of here size of the plus here size

play06:04

of plus size of the plus is increased

play06:07

the larger plus basically means these

play06:10

three points are misclassified we are

play06:13

making them big or we are up sampling

play06:16

them while keeping the rest of the

play06:19

points which are correctly classified as

play06:22

it is now D2 becomes the what is my ah

play06:26

D2 this is my D2

play06:29

yes this D2 my D2 becomes the data set

play06:33

for my H2 when I go to H2 this is my H1

play06:37

isn't it where is my H1 this is my H1

play06:40

this is my H2 this is my D2 D2 becomes

play06:44

the data set for H2 when I go to H2 when

play06:49

I go to H2 what happens I have I I have

play06:53

more just see in X these These Are

play06:56

points I have more weightage here

play06:59

because this looks big plus isn't it in

play07:03

this point I am doing what you are doing

play07:05

I am doing up sampling so if you see the

play07:09

size of plus is also increased let's

play07:12

assume I am increasing the weight by

play07:15

some value typically typically speaking

play07:18

you increase the weight

play07:21

exponentially you increase the weight

play07:23

exponentially in adaptive boosting there

play07:27

is an exact method in Ada boosting on

play07:31

how to increase the weight but since

play07:34

these are increased weights just see

play07:37

these are the increased weights are they

play07:40

not increased weights these are increased

play07:43

these are the increased weights it will

play07:45

prefer to classify them correctly isn't

play07:48

it H2 will try to classify these points

play07:52

these increased weights correctly so if

play07:54

my H2 will try to classify correctly if

play07:57

my H2 is like this my H2 is is like this

play08:01

it is it it is corrected all of my over

play08:04

sampling points just see it is corrected

play08:06

all of my was what is the data set for

play08:09

this one this D2 is data set for this

play08:12

one just remember this one is

play08:16

this one is

play08:18

this one is data set

play08:20

data set for H2 that is important isn't

play08:23

it now

play08:24

now it is correctly classified correct it is

play08:28

corrected all of my over sampling points

play08:30

these are the over sampling points it is

play08:33

correctly correctly classified which one

play08:35

is correctly classified H2 is correctly

play08:38

classified but it has created these errors

play08:41

just see these these three errors

play08:44

even for this model based on these

play08:48

errors I can calculate Alpha two let me

play08:51

say Alpha 2 equal to 0.65 isn't it so I

play08:54

have H1 Alpha One

play08:57

H1 I have let me write

play09:00

H1 alpha 1 from round one isn't it now

play09:05

you have H2 Alpha 2 from second round

play09:09

isn't it yes I have H1 Alpha 1 in first

play09:12

round H2 Alpha 2 Alpha 2 H2 in

play09:14

second round because because these three

play09:18

these three errors because I have these

play09:21

three errors no no I will give more

play09:24

weightage to these points which points

play09:26

these erroneous points I have to give

play09:29

more more weightage now I will train my

play09:33

third model I have to train my third

play09:36

model just see this one yes

play09:40

this is

play09:41

what is this this is H1 this is H2 this

play09:46

is H3

play09:47

now this is ah what is this this is my

play09:51

H1

play09:53

this is my H2 and third model looks like

play09:56

this this is H3 this is H3 isn't it I am

play10:00

giving more weight to three negative

play10:03

points these parts these are we are

play10:06

giving more weight just see you are

play10:08

giving more weights to that three

play10:10

negative points it it will create hyper

play10:13

plane hyperplane hyper plane like this

play10:16

H3 this is the hyperplane now hyperplane

play10:19

like this parallel to x axis is it not

play10:21

this is parallel to parallel to x axis

play10:24

hyperplane parallel to x axis and it

play10:26

says and it says what it says the points

play10:30

above this particular hyper plane are

play10:33

positive points the points below this H3

play10:36

are negative points isn't it now I have

play10:40

only one point which is error this one

play10:43

which one is error yes yes I have only

play10:47

one point this is error I have only one

play10:51

point this error actually you are saying

play10:54

is here positive but you have negative

play10:55

Point here I have two errors you are

play10:58

saying negative but you have positive

play11:00

points now I have only one point which

play11:03

is error in the above hyperplane in the

play11:06

above hyperplane in the above hyperplane

play11:09

I have this error isn't it isn't it I

play11:12

have only one error Point here in the

play11:15

below hyper plane there are couple of

play11:17

points uh points which are error this is

play11:20

error this is also error here but these

play11:24

two positive points but these two

play11:26

positive points these two these two

play11:30

positive points would have been would

play11:34

have been taken care by H one this one

play11:37

would have been taken care by H one

play11:40

isn't it and this particular negative

play11:43

point would have been taken care by H2

play11:46

isn't it by H2 So eventually

play11:50

what I will do here is my final model

play11:54

will look like this my final model look

play11:57

like alpha 1 H 1 plus Alpha 2 h 2 plus

play12:01

Alpha 3 H3 this becomes my final model

play12:05

my F3 f of x which is basically

play12:09

0.42 times of H1 that is Alpha One H one

play12:13

plus this is Alpha One I can say this is

play12:18

Alpha One H one plus this is Alpha 2 h 2

play12:21

plus this is Alpha 3 this hyperplane H3

play12:25

isn't it your alpha is here

play12:29

equivalent to

play12:30

Gammas in gbdt in gbdt I can compare

play12:35

this Alphas with Gammas so here what

play12:38

what you are doing effective what you

play12:41

are doing effectively the key difference

play12:43

here in the case of gbdt in the case of

play12:48

in the case of GB DT we used pseudo

play12:52

residuals to compute to pseudo residuals

play12:56

computed from the negative gradient of

play13:01

the loss function here in the first

play13:03

round here in the first row here in the

play13:06

first round all all the at first round

play13:10

are all the points which are

play13:12

misclassified or erroneous points are

play13:14

given more weightage using some

play13:17

something called adaptive boosting it is

play13:20

called adaptive boosting because at

play13:22

every stage you are adapting to the

play13:25

errors that you are adapting to the

play13:28

errors that were made in previous stage

play13:30

and we are giving more importance to to

play13:33

those points which are misclassified

play13:36

this is the core idea of ADA boost and

play13:39

as you have seen you can get Ada boost

play13:42

in sklearn and gradient

play13:45

AdaBoost in sklearn and gradient boost

play13:48

classifier and gradient boost

play13:50

regressor so AdaBoost I have not

play13:52

seen it uh so AdaBoost I have not seen it used

play13:57

extensively in internet companies but of

play14:00

course I have seen Ada boost and

play14:02

variations of ADA boost being used in

play14:06

things like face detection personally

play14:09

speaking from my own personal biases

play14:12

I have seen gbdt being applied much much

play14:16

more much more than Ada boost especially

play14:19

in internet companies may be wrong with

play14:22

other companies especially at internet

play14:25

companies general purpose machine

play14:27

learning gbdts are more often used than

play14:30

AdaBoost there are so many articles

play14:32

this is this is an article related to

play14:35

same AdaBoost that is face detection using

play14:37

boosting I request all of you to go

play14:39

through this lecture if you have any

play14:41

difficulty please keep a comment thank

play14:44

you very much

Related Tags: AdaBoost, Boosting Algorithms, Machine Learning, Data Science, Face Detection, Computer Vision, Gradient Boosting, Weighted Models, Decision Trees, Image Processing