CatBoost Part 1: Ordered Target Encoding

StatQuest with Josh Starmer
26 Feb 2023 · 08:32

Summary

TL;DR: In this StatQuest episode, Josh Starmer explains ordered target encoding in CatBoost, a machine learning algorithm similar to gradient boosting. He shows how CatBoost avoids data leakage by treating each data row sequentially and by using a defined prior in its encoding equation. The video closes with the lesson that machine learning is ultimately about results, regardless of the method used.

Takeaways

  • 📚 The video is about CatBoost, a machine learning algorithm similar to Gradient Boost and XGBoost, and its approach to ordered target encoding.
  • 🔑 CatBoost stands for 'Categorical Boosting' due to its unique method of dealing with categorical variables.
  • 🚫 Basic target encoding can cause data leakage, which results in models that perform well on training data but poorly on testing data.
  • 🔄 The script discusses k-fold target encoding as a method to reduce leakage by splitting data into groups.
  • 🎯 CatBoost avoids leakage by treating each row of data sequentially and using a defined prior instead of an overall mean.
  • 📉 The CatBoost encoding equation simplifies the denominator by adding 1 to the number of rows, rather than using a weight.
  • 🔢 CatBoost uses a prior set to 0.05 for encoding categorical features when no previous data is available.
  • 🔄 Ordered target encoding in CatBoost is influenced by the order of the data, making each occurrence of a category unique.
  • 🔢 The encoding process involves calculating option counts and using previous occurrences to determine the value for each category.
  • 📈 After creating a CatBoost model, the entire dataset is used to target encode new data for classification.
  • 📘 The video emphasizes that machine learning is about achieving results, regardless of the method used, as long as it works effectively.
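The encoding process outlined in the takeaways above can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the actual CatBoost implementation; the function name is ours, and the prior of 0.05 follows the examples shown in the video.

```python
def ordered_target_encode(categories, targets, prior=0.05):
    """Encode each category using only the rows that came before it:
    (option_count + prior) / (n + 1), where option_count is the number
    of earlier rows with the same category whose target is 1, and n is
    the number of earlier rows with the same category."""
    seen = {}  # category -> (option_count, n), counted over earlier rows only
    encoded = []
    for category, target in zip(categories, targets):
        option_count, n = seen.get(category, (0, 0))
        encoded.append((option_count + prior) / (n + 1))
        # Update the running counts *after* encoding, so the current
        # row's own target never leaks into its own encoded value.
        seen[category] = (option_count + target, n + 1)
    return encoded
```

Because no data precedes the first occurrence of a category, every first occurrence is encoded as (0 + 0.05) / (0 + 1) = 0.05, matching the video.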

Q & A

  • What is the main topic of the video script?

    -The main topic of the video script is CatBoost, specifically its ordered target encoding of categorical features.

  • What is the issue with basic target encoding?

    -The issue with basic target encoding is that it can lead to data leakage, where the target value of each row is used to modify the same row's value in the categorical feature, resulting in models that perform well on training data but poorly on testing data.

  • What is the purpose of k-fold target encoding?

    -The purpose of k-fold target encoding is to reduce leakage by splitting the data into K groups, ensuring that the target value is not used to modify the same row's value directly.

  • Why is it suggested to convert features with only one or two options to zeros and ones instead of using target encoding?

    -It is suggested because features with only one or two options are essentially binary, and converting them to zeros and ones simplifies the model without the risk of leakage associated with target encoding.

  • What is the significance of CatBoost in the context of this script?

    -CatBoost is significant because it is a machine learning algorithm that is fundamentally similar to Gradient Boost and XGBoost, and it has a unique method for dealing with categorical variables, which is the focus of the script.

  • How does CatBoost avoid leakage when encoding categorical variables?

    -CatBoost avoids leakage by treating each row of data as if it were being fed into the algorithm sequentially, ignoring all other rows when encoding the first occurrence of a category, and using a defined prior or guess instead of an overall mean.

  • What is the defined prior or guess used by CatBoost in its encoding equation?

    -The defined prior or guess used by CatBoost in its encoding equation is typically set to 0.05, as mentioned in the script.

  • How does the order of data affect the encoding process in CatBoost?

    -The order of data affects the encoding process in CatBoost because it uses the information from previous rows to calculate the target encoding for the current row, making the encoding dependent on the sequence of data.

  • What is the term used to describe the target encoding method in CatBoost where the order of data matters?

    -The term used to describe this method is 'ordered target encoding'.

  • What is the final step after creating a CatBoost model according to the script?

    -The final step after creating a CatBoost model is to use the entire dataset to target encode the new data that you want to classify.

  • What lesson about machine learning does the script emphasize?

    -The script emphasizes that machine learning is all about results and doing whatever it takes to achieve them, regardless of the method used.
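The equation discussed in the answers above, (OptionCount + prior) / (n + 1) with the prior set to 0.05, reproduces every number quoted in the video. A quick check in Python (variable names are ours, chosen for illustration):

```python
prior = 0.05  # CatBoost's defined prior, per the video's examples

# encoding = (option_count + prior) / (n + 1)
first_blue  = (0 + prior) / (0 + 1)  # no earlier blue rows        -> ~0.05
second_blue = (1 + prior) / (1 + 1)  # one earlier blue, loved it  -> ~0.525
third_green = (2 + prior) / (2 + 1)  # both earlier greens loved   -> ~0.683
third_blue  = (1 + prior) / (2 + 1)  # one of two earlier blues    -> ~0.35
```

Note that n counts only how often the category has appeared before, while OptionCount counts how many of those earlier appearances had a positive target.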

Outlines

00:00

🌟 Introduction to CatBoost and Ordered Target Encoding

In this introductory segment, Josh from StatQuest explains the concept of CatBoost, a machine learning algorithm akin to Gradient Boost and XGBoost, and introduces the topic of ordered target encoding. He emphasizes the importance of avoiding data leakage in target encoding, which can lead to models that perform well on training data but poorly on testing data. Josh also briefly touches on the idea of k-fold target encoding as a method to reduce leakage. The segment sets the stage for a deeper dive into CatBoost's unique approach to encoding categorical features, which is central to its performance and the reason behind its name 'Categorical Boosting'.

05:01

📊 CatBoost's Ordered Target Encoding Technique

This paragraph delves into the specifics of how CatBoost performs ordered target encoding to avoid data leakage. It describes the process of treating each row of data sequentially and using a predefined prior or guess, set to 0.05, in the encoding equation. The method simplifies the denominator by adding 1 to the number of rows, rather than using a weight. CatBoost's encoding takes into account the order of data and uses values from all previous data to calculate the encoding for the current row. The summary illustrates this with examples, showing how the encoding changes as more data is processed. The paragraph concludes by noting that after creating a CatBoost model, the entire dataset is used to encode new data for classification. Josh also reflects on the practicality of the method, stating that while the motivation might be questionable, CatBoost's effectiveness in machine learning is undeniable.
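Putting the walkthrough together, the loop below runs the sequential rule over a small favorite-color table. The row order and target values here are an illustrative reconstruction consistent with the numbers quoted in the video, not necessarily the exact table shown on screen.

```python
prior = 0.05
# (favorite color, loves Troll 2), in the order the rows are "fed in"
rows = [("blue", 1), ("red", 0), ("green", 1),
        ("blue", 0), ("green", 1), ("green", 1), ("blue", 0)]

stats = {}    # color -> (earlier rows with target 1, earlier rows total)
encoded = []
for color, loves_troll2 in rows:
    option_count, n = stats.get(color, (0, 0))
    encoded.append(round((option_count + prior) / (n + 1), 3))
    stats[color] = (option_count + loves_troll2, n + 1)  # update afterwards

print(encoded)  # [0.05, 0.05, 0.05, 0.525, 0.525, 0.683, 0.35]
```

Reordering `rows` changes the encoded values, which is exactly why the method is called *ordered* target encoding.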

Mindmap

Keywords

💡CatBoost

CatBoost is a machine learning algorithm that is fundamentally similar to Gradient Boost and XGBoost. It is specifically designed to handle categorical variables efficiently. In the video, CatBoost is highlighted for its unique approach to encoding categorical features, which is central to its effectiveness in machine learning tasks. The term 'CatBoost' is derived from 'categorical boosting,' emphasizing its focus on categorical data.

💡Target Encoding

Target encoding is a technique used in machine learning to convert categorical variables into numerical values that can be used in predictive models. The script discusses the traditional method of target encoding and its potential issue of data leakage, where information from the target variable is inadvertently used in the training process, leading to overfitting. CatBoost introduces a modified approach to target encoding to mitigate this issue.

💡Data Leakage

Data leakage occurs when information from the target variable is used in the training process, which can lead to models that perform well on training data but poorly on unseen data. In the context of the video, data leakage is a problem with basic target encoding methods, as they can inadvertently use target information to modify the same row's feature values, leading to overfitting.

💡K-Fold Target Encoding

K-fold target encoding is a method to reduce data leakage by splitting the data into K groups and performing target encoding within each fold. This technique is mentioned in the script as a way to address the issue of data leakage in traditional target encoding methods, ensuring that the model does not have access to the target information during training.

💡Categorical Variables

Categorical variables are data types that consist of categories or groups rather than numerical values. In the video, categorical variables like 'favorite color' are used as examples to illustrate how CatBoost handles them. CatBoost's strength lies in its ability to effectively encode these variables, which is crucial for building accurate predictive models.

💡Ordered Target Encoding

Ordered target encoding is a method introduced by CatBoost that takes into account the order of data when encoding categorical variables. This approach avoids data leakage by treating each row of data as if it were being fed into the algorithm sequentially. The script explains how this method calculates encoding values based on the data that came before the current row, ensuring that the model does not use future information.

💡Gradient Boost

Gradient Boost is a type of ensemble machine learning technique that combines multiple weak predictive models to form a strong predictive model. The script mentions Gradient Boost as a comparison point for CatBoost, highlighting that they are similar at a fundamental level but differ in how they handle categorical variables.

💡XGBoost

XGBoost, short for eXtreme Gradient Boosting, is a popular machine learning algorithm that is efficient in handling large datasets and provides high predictive performance. In the video, XGBoost is mentioned alongside Gradient Boost and CatBoost, indicating that they are all related in terms of being boosting algorithms but differ in their approach to handling categorical data.

💡One-Hot and Label Encoding

One-hot encoding and label encoding are techniques for converting categorical variables into numerical formats that machine learning algorithms can use. The script refers back to a previous StatQuest that applied one-hot, label, and target encoding to a dataset used to predict a target variable, such as whether someone loves the movie Troll 2.

💡Machine Learning

Machine learning is a field of artificial intelligence that uses statistical methods to enable computer systems to learn from data and make predictions or decisions. The video emphasizes that machine learning is all about achieving results, and techniques like CatBoost's ordered target encoding are used to improve the performance of predictive models.

💡Statistical Methods

Statistical methods are mathematical techniques used to analyze and interpret data. In the context of the video, statistical methods such as target encoding and ordered target encoding are used to transform data into a form that can be effectively used by machine learning algorithms. The script highlights the importance of these methods in building accurate predictive models.

Highlights

Introduction to CatBoost and ordered target encoding.

CatBoost's similarity to the Gradient Boost and XGBoost algorithms.

The problem of data leakage in basic target encoding and its implications.

The concept of k-fold target encoding to reduce leakage.

CatBoost's unique approach to encoding categorical features to avoid leakage.

The treatment of each data row sequentially in CatBoost's encoding process.

Use of a defined prior, or guess, in CatBoost's target encoding equation.

Simplification of the target encoding denominator in CatBoost.

How CatBoost handles the first occurrence of categorical values in encoding.

The iterative process of updating categorical values based on previous data in CatBoost.

The importance of the data order in CatBoost's ordered target encoding.

Final target encoding of the entire dataset for classification in CatBoost.

The effectiveness and practicality of CatBoost in machine learning applications.

The lesson that machine learning is about achieving results through various methods.

StatQuest's resources for offline review of statistics and machine learning.

The StatQuest PDF study guides and Josh's book for deeper understanding.

Encouragement to subscribe for more StatQuest content.

Ways to support StatQuest through Patreon, channel membership, or merchandise.

Transcripts

Ordered target encoding, gonna do it for CatBoost! StatQuest!

Hello! I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about CatBoost, Part 1: Ordered Target Encoding. If you've got a big, huge CatBoost model and you run it in the cloud, you'd better use lightning. BAM! This StatQuest is also brought to you by the letters A, B, and C: Always Be Curious.

Note: this StatQuest assumes you are already familiar with target encoding. If not, check out the Quest. Also note: CatBoost is a machine learning algorithm that, at a fundamental level, is very similar to Gradient Boost and XGBoost, so if you're not already familiar with those methods, you might want to check out the Quests.

In the StatQuest on one-hot, label, and target encoding, we had this data, and we wanted to use Favorite Color and Height to predict if someone loves Troll 2, which is a really terrible movie. Then we applied one-hot, label, and target encoding to Favorite Color and talked about the pros and cons of using each method.

The problem with basic target encoding is that each row's target value, the thing we want to predict, is used to modify the same row's value in Favorite Color, and doing this sort of thing is a data science no-no that we call leakage. Leakage results in models that work great with training data but not so well with testing data. So we ultimately described k-fold target encoding, which splits the data into k groups to reduce leakage.

Now, if you read the original CatBoost manuscript, they point out that if we only have a single category, for example, if everyone loved the color blue, and we apply leave-one-out target encoding to the data, then all of the rows with Favorite Color equal to 0.33 correspond to people who love Troll 2, and all of the rows with Favorite Color equal to 0.5 correspond to the people who do not love Troll 2. And that means any tree that starts out by splitting on Favorite Color < 0.42 will classify each person in the training dataset perfectly, and that means we have leakage.

Now, to be honest, I think it's a little silly to include a variable that only has a single category in the first place. And even if we did, it is standard practice to convert features with only one or two options, like Loves Troll 2, to zeros and ones rather than use target encoding. So in this example, we would just convert Blue to 0 rather than use target encoding. So this example of leakage seems a little silly to me, because it should not happen. However, the creators of CatBoost didn't think it was silly, so they came up with a whole new way to encode categorical features. BAM! In fact, CatBoost is short for Categorical Boosting, because of how central dealing with categorical variables is to this method.

So let's go back to the original dataset and talk about how CatBoost encodes categorical features. The way CatBoost avoids leakage when encoding categorical variables starts with treating each row of data as if it were being fed into the algorithm sequentially. For example, CatBoost treats the first row with Blue as if it is all the data it has received so far, and that means CatBoost ignores all the other rows when target encoding the first occurrence of Blue.

Another thing different about CatBoost is the equation. The big difference is that instead of using an overall mean, it uses a defined prior, or guess, that in the examples I saw was set to 0.05. The CatBoost equation also simplifies the denominator by just adding 1 to the number of rows, rather than using a weight.

Now, given this equation, CatBoost plugs in values derived from all of the other data that came before the current row. And that means that, since we are starting with the first row and no data came before it, the OptionCount, the number of people we have seen before who, at this point, love Blue and Troll 2, is 0, and we plug in 0 for n, because there are no previous rows that have Blue as the favorite color. And when we do the math, we get 0.05, so we plug in 0.05 for Blue in the first row.

Now we work on the second row. Because none of the preceding rows also have Red as the favorite color, we set the OptionCount to 0 again, and again n = 0, so we replace Red in the second row with 0.05. Likewise, we replace Green in the third row with 0.05.

However, in the fourth row, things finally change. Now, because we've seen Blue before, in the first row, we use it to calculate the OptionCount. So, in this case, because the one person who liked Blue before also liked Troll 2, the OptionCount is equal to 1, and n = 1, so we plug in 0.525 for the second time we see Blue. Likewise, the second time we see Green, we plug in 0.525. However, the third time we see Green, we use the two previous times when calculating the OptionCount, and the OptionCount is 2, because both of the two previous people that liked Green also liked Troll 2. And because we've already seen two people who like Green, n = 2, and that means we replace the third occurrence of Green with 0.683. Lastly, we replace the third occurrence of Blue with 0.35, because only one of the two previous times we saw Blue also liked Troll 2.

And thus, this is how CatBoost performs target encoding. And because the order of the data makes a difference in the encoding, this method is called ordered target encoding. BAM!

Lastly, once you're done creating your CatBoost model, the entire dataset is used to target encode the new data that you want to classify.

Note: as I said earlier, I'm not certain the motivation for this method is really justifiable. However, the important thing is that, regardless of the justification, CatBoost works, and it works well. And that's an important lesson about machine learning: machine learning is all about results, and doing whatever it takes to get them. BAM!

Now it's time for some Shameless Self-Promotion. If you want to review statistics and machine learning offline, check out the StatQuest PDF study guides and my book, The StatQuest Illustrated Guide to Machine Learning, at statquest.org. There's something for everyone.

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs, or a t-shirt or a hoodie, or just donate. The links are in the description below. Alright, until next time, Quest on!


Related Tags

CatBoost, Target Encoding, Machine Learning, Data Science, Model Training, Categorical Features, Leakage Avoidance, Sequential Data, Algorithm Optimization, Statistical Analysis