CatBoost Part 1: Ordered Target Encoding
Summary
TL;DR: In this StatQuest episode, Josh Starmer discusses ordered target encoding in CatBoost, a machine learning algorithm similar to gradient boosting. He explains how CatBoost avoids data leakage by processing each row of data sequentially and by using a defined prior in its encoding equation. The video closes with the lesson that machine learning is ultimately about results, regardless of the method used.
Takeaways
- 📚 The video is about CatBoost, a machine learning algorithm similar to Gradient Boost and XGBoost, and its approach to ordered target encoding.
- 🔑 CatBoost stands for 'Categorical Boosting' due to its unique method of dealing with categorical variables.
- 🚫 Basic target encoding can cause data leakage, which results in models that perform well on training data but poorly on testing data.
- 🔄 The script discusses k-fold target encoding as a method to reduce leakage by splitting data into groups.
- 🎯 CatBoost avoids leakage by treating each row of data sequentially and using a defined prior instead of an overall mean.
- 📉 The CatBoost encoding equation simplifies the denominator by adding 1 to the number of rows, rather than using a weight.
- 🔢 CatBoost's encoding equation uses a prior (set to 0.05 in the video's examples), which is the value a category receives on its first occurrence, when no previous data is available.
- 🔄 Ordered target encoding in CatBoost is influenced by the order of the data, making each occurrence of a category unique.
- 🔢 The encoding process involves calculating option counts and using previous occurrences to determine the value for each category.
- 📈 After creating a CatBoost model, the entire dataset is used to target encode new data for classification.
- 📘 The video emphasizes that machine learning is about achieving results, regardless of the method used, as long as it works effectively.
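The encoding rule described in the takeaways above can be sketched in a few lines of Python. This is a minimal sketch, not CatBoost's implementation: the prior of 0.05, the blue/red/green rows, and the resulting values come from the video, while the 0/1 "Loves Troll 2" targets are illustrative, chosen to reproduce the values shown on screen.

```python
def ordered_target_encode(categories, targets, prior=0.05):
    """Encode each row using ONLY the rows that came before it:
    encoding = (option_count + prior) / (n + 1), where option_count is the
    number of earlier rows with this category whose target is 1, and n is
    the number of earlier rows with this category."""
    history = {}  # category -> (option_count, n)
    encoded = []
    for category, target in zip(categories, targets):
        option_count, n = history.get(category, (0, 0))
        encoded.append((option_count + prior) / (n + 1))
        history[category] = (option_count + target, n + 1)
    return encoded

# Colors follow the video's example; the targets are illustrative.
colors = ["blue", "red", "green", "blue", "green", "green", "blue"]
loves_troll2 = [1, 0, 1, 0, 1, 0, 1]
print([round(v, 3) for v in ordered_target_encode(colors, loves_troll2)])
# → [0.05, 0.05, 0.05, 0.525, 0.525, 0.683, 0.35]
```

Note how every first occurrence of a category gets the bare prior, 0.05, and later occurrences blend the prior with the running history for that category.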
Q & A
What is the main topic of the video script?
-The main topic of the video script is CatBoost, specifically ordered target encoding in the context of machine learning.
What is the issue with basic target encoding?
-The issue with basic target encoding is that it can lead to data leakage, where the target value of each row is used to modify the same row's value in the categorical feature, resulting in models that perform well on training data but poorly on testing data.
What is the purpose of k-fold target encoding?
-The purpose of k-fold target encoding is to reduce leakage by splitting the data into K groups, ensuring that the target value is not used to modify the same row's value directly.
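The k-fold idea can be sketched directly: encode each row from target statistics computed only on the other folds, so a row's own target never feeds its own encoding. This is a hedged illustration, not a reference implementation; real implementations differ in fold assignment and smoothing, and the round-robin folds and mean-blending below are my own simplifications.

```python
def kfold_target_encode(categories, targets, k=2, prior_weight=1.0):
    """K-fold target encoding sketch: each row is encoded from the rows
    OUTSIDE its own fold, blended with the overall mean for smoothing."""
    n = len(categories)
    folds = [i % k for i in range(n)]  # simple round-robin fold assignment
    overall_mean = sum(targets) / n
    encoded = []
    for i in range(n):
        # target values of out-of-fold rows sharing row i's category
        same = [targets[j] for j in range(n)
                if folds[j] != folds[i] and categories[j] == categories[i]]
        encoded.append((sum(same) + prior_weight * overall_mean)
                       / (len(same) + prior_weight))
    return encoded
```

A category with no out-of-fold occurrences simply falls back to the overall mean, which is the smoothing doing its job.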
Why is it suggested to convert features with only one or two options to zeros and ones instead of using target encoding?
-It is suggested because features with only one or two options are essentially binary, and converting them to zeros and ones simplifies the model without the risk of leakage associated with target encoding.
What is the significance of CatBoost in the context of this script?
-CatBoost is significant because it is a machine learning algorithm that is fundamentally similar to gradient boost and XGBoost, and it has a unique method for dealing with categorical variables, which is the focus of the script.
How does CatBoost avoid leakage when encoding categorical variables?
-CatBoost avoids leakage by treating each row of data as if it were being fed into the algorithm sequentially, ignoring all other rows when encoding the first occurrence of a category, and using a defined prior or guess instead of an overall mean.
What is the defined prior or guess used by CatBoost in its encoding equation?
-The defined prior, or guess, used by CatBoost in its encoding equation is set to 0.05 in the examples shown in the video.
How does the order of data affect the encoding process in CatBoost?
-The order of data affects the encoding process in CatBoost because it uses the information from previous rows to calculate the target encoding for the current row, making the encoding dependent on the sequence of data.
What is the term used to describe the target encoding method in CatBoost where the order of data matters?
-The term used to describe this method is 'ordered target encoding'.
What is the final step after creating a CatBoost model according to the script?
-The final step after creating a CatBoost model is to use the entire dataset to target encode the new data that you want to classify.
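The script does not show the arithmetic for this final step, so the following is only a plausible sketch: it assumes the same (option_count + prior) / (n + 1) formula is applied per category over the full training set, and the function names and the bare-prior fallback for unseen categories are my own.

```python
def fit_encoding_table(categories, targets, prior=0.05):
    """Build one encoding per category from the ENTIRE training set,
    reusing the (option_count + prior) / (n + 1) formula."""
    stats = {}  # category -> (option_count, n)
    for category, target in zip(categories, targets):
        option_count, n = stats.get(category, (0, 0))
        stats[category] = (option_count + target, n + 1)
    return {c: (oc + prior) / (n + 1) for c, (oc, n) in stats.items()}

def encode_new(table, categories, prior=0.05):
    """Encode new rows; a never-seen category falls back to the bare prior."""
    return [table.get(category, prior) for category in categories]

table = fit_encoding_table(["blue", "blue", "green"], [1, 0, 1])
print([round(v, 3) for v in encode_new(table, ["blue", "purple"])])
# → [0.35, 0.05]
```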
What lesson about machine learning does the script emphasize?
-The script emphasizes that machine learning is all about results and doing whatever it takes to achieve them, regardless of the method used.
Outlines
🌟 Introduction to CatBoost and Ordered Target Encoding
In this introductory segment, Josh from StatQuest explains the concept of CatBoost, a machine learning algorithm akin to Gradient Boost and XGBoost, and introduces the topic of ordered target encoding. He emphasizes the importance of avoiding data leakage in target encoding, which can lead to models that perform well on training data but poorly on testing data. Josh also briefly touches on the idea of k-fold target encoding as a method to reduce leakage. The segment sets the stage for a deeper dive into CatBoost's unique approach to encoding categorical features, which is central to its performance and the reason behind its name 'Categorical Boosting'.
📊 CatBoost's Ordered Target Encoding Technique
This paragraph delves into the specifics of how CatBoost performs ordered target encoding to avoid data leakage. It describes the process of treating each row of data sequentially and using a predefined prior or guess, set to 0.05, in the encoding equation. The method simplifies the denominator by adding 1 to the number of rows, rather than using a weight. CatBoost's encoding takes into account the order of data and uses values from all previous data to calculate the encoding for the current row. The summary illustrates this with examples, showing how the encoding changes as more data is processed. The paragraph concludes by noting that after creating a CatBoost model, the entire dataset is used to encode new data for classification. Josh also reflects on the practicality of the method, stating that while the motivation might be questionable, CatBoost's effectiveness in machine learning is undeniable.
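The order-dependence described above is easy to check directly. This sketch uses the same (count + prior) / (n + 1) rule with illustrative data: encoding the same three rows forwards and backwards yields different values for the same people.

```python
def encode(rows, prior=0.05):
    """rows: list of (category, target); returns per-row ordered encodings,
    each computed only from the rows that precede it."""
    seen = {}  # category -> (sum of earlier targets, count of earlier rows)
    out = []
    for category, target in rows:
        s, n = seen.get(category, (0, 0))
        out.append((s + prior) / (n + 1))
        seen[category] = (s + target, n + 1)
    return out

forward = [("blue", 1), ("blue", 1), ("blue", 0)]
backward = forward[::-1]
print([round(v, 3) for v in encode(forward)])   # → [0.05, 0.525, 0.683]
print([round(v, 3) for v in encode(backward)])  # → [0.05, 0.025, 0.35]
```

Same rows, different order, different encodings: that is exactly why the method is called *ordered* target encoding.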
Keywords
💡CatBoost
💡Target Encoding
💡Data Leakage
💡K-Fold Target Encoding
💡Categorical Variables
💡Ordered Target Encoding
💡Gradient Boost
💡XGBoost
💡One Hot Label Encoding
💡Machine Learning
💡Statistical Methods
Highlights
Introduction to CatBoost and ordered target encoding.
CatBoost's similarity to the Gradient Boost and XGBoost algorithms.
The problem of data leakage in basic target encoding and its implications.
The concept of k-fold target encoding to reduce leakage.
CatBoost's unique approach to encoding categorical features to avoid leakage.
The treatment of each data row sequentially in CatBoost's encoding process.
Use of a defined prior, or guess, in CatBoost's target encoding equation.
Simplification of the target encoding denominator in CatBoost.
How CatBoost handles the first occurrence of categorical values in encoding.
The iterative process of updating categorical values based on previous data in CatBoost.
The importance of the data order in CatBoost's ordered target encoding.
Use of the entire dataset to target encode new data for classification.
The effectiveness and practicality of CatBoost in machine learning applications.
The lesson that machine learning is about achieving results through various methods.
StatQuest's resources for offline review of statistics and machine learning.
The StatQuest PDF study guides and Josh's book for deeper understanding.
Encouragement to subscribe for more StatQuest content.
Ways to support StatQuest through Patreon, channel membership, or merchandise.
Transcripts
Ordered target encoding, gonna do it for CatBoost... StatQuest!

Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about CatBoost, part one: ordered target encoding. If you've got a big huge CatBoost model and you run it in the cloud, you better use lightning. Bam!

This StatQuest is also brought to you by the letters A, B, and C: Always Be Curious.

Note: this StatQuest assumes you are already familiar with target encoding. If not, check out the Quest.
Also note: CatBoost is a machine learning algorithm that, at a fundamental level, is very similar to Gradient Boost and XGBoost, so if you're not already familiar with those methods, you might want to check out the Quests.
In the StatQuest on one-hot, label, and target encoding, we had this data, and we wanted to use favorite color and height to predict if someone loves Troll 2, which is a really terrible movie. Then we applied one-hot, label, and target encoding to favorite color and talked about the pros and cons of using each method.
The problem with basic target encoding is that each row's target value, the thing we want to predict, is used to modify the same row's value in favorite color. Doing this sort of thing is a data science no-no that we call leakage. Leakage results in models that work great with training data but not so well with testing data. So we ultimately described k-fold target encoding, which splits the data into K groups to reduce leakage.
Now, if you read the original CatBoost manuscript, they point out that if we only have a single category, for example, if everyone loved the color blue, and we apply leave-one-out target encoding to the data, then all of the rows with favorite color equal to 0.33 correspond to people who love Troll 2, and all of the rows with favorite color equal to 0.5 correspond to people who do not love Troll 2. That means any tree that starts out by splitting on favorite color less than 0.42 will classify each person in the training dataset perfectly, and that means we have leakage.
Now, to be honest, I think it's a little silly to include a variable that only has a single category in the first place. And even if we did, it is standard practice to convert features with only one or two options, like Loves Troll 2, to zeros and ones rather than use target encoding. So in this example we would just convert blue to zero rather than use target encoding, and this example of leakage seems a little silly to me because it should not happen. However, the creators of CatBoost didn't think it was silly, so they came up with a whole new way to encode categorical features. Bam!
In fact, CatBoost is short for Categorical Boosting, because of how central dealing with categorical variables is to this method. So let's go back to the original dataset and talk about how CatBoost encodes categorical features.
The way CatBoost avoids leakage when encoding categorical variables starts with treating each row of data as if it were being fed into the algorithm sequentially. For example, CatBoost treats the first row with blue as if it is all the data it has received so far, and that means CatBoost ignores all the other rows when target encoding the first occurrence of blue.

Another thing different about CatBoost is the equation. The big difference is that instead of using an overall mean, it uses a defined prior, or guess, that in the examples I saw was set to 0.05. The CatBoost equation also simplifies the denominator by just adding 1 to the number of rows, rather than a weight.
Now, given this equation, CatBoost plugs in values derived from all of the data that came before the current row. And that means that, since we are starting with the first row and no data came before it, the OptionCount, the number of people we have seen before who at this point love blue and Troll 2, is zero. We also plug in 0 for n, because there are no previous rows that have blue as the favorite color. When we do the math, we get 0.05, so we plug in 0.05 for blue in the first row.

Now we work on the second row. Because none of the preceding rows have red as the favorite color, we set the OptionCount to zero again, and again n equals zero, so we replace red in the second row with 0.05. Likewise, we replace green in the third row with 0.05.

However, in the fourth row, things finally change. Because we've seen blue before, in the first row, we use it to calculate the OptionCount. In this case, because the one person who liked blue before also liked Troll 2, the OptionCount is equal to one, and n equals one, so we plug in 0.525 the second time we see blue. Likewise, the second time we see green, we plug in 0.525.
However, the third time we see green, we use the two previous occurrences when calculating the OptionCount. The OptionCount is 2, because both of the two previous people who liked green also liked Troll 2, and because we've already seen two people who like green, n equals two. That means we replace the third occurrence of green with 0.683. Lastly, we replace the third occurrence of blue with 0.35, because only one of the two previous people who liked blue also liked Troll 2.

And thus, this is how CatBoost performs target encoding. And because the order of the data makes a difference in the encoding, this method is called ordered target encoding. Bam!

Lastly, once you're done creating your CatBoost model, the entire dataset is used to target encode the new data that you want to classify.
Note: as I said earlier, I'm not certain the motivation for this method is really justifiable. However, the important thing is that, regardless of the justification, CatBoost works, and it works well. And that's an important lesson about machine learning: machine learning is all about results, and doing whatever it takes to get them. Bam!
Now it's time for some Shameless Self-Promotion. If you want to review statistics and machine learning offline, check out the StatQuest PDF study guides and my book, The StatQuest Illustrated Guide to Machine Learning, at statquest.org. There's something for everyone.

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs, or a t-shirt or a hoodie, or just donate. The links are in the description below. All right, until next time, Quest on!