Machine Learning Tutorial Python - 9 Decision Tree
Summary
TLDR: The video explains how to solve a classification problem with the decision tree algorithm. It starts with a simple dataset in which whether a person earns over $100,000 is predicted from their company, job title, and degree. The decision tree method is illustrated by first splitting the dataset on company, then refining the splits on job title and degree. The importance of the order of feature selection is emphasized, with information gain and gini impurity as the key metrics for choosing the best splits. The video then walks through implementing a decision tree in Python in a Jupyter notebook, including data preparation, encoding categorical variables, and training the model. It concludes with an exercise in which viewers apply the decision tree algorithm to a real-world dataset, predicting the survival of passengers on the Titanic from various attributes.
Takeaways
- 📈 **Decision Tree Algorithm**: Used for classification problems where logistic regression might not be suitable due to the complexity of the dataset.
- 🌳 **Building a Decision Tree**: In complex datasets, decision trees help in splitting the data iteratively to create decision boundaries.
- 💼 **Dataset Example**: The example dataset predicts if a person's salary is over $100,000 based on company, job title, and degree.
- 🔍 **Feature Selection**: The order in which features are split in a decision tree significantly impacts the algorithm's performance.
- 📊 **Information Gain**: Choosing the attribute that provides the highest information gain at each split is crucial for building an effective decision tree.
- 🔢 **Entropy and Impurity**: Entropy measures the randomness in a sample; low entropy indicates a 'pure' subset. Gini impurity is another measure used in decision trees.
- 🔧 **Data Preparation**: Convert categorical data into numerical form using label encoding before training a machine learning model.
- 📚 **Label Encoding**: Use `LabelEncoder` from `sklearn.preprocessing` to convert categorical variables into numbers.
- 🤖 **Model Training**: Train the decision tree classifier using the encoded input features and the target variable.
- 🧐 **Model Prediction**: Predict outcomes using the trained model by supplying new input data.
- 📝 **Exercise**: Practice by working on a provided dataset (e.g., Titanic dataset) to predict outcomes such as survival rates.
Q & A
What is the main purpose of using a decision tree algorithm in classification problems?
-The main purpose of using a decision tree algorithm in classification problems is to create a model that predicts the membership of a given data sample in a specific category. It is particularly useful when dealing with complex datasets that cannot be easily classified with a single decision boundary, as it can iteratively split the dataset to come up with decision boundaries.
How does a decision tree algorithm handle complex datasets?
-A decision tree algorithm handles complex datasets by splitting the dataset into subsets based on the feature values. It does this recursively until the tree is fully grown, creating a hierarchy of decisions that leads to the classification of the data.
What is the significance of the order in which features are selected for splitting in a decision tree?
-The order in which features are selected for splitting significantly impacts the performance of the decision tree algorithm. The goal is to select features that result in the highest information gain at each split, which helps in creating a more accurate and efficient decision tree.
What is entropy and how is it related to decision tree algorithms?
-Entropy is a measure of randomness or impurity in a dataset. In the context of decision tree algorithms, lower entropy indicates a 'purer' subset, one whose samples predominantly belong to a single class. The decision tree aims to maximize information gain, which is the reduction in entropy after a split.
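The relationship between entropy and purity can be sketched in a few lines of Python (an illustration added here, not code from the video):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = (labels.count(c) / n for c in set(labels))
    return -sum(p * math.log2(p) for p in probs)

# A 50/50 split is totally random: entropy is 1.0.
print(entropy(["red"] * 4 + ["green"] * 4))
# A pure subset has zero entropy.
print(entropy(["green"] * 6))
```

Information gain for a candidate split is then the parent set's entropy minus the weighted average entropy of the child subsets.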
What is the Gini impurity and how does it differ from entropy in decision trees?
-Gini impurity is another measure used to estimate the purity of a dataset in a decision tree. It is calculated based on the probability of incorrectly classifying a randomly chosen element from the dataset. Gini impurity and entropy both measure impurity, but they do so using different formulas and scales.
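As a small illustration of the difference (added here, not from the video), Gini impurity is 1 minus the sum of squared class probabilities, i.e. the chance of mislabeling a random sample drawn from the subset:

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# A pure subset has impurity 0.0; a 50/50 binary split peaks at 0.5
# (whereas entropy for the same split would be 1.0 -- different scales).
print(gini(["green"] * 6))
print(gini(["red"] * 4 + ["green"] * 4))
```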
How does one prepare the data for training a decision tree classifier?
-To prepare the data for training a decision tree classifier, one must first separate the target variable from the independent variables. Then, categorical features are encoded into numerical values using techniques like label encoding. After encoding, any additional label columns used for encoding are dropped, leaving only numerical data for the model to process.
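A minimal sketch of this preparation with pandas and scikit-learn; the rows and column names below are made up to mirror the video's salary dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the video's dataset.
df = pd.DataFrame({
    "company": ["google", "google", "facebook", "abc pharma"],
    "job": ["sales executive", "business manager",
            "sales executive", "computer programmer"],
    "degree": ["bachelors", "masters", "bachelors", "masters"],
    "salary_more_than_100k": [0, 1, 1, 0],
})

# Separate the target from the independent variables.
inputs = df.drop("salary_more_than_100k", axis="columns")
target = df["salary_more_than_100k"]

# One LabelEncoder per categorical column, as in the tutorial.
for col in ["company", "job", "degree"]:
    inputs[col + "_n"] = LabelEncoder().fit_transform(inputs[col])

# Drop the original label columns, leaving only numbers.
inputs_n = inputs.drop(["company", "job", "degree"], axis="columns")
print(inputs_n)
```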
What is the role of the 'fit' method in training a decision tree classifier?
-The 'fit' method is used to train the decision tree classifier. It takes the independent variables and the target variable as inputs and learns the decision rules by finding the best splits based on the selected criterion, such as Gini impurity or entropy.
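A minimal training sketch, assuming the features have already been label encoded as integers (the values below are toy data):

```python
from sklearn import tree

# Toy encoded features: [company, job, degree] as integers.
X = [[2, 2, 1], [2, 0, 2], [1, 2, 1], [0, 1, 2]]
y = [0, 1, 1, 0]   # salary more than 100k?

# criterion defaults to "gini"; pass criterion="entropy" to split
# by information gain instead.
model = tree.DecisionTreeClassifier(criterion="gini")
model.fit(X, y)

# Scored on its own training data, a fully grown tree fits perfectly.
print(model.score(X, y))
```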
How can one evaluate the performance of a decision tree classifier?
-The performance of a decision tree classifier can be evaluated by using metrics such as accuracy, which is the proportion of correct predictions out of the total number of cases. In the script, the score of the model is mentioned as '1', indicating a perfect fit on the training data. However, for real-life complex datasets, the score would typically be less than one.
What is the importance of using a test set when training machine learning models?
-Using a test set is crucial for evaluating the generalization performance of a machine learning model. It helps to estimate how well the model will perform on unseen data. Ideally, the dataset should be split into a training set and a test set, with common ratios being 80:20 or 70:30.
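With scikit-learn the split is a single call; a sketch with toy data, where `test_size=0.2` gives the 80:20 ratio mentioned above:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

# test_size=0.2 reserves 20% of the samples for evaluation;
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```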
How does one make predictions using a trained decision tree classifier?
-To make predictions using a trained decision tree classifier, one supplies the model with new input data that has been preprocessed and encoded in the same way as the training data. The model then uses the learned decision rules to classify the input data and provide a prediction.
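A sketch of prediction; the integer encodings below are assumptions standing in for whatever the label encoder produced at training time, and `predict` expects a 2-D array (a list of samples):

```python
from sklearn import tree

# Toy encoded training data: [company, job, degree].
X = [[2, 2, 1], [2, 0, 1], [1, 2, 0], [0, 1, 1]]
y = [0, 1, 1, 0]
model = tree.DecisionTreeClassifier().fit(X, y)

# New samples must be 2-D and encoded exactly like the training data.
sales_exec = model.predict([[2, 2, 1]])  # e.g. a sales executive (assumed codes)
manager = model.predict([[2, 0, 1]])     # e.g. a business manager (assumed codes)
print(sales_exec, manager)
```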
What is the exercise provided at the end of the script for further practice?
-The exercise involves using a Titanic dataset to predict the survival rate of passengers based on features such as class, sex, age, and fare. The task is to ignore certain columns, use the remaining ones to predict survival, and post the model's score as a comment for verification.
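A possible skeleton for the exercise; in practice you would load the real file, e.g. `pd.read_csv("titanic.csv")`, while a tiny stand-in frame is used here so the sketch runs on its own, and the column names (Pclass, Sex, Age, Fare, Survived) are assumptions:

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Hypothetical stand-in rows; replace with pd.read_csv("titanic.csv").
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "male",
            "male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 27.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.1, 13.0, 11.13, 51.86, 21.08],
    "Survived": [0, 1, 1, 0, 0, 1, 1, 0],
})

X = df[["Pclass", "Sex", "Age", "Fare"]].copy()
X["Sex"] = X["Sex"].map({"male": 1, "female": 0})  # encode labels as numbers
X["Age"] = X["Age"].fillna(X["Age"].median())      # fill missing ages

X_train, X_test, y_train, y_test = train_test_split(
    X, df["Survived"], test_size=0.25, random_state=1)

model = tree.DecisionTreeClassifier().fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)  # the score you would post as a comment
```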
What is the importance of practicing with exercises after learning a new machine learning concept?
-Practicing with exercises is essential for solidifying understanding and gaining hands-on experience with the concept. It allows learners to apply the theory to real or simulated datasets, troubleshoot issues, and improve their skills in using machine learning algorithms.
Outlines
🌟 Introduction to Decision Tree Algorithm
This paragraph introduces the concept of using a decision tree algorithm for classification problems. It explains that while logistic regression might be suitable for simpler datasets, decision trees are more appropriate for complex datasets that require multiple splits to define decision boundaries. The example given involves predicting whether a person's salary exceeds $100,000 based on their company, job title, and degree. The paragraph also touches on the idea of building a mental decision tree and emphasizes the importance of the order in which features are split, which impacts the algorithm's performance. It concludes by raising the question of how to select the ordering of features, which is crucial for the decision tree's effectiveness.
📊 Data Preprocessing for Decision Trees
This paragraph focuses on the initial steps of preparing data for a decision tree algorithm. It outlines the process of separating the dataset into target and independent variables, and the necessity of converting categorical data into numerical form using label encoding. The speaker demonstrates how to use the LabelEncoder from the sklearn.preprocessing module to encode categorical columns and then drop the original categorical columns, leaving only numerical data for the model. The paragraph also briefly mentions the importance of splitting the dataset into training and test sets, although for simplicity, this step is omitted in the example.
🔮 Training the Decision Tree Model and Predictions
The third paragraph describes the process of training a decision tree classifier using the prepared numerical data. It covers the importation of the decision tree module from the scikit-learn library and the creation of the decision tree model. The model is trained using the input features and the target variable. The speaker then discusses the use of the model to make predictions, highlighting that the accuracy of predictions on the training data is expected to be high but may be lower for unseen, complex datasets. An example prediction is made for a sales executive at Google with a master's degree, and the outcome is explained. The paragraph concludes with a disclaimer about the hypothetical nature of the dataset and an encouragement for viewers to practice on their own with a provided Titanic dataset, aiming to predict passenger survival rates.
Keywords
💡Decision Tree Algorithm
💡Logistic Regression
💡Information Gain
💡Entropy
💡Gini Impurity
💡Label Encoding
💡Feature Selection
💡Data Splitting
💡Model Training
💡Predictive Modeling
💡Titanic Dataset
Highlights
Decision tree algorithm is used to solve a classification problem where logistic regression might not suffice due to the complexity of the dataset.
Decision trees are effective in handling complex datasets by iteratively splitting the data to create decision boundaries.
Humans naturally build a decision tree in their minds when solving classification problems, which is mirrored in the algorithmic approach.
The dataset used in the example predicts if a person's salary is over $100,000 based on their company, job title, and degree.
The order in which attributes are split in a decision tree significantly impacts the performance of the algorithm.
High information gain is sought at every split of the decision tree to improve the algorithm's performance.
Entropy is a measure of randomness in a sample and is used to determine the purity of subsets in the decision tree.
Gini impurity is another term used to describe the impurity in a dataset, which is similar to entropy.
The Jupyter Notebook is used for demonstrating the implementation of a decision tree algorithm.
Data is divided into target variables and independent variables before being fed into the machine learning model.
Label encoding is used to convert categorical data into numerical form for machine learning algorithms.
The decision tree classifier is trained using the 'tree' module from the 'sklearn' library.
The 'fit' method is used to train the model, and the 'predict' method is used for making predictions.
The choice between Gini impurity and entropy as a splitting criterion can be based on mathematical understanding or the 'sklearn' library's defaults.
The dataset used for training the model is also used for prediction, resulting in a score of 1, which may not be the case with more complex datasets.
A disclaimer is provided that the dataset used in the example is fabricated, and real-world data might yield different results.
An exercise is provided using a Titanic dataset to predict survival rates based on class, sex, age, and fare paid.
The 'survived' column in the Titanic dataset is the target variable for the exercise.
The importance of practicing the concept through exercises is emphasized for better understanding and learning.
Transcripts
we are going to solve a classification
problem using decision tree algorithm
today
when you have a data set like this it's
easier to
draw a decision boundary using logistic
regression
but if your data set is little complex
like this
you cannot just draw a single line you
might have to split your data set
again and again to come up with the
decision boundaries
and this is what decision tree algorithm
does for you
we will use this particular data set
where you try to predict
if person's salary is more than 100 000
based on the company his job title and
the degree that he has now
when you look at the data set and when
you give it to any human being to solve
this problem
you will naturally try to build a
decision tree in your brain
so first you will split the data set
using the company
and here you can see what happened is
if your company's facebook no matter
what your degree or job title is your
answer is always yes you are always
getting hundred thousand
dollar per annum i mean they have a lot
of money right now and their stock is
going up revenue is going up so they
don't mind paying such a high salary
but in other two cases uh you have
mixed samples so you need to ask further
question for example for google i will
ask
what is the job position and based on
that
i have further conclusions such as if
it's a business manager
the answer is always yes sales executive
answer is no
computer programmer again i need to
split my decision tree
and you can do this iteratively to come
up with
a tree like this now this sounded very
simple but in real life you will not
have three attributes you will have
probably 50 attributes
and it matters in which order you split
the tree right now we
chose company first then job title and
then the degree
in which order you select these
attributes is going to impact the
performance of your algorithm
so the question arises how do you
exactly select the ordering of these
features
so let's look at our example so here we
used
company first we might have used the
degree instead of company in which case
our data
set would be split like this now observe
carefully
on the left hand side what's happening
is we are getting
a little bit of a pure subset what i
mean by pure subset
is in the case of facebook all the
samples are green
okay so this has a very low entropy
now if you remember the definition of
entropy from your school days
it is basically the measure of
randomness in your sample
here there is no randomness everything
is green
so six green samples zero red hence low
entropy
here there is some entropy but still
majority of the samples are red okay
whereas on the right hand side
for this case four red for green means
there is total
randomness it is 50 50 hence my entropy
is one
okay here it's little better and entropy is
little low
so overall i'm thinking if i
use company as shown on the left hand
side
i will have a high information gain
okay whereas on the right hand side i
have low information gain
hence you should use an approach which
gives you
high information at every split hence we
chose
company as the first attribute and in
the further split also
you can use high information gain criteria
criteria
to divide it further there is another
term that
you hear often when you are dealing with
decision tree which is
gini impurity now this is nothing but
an
impurity in your data set for example
when i split
my sample like this at the bottom
most of the samples are red whereas one
is green
so this is almost pure but there is
little bit of
impurity all right it is sort of similar
to entropy
i'm not going to go into mathematics too
much you can read articles on it
uh we will straight away jump into writing
code
i launched my jupyter notebook and
loaded the same data set into my data
frame you can see
i have the same csv file that i am
loading into my data frame
now the first step is once i have my
data frame ready
i want to divide it in between
the target variable and the independent
variable
so to separate the target variable and
independent variable
i will create a data frame inputs
and i will just
drop see this is my target column okay
target variable
so i'm just going to drop that
and i will say axis is equal to
columns
okay so once i execute this
what's happening is my
input looks like this so it doesn't have
the last column which is my answer and
my target
looks like this which is my last column
now by this point you all know that
machine learning algorithms can only
work on numbers they cannot understand
labels so what we have to do is we have
to
convert these particular columns these
three columns
into numbers and one of the ways is to
use the label encoder
so from sklearn sklearn.preprocessing
i will
all right i hit tab so it was
autocompleting but it was slow
but see if you hit tab again it's not
working
this is how sometimes it might not work it's
funny
all right once i uh import label encoder
i am going to create the object of this
class and
i have three columns so i have to create
like three different objects okay
so first is le_company
the second one is le_job
and then le_degree
once you have these three what you do is
in your inputs data frame
you are creating one more column and
this is how you create extra column in a
data frame
you call fit and transform method on
your company column
and you can do the same thing for
your job and degree column also
so here you have job
your degree
once you do that and when then when you
print head
this is how your data frame is
going to look like it has three extra
column
and we have label encoded your label
columns into numbers
next step is to drop
those label columns so i'm going to
create a new data frame here
and just say drop and you can drop
multiple columns at the same time
axis is equal to columns
and when you look at your
inputs and data frame what it did is
dropped all the label columns now all
you have is numbers
so google got encoded as number two
um the second one was
abc pharma which was encoded as zero
and facebook is encoded as one and same
thing for like
job title and degree it just assigns
different numbers to different labels
now we are ready to train our classifier
so as usual i am going to
import some module now for decision tree
you import the tree from your sklearn
library
and then your model is nothing but tree
dot
decisiontreeclassifier
and then you can now train your model
so you can call fit here
and i'm going to call inputs n and my
target variable
so it trained my model now i'm not using
test train split here
just to keep things simple but ideally
you should
split your data set into training set
and test set
80 20 70 30 whatever ratio you prefer
all right but i'm just keeping it very
simple here
it uses gini impurity as the criterion by
default you can change it to entropy
also again
for math i'm not going to go into very
much detail
you can google it to know the difference
between gini and entropy
uh these details are abstracted by
sklearn library
so you're fine although knowing math
always
helps in terms of what kind of criteria
you should choose for a given problem so
i
i still suggest going through that all
right now my model is ready
to predict so the first thing i'm going
to do
is predict my score all right and the
way you predict your score
is you supply your input and target data
set
now pause this video for a moment and
tell me what is going to be your score
the score is going to be 1
because i'm using the same data set
which i use for training and my data
set was also very simple
so i was expecting that it will be
very accurate with
my prediction
uh hence the score is one in real life
when you have complex data set
your score will be less than one okay so
now let's do some prediction
so i'm going to do predict
all right what are we going to predict
so
let me predict a salary of person
working in google
sales executive is his job and master's
degree
okay so let's see
so that's number two row okay so two two
one
two two one
all right it is expecting 2d array usually
you supply data frame so i'm just going
to
do this and it says zero means
the person who is working in google
sales executive is his job master degree
his salary is not going to be more than
100 000 and by the way just a
disclaimer
i just made up this data set
in reality a google sales executive might
be getting much more than 100 000
but i just made it up all right so
that's a little disclaimer
how about business manager
so business managers number
label encoded number is
zero
so his salary is one all right so we are
doing
perfectly all right here you can
have this model and you can do further
prediction
using the trained model and by calling a
predict method on that
all right now the most important part
which is the exercise so i expect all of
you
to work on the exercise once you learn
this concept because just by watching
the tutorial you are not going to learn
anything
so you must do an exercise on your own
i have a titanic data set so this is
showing the survival rate of passenger
uh in a titanic crash this is the real
data set
and you can get this csv file uh by
clicking on a link in the video
description below
so that link contains the jupyter
notebook that was used in uh
this tutorial and it has exercise
subfolder
within that you have titanic.csv
here you should ignore all the red
columns
and use the remaining columns to predict
the survival rate so here
the survived column is your target
variable
and you have to predict the survival of
a passenger based on the class the sex
age and the fare that he paid before
onboarding titanic ship okay so that's
what you have to do
uh come up with uh the score of your
model and post your score
as a comment below
and i will verify your answer and
we'll see
how well you can do with it all right
that's all i had for this tutorial thank
you very much for watching bye