Machine Learning Tutorial Python - 9 Decision Tree

codebasics
16 Nov 2018 · 14:45

Summary

TLDR: The video explains how to solve a classification problem with the decision tree algorithm. It starts with a simple dataset in which whether a person earns more than $100,000 is predicted from their company, job title, and degree. The decision tree method is illustrated by first splitting the dataset on company, then refining the splits on job title and degree. The importance of the order of feature selection is emphasized, with information gain and Gini impurity as the key metrics for choosing the best splits. The script then guides viewers through the practical steps of implementing a decision tree in Python using a Jupyter notebook, including data preparation, encoding categorical variables, and training the model. The tutorial concludes with an exercise in which participants apply the decision tree algorithm to a real-world dataset, predicting the survival of passengers on the Titanic based on various attributes.

Takeaways

  • 📈 **Decision Tree Algorithm**: Used for classification problems where logistic regression might not be suitable due to the complexity of the dataset.
  • 🌳 **Building a Decision Tree**: In complex datasets, decision trees help in splitting the data iteratively to create decision boundaries.
  • 💼 **Dataset Example**: The example dataset predicts if a person's salary is over $100,000 based on company, job title, and degree.
  • 🔍 **Feature Selection**: The order in which features are split in a decision tree significantly impacts the algorithm's performance.
  • 📊 **Information Gain**: Choosing the attribute that provides the highest information gain at each split is crucial for building an effective decision tree.
  • 🔢 **Entropy and Impurity**: Entropy measures the randomness in a sample; low entropy indicates a 'pure' subset. Gini impurity is another measure used in decision trees.
  • 🔧 **Data Preparation**: Convert categorical data into numerical form using label encoding before training a machine learning model.
  • 📚 **Label Encoding**: Use `LabelEncoder` from `sklearn.preprocessing` to convert categorical variables into numbers; a short sketch follows this list.
  • 🤖 **Model Training**: Train the decision tree classifier using the encoded input features and the target variable.
  • 🧐 **Model Prediction**: Predict outcomes using the trained model by supplying new input data.
  • 📝 **Exercise**: Practice by working on a provided dataset (e.g., Titanic dataset) to predict outcomes such as survival rates.
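
As a quick illustration of the `LabelEncoder` takeaway above, a minimal sketch using company names from the video's dataset:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes get integer codes in sorted (alphabetical) order
print(le.fit_transform(["google", "abc pharma", "facebook", "google"]))
# -> [2 0 1 2]
print(list(le.classes_))
# -> ['abc pharma', 'facebook', 'google']
```

This matches the encodings seen later in the video: ABC Pharma as 0, Facebook as 1, Google as 2.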

Q & A

  • What is the main purpose of using a decision tree algorithm in classification problems?

    -The main purpose of using a decision tree algorithm in classification problems is to create a model that predicts the membership of a given data sample in a specific category. It is particularly useful when dealing with complex datasets that cannot be easily classified with a single decision boundary, as it can iteratively split the dataset to come up with decision boundaries.

  • How does a decision tree algorithm handle complex datasets?

    -A decision tree algorithm handles complex datasets by splitting the dataset into subsets based on the feature values. It does this recursively until the tree is fully grown, creating a hierarchy of decisions that leads to the classification of the data.

  • What is the significance of the order in which features are selected for splitting in a decision tree?

    -The order in which features are selected for splitting significantly impacts the performance of the decision tree algorithm. The goal is to select features that result in the highest information gain at each split, which helps in creating a more accurate and efficient decision tree.

  • What is entropy and how is it related to decision tree algorithms?

    -Entropy is a measure of randomness or impurity in a dataset. In the context of decision tree algorithms, lower entropy indicates a more 'pure' subset, one in which the samples predominantly belong to a single class. The decision tree aims to maximize information gain, which is the reduction in entropy achieved by a split.
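
For reference, the standard entropy formula (a textbook definition, not derived in the video) for a node with class proportions p_i:

```latex
H(S) = -\sum_{i} p_i \log_2 p_i
```

A 50/50 binary split gives H = 1 and a pure node gives H = 0, matching the examples in the video.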

  • What is the Gini impurity and how does it differ from entropy in decision trees?

    -Gini impurity is another measure used to estimate the purity of a dataset in a decision tree. It is calculated based on the probability of incorrectly classifying a randomly chosen element from the dataset. Gini impurity and entropy both measure impurity, but they do so using different formulas and scales.
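
The standard Gini formula, for comparison (again a textbook definition):

```latex
G(S) = 1 - \sum_{i} p_i^2
```

For a two-class node, Gini peaks at 0.5 on a 50/50 mix while entropy peaks at 1.0, which is the difference in scale mentioned above.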

  • How does one prepare the data for training a decision tree classifier?

    -To prepare the data for training a decision tree classifier, one first separates the target variable from the independent variables. Categorical features are then encoded into numerical values using techniques such as label encoding. After encoding, the original text label columns are dropped, leaving only numerical data for the model to process.
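
A sketch of those steps on a stand-in for the tutorial's dataset (the column names are assumptions based on the video):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in rows for the tutorial's CSV; names and values assumed from the video
df = pd.DataFrame({
    "company": ["google", "facebook", "abc pharma"],
    "job": ["sales executive", "business manager", "computer programmer"],
    "degree": ["bachelors", "masters", "bachelors"],
    "salary_more_than_100k": [0, 1, 0],
})

target = df["salary_more_than_100k"]                       # target variable
inputs = df.drop("salary_more_than_100k", axis="columns")  # independent variables

# Encode each categorical column, then drop the original text columns
for col in ["company", "job", "degree"]:
    inputs[col + "_n"] = LabelEncoder().fit_transform(inputs[col])
inputs_n = inputs.drop(["company", "job", "degree"], axis="columns")
print(inputs_n.head())
```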

  • What is the role of the 'fit' method in training a decision tree classifier?

    -The 'fit' method is used to train the decision tree classifier. It takes the independent variables and the target variable as inputs and learns the decision rules by finding the best splits based on the selected criterion, such as Gini impurity or entropy.
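
A minimal sketch of that call; `criterion` is a real `DecisionTreeClassifier` parameter, with `"gini"` as the default and `"entropy"` as the alternative:

```python
from sklearn import tree

# Toy label-encoded features: company, job, degree (values are placeholders)
X = [[2, 2, 1], [2, 0, 1], [1, 2, 0], [0, 1, 0]]
y = [0, 1, 1, 0]

model = tree.DecisionTreeClassifier(criterion="entropy")  # default is "gini"
model.fit(X, y)  # learns split rules that best reduce impurity
```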

  • How can one evaluate the performance of a decision tree classifier?

    -The performance of a decision tree classifier can be evaluated by using metrics such as accuracy, which is the proportion of correct predictions out of the total number of cases. In the script, the score of the model is mentioned as '1', indicating a perfect fit on the training data. However, for real-life complex datasets, the score would typically be less than one.
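
In scikit-learn this is the `score` method, which returns mean accuracy; a toy sketch:

```python
from sklearn import tree

X, y = [[2, 2], [2, 0], [1, 2], [0, 1]], [0, 1, 1, 1]
model = tree.DecisionTreeClassifier().fit(X, y)

# Scoring on the training data itself is optimistic: a fully grown tree
# can memorize a small training set and score a perfect 1.0.
print(model.score(X, y))  # -> 1.0
```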

  • What is the importance of using a test set when training machine learning models?

    -Using a test set is crucial for evaluating the generalization performance of a machine learning model. It helps to estimate how well the model will perform on unseen data. Ideally, the dataset should be split into a training set and a test set, with common ratios being 80:20 or 70:30.
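
A typical split with scikit-learn's `train_test_split` (here 80:20):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]         # placeholder features
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # 80:20 split, reproducible
)
```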

  • How does one make predictions using a trained decision tree classifier?

    -To make predictions using a trained decision tree classifier, one supplies the model with new input data that has been preprocessed and encoded in the same way as the training data. The model then uses the learned decision rules to classify the input data and provide a prediction.
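
A sketch using the encodings from the video's run (Google = 2, sales executive = 2, masters = 1; the codes depend on the fitted encoders):

```python
from sklearn import tree

X = [[2, 2, 1], [2, 0, 1], [1, 2, 0]]  # toy label-encoded training rows
y = [0, 1, 1]
model = tree.DecisionTreeClassifier().fit(X, y)

# predict() expects a 2D array: one row per sample, even for a single query
print(model.predict([[2, 2, 1]]))  # -> [0], i.e. salary not above $100k
```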

  • What is the exercise provided at the end of the script for further practice?

    -The exercise involves using a Titanic dataset to predict the survival rate of passengers based on features such as class, sex, age, and fare. The task is to ignore certain columns, use the remaining ones to predict survival, and post the model's score as a comment for verification.
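
A possible starting point, assuming the standard Titanic column names (`Pclass`, `Sex`, `Age`, `Fare`, `Survived`); the names in the provided CSV may differ:

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")  # path and column names are assumptions

inputs = df[["Pclass", "Sex", "Age", "Fare"]].copy()
inputs["Sex"] = inputs["Sex"].map({"male": 1, "female": 2})  # simple encoding
inputs["Age"] = inputs["Age"].fillna(inputs["Age"].mean())   # Age has gaps

target = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    inputs, target, test_size=0.2, random_state=10
)
model = tree.DecisionTreeClassifier().fit(X_train, y_train)
print(model.score(X_test, y_test))  # post this score in the comments
```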

  • What is the importance of practicing with exercises after learning a new machine learning concept?

    -Practicing with exercises is essential for solidifying understanding and gaining hands-on experience with the concept. It allows learners to apply the theory to real or simulated datasets, troubleshoot issues, and improve their skills in using machine learning algorithms.

Outlines

00:00

🌟 Introduction to Decision Tree Algorithm

This paragraph introduces the concept of using a decision tree algorithm for classification problems. It explains that while logistic regression might be suitable for simpler datasets, decision trees are more appropriate for complex datasets that require multiple splits to define decision boundaries. The example given involves predicting whether a person's salary exceeds $100,000 based on their company, job title, and degree. The paragraph also touches on the idea of building a mental decision tree and emphasizes the importance of the order in which features are split, which impacts the algorithm's performance. It concludes by raising the question of how to select the ordering of features, which is crucial for the decision tree's effectiveness.

05:03

📊 Data Preprocessing for Decision Trees

This paragraph focuses on the initial steps of preparing data for a decision tree algorithm. It outlines the process of separating the dataset into target and independent variables, and the necessity of converting categorical data into numerical form using label encoding. The speaker demonstrates how to use the LabelEncoder from the sklearn.preprocessing module to encode categorical columns and then drop the original categorical columns, leaving only numerical data for the model. The paragraph also briefly mentions the importance of splitting the dataset into training and test sets, although for simplicity, this step is omitted in the example.

10:04

🔮 Training the Decision Tree Model and Predictions

The third paragraph describes training a decision tree classifier on the prepared numerical data. It covers importing the tree module from the scikit-learn library and creating the decision tree model, which is then trained on the input features and the target variable. The speaker discusses using the model to make predictions, noting that accuracy on the training data is expected to be high but will typically be lower on unseen, complex datasets. An example prediction is made for a sales executive at Google with a master's degree, and the outcome is explained. The paragraph concludes with a disclaimer about the hypothetical nature of the dataset and an encouragement for viewers to practice on a provided Titanic dataset, predicting passenger survival.

Keywords

💡Decision Tree Algorithm

A decision tree algorithm is a type of machine learning model that uses a tree-like structure to make decisions. It recursively splits the data into subsets based on feature values, aiming to maximize the information gain at each split. In the video, it is used to classify whether a person's salary is more than $100,000 based on their company, job title, and degree.

💡Logistic Regression

Logistic regression is a statistical method for modeling a binary (dichotomous) outcome. It attempts to draw a decision boundary that separates the classes. The video mentions that logistic regression is better suited to simpler datasets where a single decision boundary, such as a line, can separate the classes.

💡Information Gain

Information gain is a measure used in decision tree algorithms to decide how to split the data at each node. It quantifies the amount of uncertainty (or impurity) reduced by a partition. The video explains that choosing attributes that result in high information gain at each split is crucial for the performance of the decision tree.
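
The textbook definition, for reference (S is the parent set, S_v the subset where attribute A takes value v):

```latex
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```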

💡Entropy

Entropy, in the context of decision trees, measures the randomness or impurity of a sample. Lower entropy indicates that the samples are more 'pure', i.e. belong largely to a single class. The video uses entropy to illustrate information gain when splitting the dataset on different attributes.

💡Gini Impurity

Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. The video mentions Gini impurity as a criterion that can be used by decision tree algorithms, alongside entropy, to determine the best splits.

💡Label Encoding

Label encoding is the process of converting categorical variables into a numerical format. This is necessary because machine learning algorithms can only work with numerical data. In the video, label encoding is used to convert the categorical data of company, job, and degree into numbers that can be processed by the decision tree algorithm.

💡Feature Selection

Feature selection is the process of choosing the most relevant variables or features from a dataset that contribute the most to the predictive power of a model. The video emphasizes the importance of the order in which features are selected, as it impacts the performance of the decision tree algorithm.

💡Data Splitting

Data splitting involves dividing a dataset into separate subsets, typically for training and testing a machine learning model. The video simplifies this process by not explicitly splitting the data but mentions that ideally, one should use a training set and a test set to evaluate the model's performance.

💡Model Training

Model training is the process of teaching a machine learning model to make predictions or decisions by feeding it data. In the video, the decision tree model is trained using the 'fit' method after the dataset has been prepared by label-encoding the categorical columns.

💡Predictive Modeling

Predictive modeling involves using data and machine learning algorithms to predict outcomes. The video demonstrates predictive modeling by using a decision tree to predict if a person's salary is more than $100,000, based on their company, job title, and degree.

💡Titanic Dataset

The Titanic dataset is a famous dataset used for machine learning exercises, containing information about the passengers on the Titanic, including whether they survived or not. In the video, it is suggested as an exercise for the viewers to use a decision tree to predict the survival rate based on attributes like class, sex, age, and fare.

Highlights

Decision tree algorithm is used to solve a classification problem where logistic regression might not suffice due to the complexity of the dataset.

Decision trees are effective in handling complex datasets by iteratively splitting the data to create decision boundaries.

Humans naturally build a decision tree in their minds when solving classification problems, which is mirrored in the algorithmic approach.

The dataset used in the example predicts if a person's salary is over $100,000 based on their company, job title, and degree.

The order in which attributes are split in a decision tree significantly impacts the performance of the algorithm.

High information gain is sought at every split of the decision tree to improve the algorithm's performance.

Entropy is a measure of randomness in a sample and is used to determine the purity of subsets in the decision tree.

Gini impurity is another measure of impurity in a dataset, similar in purpose to entropy.

The Jupyter Notebook is used for demonstrating the implementation of a decision tree algorithm.

Data is divided into target variables and independent variables before being fed into the machine learning model.

Label encoding is used to convert categorical data into numerical form for machine learning algorithms.

The decision tree classifier is trained using the 'tree' module from the 'sklearn' library.

The 'fit' method is used to train the model, and the 'predict' method is used for making predictions.

The choice between Gini impurity and entropy as a splitting criterion can be based on mathematical understanding or the 'sklearn' library's defaults.

The dataset used for training the model is also used for prediction, resulting in a score of 1, which may not be the case with more complex datasets.

A disclaimer is provided that the dataset used in the example is fabricated, and real-world data might yield different results.

An exercise is provided using a Titanic dataset to predict survival rates based on class, sex, age, and fare paid.

The 'survived' column in the Titanic dataset is the target variable for the exercise.

The importance of practicing the concept through exercises is emphasized for better understanding and learning.

Transcripts

00:00

We are going to solve a classification problem using the decision tree algorithm today. When you have a data set like this, it is easy to draw a decision boundary using logistic regression. But if your data set is a little complex, like this one, you cannot just draw a single line; you might have to split your data set again and again to come up with the decision boundaries, and this is what the decision tree algorithm does for you. We will use this particular data set, where you try to predict whether a person's salary is more than $100,000 based on the company, the job title, and the degree that the person has. When you look at this data set and give it to any human being to solve, they will naturally try to build a decision tree in their brain. So first you will split the data set using the company, and here you can see what happened: if your company is Facebook, then no matter what your degree or job title is, the answer is always yes; you are always getting $100,000 per annum. I mean, they have a lot of money right now, their stock is going up, revenue is going up, so they don't mind paying such a high salary.

01:16

But in the other two cases you have mixed samples, so you need to ask a further question. For example, for Google I will ask what the job position is, and based on that I reach further conclusions, such as: if it is a business manager, the answer is always yes; for a sales executive, the answer is no; for a computer programmer, I again need to split my decision tree. You can do this iteratively to come up with a tree like this. Now, this sounded very simple, but in real life you will not have three attributes; you will probably have 50 attributes, and it matters in which order you split the tree. Right now we chose company first, then job title, and then the degree. The order in which you select these attributes is going to impact the performance of your algorithm. So the question arises: how exactly do you select the ordering of these features?

02:11

So let's look at our example. Here we used company first. We might have used the degree instead of the company, in which case our data set would be split like this. Now observe carefully: on the left-hand side we are getting a somewhat pure subset. What I mean by a pure subset is that in the case of Facebook all the samples are green, so this has a very low entropy. If you remember the definition of entropy from your school days, it is basically a measure of the randomness in your sample. Here there is no randomness, everything is green, six green samples and zero red, hence low entropy. Here there is some entropy, but still the majority of the samples are red. Whereas on the right-hand side, for this case, four red and four green means there is total randomness; it is 50/50, hence my entropy is one. Here it is a little better and the entropy is a little lower. So overall I am thinking that if I use company, as shown on the left-hand side, I will have high information gain, whereas on the right-hand side I have low information gain. Hence you should use an approach which gives you high information gain at every split; that is why we chose company as the first attribute, and for the further splits you can also use the high-information-gain criterion to divide the data further.

03:50

There is another term that you hear often when you are dealing with decision trees, which is Gini impurity. This is nothing but the impurity in your data set. For example, when I split my sample like this, at the bottom most of the samples are red whereas one is green, so it is almost pure, but there is a little bit of impurity. It is sort of similar to entropy. I am not going to go into the mathematics too much; you can read articles on it. We will jump straight into writing code.

04:28

I launched my Jupyter notebook and loaded the same data set into my data frame; you can see I have the same CSV file that I am loading into my data frame. The first step, once my data frame is ready, is to divide it between the target variable and the independent variables. So I will call the independent-variables data frame "inputs", and I will just drop the target column (this is my target variable), saying axis equals 'columns'. Once I execute this, my inputs look like this, without the last column, which is my answer, and my target looks like this, which is my last column.

05:42

By this point you all know that machine learning algorithms can only work on numbers; they cannot understand labels. So what we have to do is convert these three columns into numbers, and one of the ways is to use the label encoder. So from sklearn.preprocessing I will import LabelEncoder. (I hit Tab and it was autocompleting, but it was slow; if you hit Tab again it is not working. Sometimes it just doesn't work. It's funny.) Once I import LabelEncoder, I am going to create an object of this class, and since I have three columns I have to create three different objects. The first is le_company, the second one is le_job, and then le_degree. Once you have these three, in your inputs data frame you create one more column (this is how you create an extra column in a data frame): you call the fit_transform method on your company column, and you can do the same thing for your job and degree columns as well. So here you have job, and then degree. Once you do that and print head(), this is how your data frame is going to look: it has three extra columns, and we have label-encoded your label columns into numbers.

08:06

The next step is to drop those label columns, so I am going to create a new data frame here and just say drop; you can drop multiple columns at the same time, with axis equal to 'columns'. When you look at your inputs data frame now, it dropped all the label columns, and all you have is numbers. So Google is encoded as the number two, the second one was ABC Pharma, which was encoded as zero, and Facebook is encoded as one. The same goes for job title and degree: it just assigns different numbers to different labels.

09:06

Now we are ready to train our classifier. As usual, I am going to import a module; for a decision tree you import tree from your sklearn library, and then your model is nothing but tree.DecisionTreeClassifier. Then you can train your model: you call fit here, and I am going to pass inputs_n and my target variable. So it trained my model. Now, I am not using train-test split here, just to keep things simple, but ideally you should split your data set into a training set and a test set, 80:20, 70:30, whatever ratio you prefer; I am just keeping it very simple here. It uses Gini impurity as the criterion by default; you can change it to entropy as well. Again, I am not going to go into the math in much detail; you can google it to learn the difference between Gini and entropy. These details are abstracted away by the sklearn library, so you are fine, although knowing the math always helps in deciding what kind of criterion you should choose for a given problem, so I still suggest going through that.

10:43

All right, now my model is ready to predict. The first thing I am going to do is check my score, and the way you check your score is by supplying your input and target data sets. Now pause this video for a moment and tell me what your score is going to be. The score is going to be 1, because I am using the same data set that I used for training, and my data set was also very simple, so I was expecting it to be very accurate in its predictions; hence the score is one. In real life, when you have a complex data set, your score will be less than one.

11:31

Now let's do some prediction. So I am going to call predict. What are we going to predict? Let me predict the salary of a person working at Google whose job is sales executive and who has a master's degree. So that is the number-two row: 2, 2, 1. All right, it is expecting a 2D array (usually you supply a data frame), so I am just going to do this, and it says zero, meaning that the person working at Google as a sales executive with a master's degree is not going to have a salary of more than $100,000. By the way, just a disclaimer: I made this data set up. In reality a Google sales executive might be getting much more than $100,000, but I just made it up, so that is a little disclaimer. How about a business manager? The business manager's label-encoded number is zero, and the prediction is one, so his salary is more than $100,000. We are doing perfectly all right here. You can take this model and do further predictions using the trained model by calling the predict method on it.

13:15

Now the most important part, which is the exercise. I expect all of you to work on the exercise once you learn this concept, because just by watching the tutorial you are not going to learn anything; you must do an exercise on your own. I have a Titanic data set showing the survival of passengers in the Titanic crash. This is a real data set, and you can get the CSV file by clicking on the link in the video description below. That link contains the Jupyter notebook that was used in this tutorial, and it has an exercise subfolder; within that you have titanic.csv. Here you should ignore all the red columns and use the remaining columns to predict the survival rate. The 'survived' column is your target variable, and you have to predict the survival of a passenger based on the class, the sex, the age, and the fare that the passenger paid before boarding the Titanic. So that is what you have to do: come up with the score of your model and post your score as a comment below, and I will verify your answer and we will see how well you do with it. All right, that's all I had for this tutorial. Thank you very much for watching. Bye!

Related Tags
Decision Trees, Machine Learning, Data Classification, Salary Prediction, Information Gain, Gini Impurity, Entropy, Label Encoding, Model Training, Prediction Accuracy, Titanic Dataset, Exercise Challenge