Machine Learning Tutorial Python - 8 Logistic Regression (Multiclass Classification)

codebasics
21 Sept 2018 · 15:43

Summary

TLDR: This tutorial delves into the application of logistic regression for multi-class classification, exemplified by predicting which party a person will vote for or recognizing handwritten digits. The presenter begins by loading the 'digits' dataset from scikit-learn, which contains 1797 samples of 8x8 images representing digits from 0 to 9. They then demonstrate how to split the dataset into training and test sets using an 80/20 ratio to prevent overfitting. A logistic regression model is trained on the training data, and its accuracy is assessed on the test set, reaching an impressive 96.67%. The tutorial also includes a practical exercise for viewers: apply logistic regression to the Iris flower dataset, which uses features such as petal and sepal length and width to classify three types of iris flowers. The presenter emphasizes the importance of hands-on practice and provides a link to a Jupyter notebook for further learning.
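The end-to-end flow described above can be reproduced with a short scikit-learn sketch. The variable names and the 80/20 split mirror the tutorial; the `max_iter` value is an assumption added here so the solver converges, not something shown in the video.

```python
# Minimal sketch of the workflow summarized above (assumed parameters, not the author's exact notebook code).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()                                   # 1797 samples, 8x8 images flattened to 64 features

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2            # 80/20 split, as in the video
)

model = LogisticRegression(max_iter=5000)                # max_iter raised so the solver converges (assumption)
model.fit(X_train, y_train)

print("accuracy:", model.score(X_test, y_test))          # roughly 0.96-0.97, matching the video
print("prediction for first test sample:", model.predict(X_test[:1]))
```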

Takeaways

  • 📈 **Binary vs. Multi-Class Classification**: The tutorial begins with a recap of binary classification and then moves on to multi-class classification, where the outcome can be one of three or more classes, such as which political party a person will vote for.
  • 🔢 **Handwritten Digit Recognition**: The specific problem tackled is recognizing handwritten digits (0 to 9) using a training set of such characters.
  • 📚 **Using sklearn's Datasets**: The script demonstrates how to use `sklearn.datasets` to load predefined datasets, in this case, the 'digits' dataset comprising 8x8 images of handwritten digits.
  • 🖼️ **Data Representation**: The handwritten digit images are represented as one-dimensional arrays, which can be visualized using `matplotlib` to show the corresponding 8x8 image.
  • 🤖 **Model Training with Logistic Regression**: A logistic regression model is built using the training data, with the `fit` method applied to the training set (`x_train` and `y_train`).
  • ✅ **Model Accuracy Assessment**: The model's accuracy is evaluated using the test set (`x_test` and `y_test`), and the script shows that the model achieved a high accuracy rate of 96.67%.
  • 🔮 **Model Prediction**: The `predict` method is used to make predictions on new data; the script highlights that you must supply the flattened numeric feature data (as a 2-D array) whose index lines up with the corresponding target label (see the sketch after this list).
  • 📊 **Confusion Matrix for Model Evaluation**: A confusion matrix is introduced as a tool to visualize the model's performance, showing where the model is making mistakes in its predictions.
  • 🌟 **Exercise with Iris Dataset**: The tutorial concludes with an exercise for the viewer to practice using the Iris flower dataset, which includes features like petal and sepal length and width, to build and evaluate a logistic regression model.
  • 🔍 **Importance of Data Splitting**: The script emphasizes the importance of splitting data into training and test sets to prevent overfitting and to ensure the model's generalizability.
  • 📝 **Documentation and Code Comments**: The tutorial uses `shift + tab` to show API documentation and includes comments in the code to explain each step, highlighting the importance of understanding the tools and methods used.
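As a concrete illustration of the prediction bullet above, here is a hedged sketch showing why `predict` expects the flattened numeric row (`digits.data[i]`) wrapped in a 2-D array rather than the 8x8 image; index 67 is simply the example index used in the video, and training on the full dataset here is only for illustration.

```python
# Sketch: predicting a single digit; index 67 is the example index from the video.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
model = LogisticRegression(max_iter=5000).fit(digits.data, digits.target)  # trained on all data, for illustration only

sample = digits.data[67]            # 64-element numeric row, same index as digits.target[67]
# model.predict(sample)             # would raise an error: predict expects a 2-D array
print(model.predict([sample]))      # wrap the row in a list -> e.g. array([6])
print(digits.target[67])            # ground-truth label at the same index
```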

Q & A

  • What is the main topic of this tutorial?

    -The main topic of this tutorial is logistic regression for multi-class classification, specifically focusing on recognizing handwritten digits.

  • What is the dataset used in this tutorial for training the logistic regression model?

    -The dataset used is the 'digits' dataset from scikit-learn, which contains 1797 samples of handwritten digits of size 8x8.

  • How is the handwritten digit image data represented in the dataset?

    -The handwritten digit image data is represented as a one-dimensional array of 64 elements, corresponding to an 8x8 image.

  • What is the purpose of splitting the dataset into training and test sets?

    -The purpose of splitting the dataset is to prevent overfitting and to ensure that the model can generalize well to unseen data by testing it against a different set of data.

  • How is the logistic regression model trained in this tutorial?

    -The logistic regression model is trained by calling the 'fit' method with the training data (x_train and y_train).

  • What is the accuracy score of the logistic regression model on the test set?

    -The accuracy score of the logistic regression model on the test set is 96.67 percent.

  • How does the model make predictions on new handwritten digit images?

    -The model makes predictions by calling the 'predict' method with the numeric data of the new handwritten digit images.

  • What is a confusion matrix and how is it used?

    -A confusion matrix is a two-dimensional array that visualizes the performance of a classification model. Each entry counts how often a given true class was predicted as a given class, so the off-diagonal entries highlight exactly where the model's predictions did not match the actual values.

  • What is the exercise given at the end of the tutorial?

    -The exercise involves using the iris flower dataset to build a logistic regression model, calculate its accuracy, and make a few predictions.

  • What are the four features included in the iris flower dataset?

    -The four features included in the iris flower dataset are sepal width, sepal length, petal width, and petal length.

  • How can one visualize the confusion matrix?

    -One can visualize the confusion matrix with matplotlib or seaborn by calling seaborn's heatmap function, which displays the matrix as a color-coded grid (a sketch follows this Q&A section).

  • What is the importance of practicing with the provided dataset and exercise?

    -Practicing with the provided dataset and exercise helps solidify the understanding of logistic regression and machine learning concepts, and is essential for developing expertise in the field.
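The confusion-matrix workflow discussed in the Q&A above can be sketched as follows; the figure size, axis labels, and `max_iter` value are illustrative assumptions rather than the exact notebook code.

```python
# Sketch: building and visualizing a confusion matrix for the digits model (illustrative only).
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

y_predicted = model.predict(X_test)
cm = confusion_matrix(y_test, y_predicted)   # rows = truth, columns = prediction

plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt="d")         # annotated heatmap, as in the video
plt.xlabel("Predicted")
plt.ylabel("Truth")
plt.show()
```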

Outlines

00:00

📘 Introduction to Multi-Class Logistic Regression

This paragraph introduces the concept of multi-class classification using logistic regression, contrasting it with binary classification. The tutorial focuses on a specific problem: recognizing handwritten digits (0 to 9). The speaker outlines the process of using a training set of handwritten digit characters to build a logistic regression model. The training set comes from scikit-learn's built-in 'digits' dataset, loaded via sklearn.datasets' load_digits, which contains 1797 8x8 images. The speaker demonstrates how to load the dataset and explore its contents, including the data and images arrays and how they correspond to each other. The target variable is also discussed, showing how it labels each image with the correct digit.
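A hedged sketch of the exploration steps this paragraph describes, assuming the standard `load_digits` attributes (`data`, `images`, `target`, `target_names`):

```python
# Sketch: exploring the digits dataset as described above.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.data.shape)        # (1797, 64): each 8x8 image flattened to 64 numbers
print(digits.data[0])           # first sample as a 1-D numeric array
print(digits.target[0:5])       # [0 1 2 3 4] -> labels for the first five images
print(digits.target_names)      # [0 1 2 ... 9]

plt.gray()
for i in range(5):
    plt.matshow(digits.images[i])   # the 8x8 image corresponding to digits.data[i]
plt.show()
```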

05:02

📚 Data Preparation and Model Training

The second paragraph details the process of data preparation for model training. The speaker explains how to use the train_test_split function from sklearn.model_selection to divide the dataset into training and test samples. The purpose of this split is to prevent overfitting by ensuring the model is tested against data it hasn't seen before. The speaker specifies that 20% of the samples should be used as the test set, with the remaining 80% used for training. After splitting the data, the speaker creates a logistic regression model and fits it on the training data and target variable. The model's accuracy is then assessed using the test data, and the speaker finds that the model performs well with an accuracy of 96.67%. The paragraph concludes with the speaker demonstrating how to make predictions using the trained model.
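As this paragraph notes, `score` compares the model's predictions on the test set against the true labels; the equivalence can be sketched like this (split ratio and `max_iter` are assumptions carried over from the earlier examples):

```python
# Sketch: for a classifier, model.score() is accuracy, i.e. the fraction of correct test predictions.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

accuracy_from_score = model.score(X_test, y_test)
accuracy_by_hand = np.mean(model.predict(X_test) == y_test)   # same number computed manually

print(accuracy_from_score, accuracy_by_hand)                  # e.g. ~0.9667, as in the video
```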

10:04

🔍 Evaluating Model Performance with a Confusion Matrix

In this paragraph, the speaker discusses evaluating the logistic regression model's performance using a confusion matrix. The confusion matrix visualizes the performance of a classification model by showing true versus predicted classifications. The speaker explains how to obtain predicted values for the test set and then build a confusion matrix from those predictions and the true values. The matrix is visualized with a seaborn heatmap, which gives a clear picture of where the model performs well and where it makes mistakes: the diagonal entries count correct predictions, while non-zero entries off the diagonal show where the model predicted the wrong digit.
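To find the specific mistakes that the confusion matrix summarizes, one could also list the misclassified test samples directly; this sketch reuses the split and model assumptions from the earlier examples.

```python
# Sketch: locating the misclassified test samples behind the confusion matrix's off-diagonal counts.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

y_predicted = model.predict(X_test)
wrong = np.where(y_predicted != y_test)[0]       # indices where prediction and truth disagree

for i in wrong:
    print(f"test sample {i}: truth={y_test[i]}, predicted={y_predicted[i]}")
```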

15:05

🌼 Exercise: Iris Flower Dataset and Logistic Regression

The final paragraph presents an exercise for the viewer. The exercise involves using the Iris flower dataset, which contains 150 samples with four features: petal width, petal length, sepal width, and sepal length. The task is to load the dataset, divide it into training and test samples, build a logistic regression model, and determine the model's accuracy. The speaker also encourages making a few predictions using the model. The paragraph concludes with the speaker providing a link to a Jupyter notebook containing the exercise and urging the viewer to practice to gain expertise in machine learning.
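A possible solution outline for the exercise, assuming the standard `load_iris` dataset from scikit-learn; the split ratio and `max_iter` are assumptions, and the exercise is of course meant to be attempted before looking at a solution.

```python
# Sketch: the Iris exercise - load, split, train, score, and make a few predictions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()                                    # 150 samples, 4 features, 3 classes

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("accuracy:", model.score(X_test, y_test))
print("predicted species:", iris.target_names[model.predict(X_test[:3])])   # a few predictions
print("actual species:   ", iris.target_names[y_test[:3]])
```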

Keywords

💡Logistic Regression

Logistic Regression is a statistical method for binary classification tasks, but in the context of this video, it is extended to multi-class classification problems. It is used to predict the probabilities of different possible outcomes of a categorically distributed dependent variable. In the video, logistic regression is used to predict handwritten digits (0 to 9), which is a multi-class classification problem.

💡Binary Classification

Binary Classification is a type of supervised learning where the output is one of two possible classes, often termed as 'yes' or 'no'. The video mentions binary classification in the context of the previous tutorial, contrasting it with the multi-class classification problem discussed in the current tutorial.

💡Multi-Class Classification

Multi-Class Classification is an extension of binary classification where the output can belong to more than two classes. In the video, the task is to predict which digit (0 through 9) a handwritten character represents, thus involving multiple distinct classes.

💡Handwritten Digit Recognition

Handwritten Digit Recognition is a classic problem in machine learning where the goal is to identify the digit (0 to 9) represented by a handwritten image. The video uses this as an example to illustrate how logistic regression can be applied to a multi-class classification problem.

💡Jupyter Notebook

Jupyter Notebook is an open-source web application that allows creation and sharing of documents that contain live code, equations, visualizations, and narrative text. The video script mentions using a Jupyter Notebook as an IDE to write and run the code for logistic regression.

💡Matplotlib

Matplotlib is a plotting library for Python that helps in creating static, interactive, and animated visualizations. In the video, it is used to visualize the handwritten digit images; the confusion matrix is later plotted as a heatmap with seaborn, which builds on matplotlib.

💡Scikit-learn

Scikit-learn is a popular machine learning library in Python that provides simple and efficient tools for data analysis and modeling. The video uses scikit-learn to load predefined datasets, split the data into training and test sets, and to create and train the logistic regression model.

💡Train-Test Split

Train-test split is a method used in machine learning to divide a dataset into a training set to train the model and a test set to evaluate its performance. The video demonstrates how to use the train-test split to prevent overfitting and to ensure the model's ability to generalize to new, unseen data.
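A tiny sketch of the split described above, checking that roughly 80% of the 1797 samples end up in the training set; the 80/20 ratio is the one used in the video, everything else is an assumption.

```python
# Sketch: an 80/20 train-test split and a quick size check.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2   # hold out 20% of samples for testing
)

print(len(X_train), len(X_test))   # roughly 1437 and 360 out of 1797 samples
```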

💡Model Accuracy

Model accuracy is a performance measurement that indicates the proportion of correct predictions made by the model out of all the total predictions. In the video, the logistic regression model's accuracy is calculated using the test set, resulting in a score of 96.67%, which is considered very good.

💡Confusion Matrix

A Confusion Matrix is a table layout that allows visualization of the performance of a classification model. For a binary problem it shows the counts of true positives, true negatives, false positives, and false negatives; for a multi-class problem like this one, each entry counts how often a given true class was predicted as a given class. The video uses a confusion matrix to analyze the specific instances where the logistic regression model made incorrect predictions.

💡Iris Flower Dataset

The Iris Flower Dataset is a well-known dataset in the field of machine learning, containing data on 150 samples of three different species of Iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The video sets an exercise where the viewer is to use this dataset to build a logistic regression model and evaluate its accuracy.

Highlights

The tutorial is a continuation of logistic regression, focusing on multi-class classification.

Binary classification is contrasted with multi-class classification, where outcomes can be one of three or more.

The concrete problem for the tutorial is recognizing handwritten digits (0 to 9).

A training set with numerous handwritten digit characters is used to build the logistic regression model.

The dataset used is the 'digits' dataset from scikit-learn (loaded with load_digits), consisting of 1797 8x8 images.

Each image is represented as a one-dimensional array with 64 elements.

Matplotlib is used to visualize the numeric data as actual images.

The target variable indicates the class label (0 to 9) for each image; the first few samples are labeled 0, 1, 2, 3, 4 in sequence.

The data is split into training and test sets using the train_test_split function from scikit-learn.

The model is trained using the training data with the fit method.

The model's accuracy is evaluated using the test data and the score method.

A confusion matrix is introduced as a tool to visualize the model's performance.

The tutorial concludes with an exercise involving the Iris flower dataset and logistic regression.

The Iris dataset features four attributes: petal width, petal length, sepal width, and sepal length.

The exercise requires loading the dataset, splitting it into test and training sets, and predicting iris flower types.

The tutorial emphasizes the importance of not just watching but practicing with the provided code and exercises.

The Jupyter notebook and exercise link are provided for further practice.

Transcripts

00:00

This is part two of the logistic regression tutorial; if you haven't watched the first part, you should watch that first. In the previous tutorial we discussed binary classification, where the output classes are binary in nature: they are either yes or no. In this one we are going to discuss multi-class classification, for example when you are trying to predict which party a person is going to vote for and the possible outcomes are one of three. The concrete problem that we are going to solve today is to recognize handwritten digits: each image maps to one of the output categories, which is nothing but a digit from 0 to 9. So we will use a training set with a lot of handwritten digit characters, then we'll build a model using logistic regression, and at the end of the tutorial you will have an interesting exercise to work on.

01:01

So let's jump straight into writing the code. As usual I am going to use my Jupyter notebook as an IDE, and here I have imported matplotlib and also scikit-learn's datasets. sklearn.datasets has some predefined, ready-made datasets that you can use to learn machine learning; from this I am using the load_digits dataset. If you read the documentation, all it is is 1797 handwritten digits of size eight by eight, and what we are going to do is, given these digits, identify which digit each one is. Let me just run it. This has run fine, so I am now going to call the load_digits method to load my training set, and I want to explore what this training set contains. It contains a couple of things. It has data, which is your real data, so let's print a few elements. As it's written in the documentation there are 1797 samples, so I'm just going to print the first one, and it's an array. It is an eight-by-eight image, but the image is represented as a one-dimensional array, so if you count these elements there will be 64, which is eight by eight.

02:45

If you want to see this particular element, you can use matplotlib, so I'm going to do plt.gray(), and plt has a method called matshow with which you can print the corresponding image. So data has the numeric data and images has the actual images. You can see that data[0] and images[0] relate to each other; the only difference between the two is that one is numeric data and the other is an actual image. If you want to print, say, the first five samples, you can just print them like this, and you will see 0, 1, 2, 3, 4, and the corresponding numbers will be in this data array. That looks pretty straightforward.

03:53

Now what we're going to do is use this to train our model. Before we do that, let's take a look at target and target_names. If I print digits.target[0:5], you see the labels are literally in sequence: the first element is 0, then 1, 2, 3, and that's what is printed here. It is saying that the first image is 0 and the last image is 4. So this is our complete training set, which has the images as well as the target variable that says what each one is, so we can use data and target to train our model.

04:49

Now, before training our model, the usual thing we do is import train_test_split from model_selection, and we try to divide our dataset into training and test samples. The way you do it is you say X_train, X_test... I don't exactly remember the order of the arguments, so let me do this: to train_test_split I pass digits.data, because that's your dataset, and then digits.target, because that's your target variable. If you hit Shift+Tab it will show you the nice documentation of that API, and here it says the order in which it returns the output.

06:01

All right, so what I just did by executing this command is I took the input and output variables from my training set and divided them into test and train sets. The reason we typically do this is that we don't want to overfit our model; we don't want to bias it towards the training data. That's why the data the model is trained against should be different from the data the model is tested against. I also have to supply the size, so I'm going to supply test_size: I want 20 percent of my samples to be the test set and 80 percent to be the training set. If I look at the length of X_train and the length of X_test, the training set is roughly 80 percent of all available samples.

07:20

Now that I have a training and test split, I can create my logistic regression model. I import LogisticRegression and create a model object so that I can train it later, and you all know the way you train it is by calling the fit method. The fit method is called with X_train and y_train, and when you run that, the model gets trained using the X_train and y_train datasets. To repeat: X_train has the handwritten characters and y_train has the corresponding output; it says, for this image it is 4, and so on.

08:20

Now that my model is ready, the first thing I always do is calculate the score. The score tells you how accurate your model is, and the way you do that is by supplying X_test and y_test. Using X_test it will calculate the predicted values and compare those predicted values against the real values, which is y_test. It turns out my model is doing pretty well: the accuracy is almost 96.67 percent, which is really good.

08:52

Now I'm going to make an actual prediction, and you know that you have to call the predict method for that. Before I call it, I want to pick up a random sample, so I will say plt.matshow(digits.images[67]). Hmm, this one is pretty hard; even I don't know what this number is. This number is actually digits.target[67]; you have to access the same index in your target, and it is six. So let's see what our model will predict for this one: I will say model.predict. Now, I'm not going to supply the image here, because images is the 2-D image data and my model likes the numeric data, so I will use the same index 67 but from data instead of images. This is the error you get when you don't supply a two-dimensional array, so I'm just going to supply a two-dimensional array, and you can see that it is predicting the target variable. Let me also create a new cell and predict the first five samples; you all know they are literally the digits 0, 1, 2, 3, 4, and when executed you can see my model is doing pretty well. So my score is 0.96.

11:10

How do I know where it didn't do well? All the samples I tried, it seems to be doing pretty well, so I want to know where exactly it failed and get an overall feeling for my model's accuracy, and one of the ways of doing that is a confusion matrix. So I will show you what a confusion matrix really is. For that I have to import confusion_matrix from sklearn.metrics, and before I do that I need to get the predicted values, so I call predict on X_test; when I run that I get all the predicted values for X_test. Then I create a confusion matrix, and to the confusion matrix you supply y_test, which is the truth, and then y_predicted, which is what your model predicted, and you get a confusion matrix back.

12:22

When you run that you get this two-dimensional array, and you may be wondering what this is; it is better visualized with matplotlib or seaborn, so I will use that library for the visualization. Here I'm just going to copy-paste the code for the confusion matrix visualization. I am using the seaborn library, which is similar to matplotlib and used for visualization, and I'm calling a heatmap here with the confusion matrix cm variable that we created. When you run that, this is the confusion matrix you get. The way this works is: here you see the number 37, which means 37 times the truth was zero and my model predicted zero. This 2 means that two times my truth was eight, meaning I fed my model an image of an eight but my model said it is a one; these are the instances where it's not doing well. So anywhere in this region where you don't see a zero, it means your model is not predicting correctly. Here, for example, two times my images were of the digit four but my model predicted one. A confusion matrix is just a nice way of visualizing how well your model is doing.

14:03

All right, now it's time for the exercise. Today's exercise is going to use the iris flower dataset from sklearn.datasets, which has the following four features. If you don't know about iris, iris is a type of flower, and the flower has two types of leaves: one is called the sepal and the other is called the petal, and they each have a length and width. Based on these lengths and widths you can predict what kind of iris flower it is. Our dataset has three kinds of flowers; these are the names of the three different iris flowers, and the features we have are these four, which are basically petal length and width and sepal length and width. You will use this iris dataset, load all 150 samples, divide them into test and training samples, build a logistic regression model, and tell me the accuracy you can come up with; then you can do a few predictions using that model.

15:22

That's all I had for this tutorial. I have the link to this Jupyter notebook down below, and you can find the exercise there as well, so make sure to refer to those useful links, and please do some practice yourself; just by watching this video you are not going to become an expert. All right, thanks for watching.


Related Tags
Logistic Regression, Multi-Class, Digit Recognition, Python, scikit-learn, Machine Learning, Data Science, Jupyter Notebook, Model Accuracy, Confusion Matrix, Iris Dataset