Machine Learning Tutorial Python - 8 Logistic Regression (Multiclass Classification)
Summary
TL;DR: This tutorial delves into the application of logistic regression for multi-class classification, exemplified by predicting which party a person will vote for or recognizing handwritten digits. The presenter begins by loading the 'digits' dataset from scikit-learn, which contains 1797 samples of 8x8 images representing the digits 0 to 9. They then demonstrate how to split the dataset into training and test sets using an 80/20 ratio to prevent overfitting. A logistic regression model is trained on the training data, and its accuracy is assessed on the test set, reaching an impressive 96.67%. The tutorial also includes a practical exercise for viewers to apply logistic regression to the Iris flower dataset, which uses features like petal and sepal length and width to classify three types of iris flowers. The presenter emphasizes the importance of hands-on practice and provides a link to a Jupyter notebook for further learning.
Takeaways
- 📈 **Binary vs. Multi-Class Classification**: The tutorial begins with a recap of binary classification and then moves on to multi-class classification, which is used to predict multiple outcomes, such as identifying which political party a person will vote for.
- 🔢 **Handwritten Digit Recognition**: The specific problem tackled is recognizing handwritten digits (0 to 9) using a training set of such characters.
- 📚 **Using sklearn's Datasets**: The script demonstrates how to use `sklearn.datasets` to load predefined datasets, in this case, the 'digits' dataset comprising 8x8 images of handwritten digits.
- 🖼️ **Data Representation**: The handwritten digit images are represented as one-dimensional arrays, which can be visualized using `matplotlib` to show the corresponding 8x8 image.
- 🤖 **Model Training with Logistic Regression**: A logistic regression model is built using the training data, with the `fit` method applied to the training set (`x_train` and `y_train`).
- ✅ **Model Accuracy Assessment**: The model's accuracy is evaluated using the test set (`x_test` and `y_test`), and the script shows that the model achieved a high accuracy rate of 96.67%.
- 🔮 **Model Prediction**: The `predict` method is used to make predictions on new data, and the script highlights the need to supply numeric data that corresponds to the same index as the target variable.
- 📊 **Confusion Matrix for Model Evaluation**: A confusion matrix is introduced as a tool to visualize the model's performance, showing where the model is making mistakes in its predictions.
- 🌟 **Exercise with Iris Dataset**: The tutorial concludes with an exercise for the viewer to practice using the Iris flower dataset, which includes features like petal and sepal length and width, to build and evaluate a logistic regression model.
- 🔍 **Importance of Data Splitting**: The script emphasizes the importance of splitting data into training and test sets to prevent overfitting and to ensure the model's generalizability.
- 📝 **Documentation and Code Comments**: The tutorial uses `shift + tab` to show API documentation and includes comments in the code to explain each step, highlighting the importance of understanding the tools and methods used.
Q & A
What is the main topic of this tutorial?
-The main topic of this tutorial is logistic regression for multi-class classification, specifically focusing on recognizing handwritten digits.
What is the dataset used in this tutorial for training the logistic regression model?
-The dataset used is the 'digits' dataset from scikit-learn, which contains 1797 samples of handwritten digits of size 8x8.
How is the handwritten digit image data represented in the dataset?
-The handwritten digit image data is represented as a one-dimensional array of 64 elements, corresponding to an 8x8 image.
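This correspondence between the flat array and the 8x8 image can be checked directly. A minimal sketch (the `data` and `images` attributes are real fields of the `load_digits` return value):

```python
from sklearn.datasets import load_digits
import numpy as np

digits = load_digits()

# each row of .data is a flat 64-element vector; .images holds the 8x8 form
first_flat = digits.data[0]
first_image = digits.images[0]

print(first_flat.shape)   # (64,)
print(first_image.shape)  # (8, 8)

# reshaping the flat vector recovers the 8x8 image exactly
print(np.array_equal(first_flat.reshape(8, 8), first_image))  # True
```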
What is the purpose of splitting the dataset into training and test sets?
-The purpose of splitting the dataset is to prevent overfitting and to ensure that the model can generalize well to unseen data by testing it against a different set of data.
How is the logistic regression model trained in this tutorial?
-The logistic regression model is trained by calling the 'fit' method with the training data (x_train and y_train).
What is the accuracy score of the logistic regression model on the test set?
-The accuracy score of the logistic regression model on the test set is 96.67 percent.
How does the model make predictions on new handwritten digit images?
-The model makes predictions by calling the 'predict' method with the numeric data of the new handwritten digit images.
What is a confusion matrix and how is it used?
-A confusion matrix is a two-dimensional array that visualizes the performance of a classification model. It shows the instances where the model's predictions did not match the actual values, highlighting areas where the model is not performing well.
What is the exercise given at the end of the tutorial?
-The exercise involves using the iris flower dataset to build a logistic regression model, calculate its accuracy, and make a few predictions.
What are the four features included in the iris flower dataset?
-The four features included in the iris flower dataset are sepal width, sepal length, petal width, and petal length.
How can one visualize the confusion matrix?
-One can visualize the confusion matrix using libraries like matplotlib or seaborn by calling a heatmap function to display the matrix as a color-coded grid.
What is the importance of practicing with the provided dataset and exercise?
-Practicing with the provided dataset and exercise helps solidify the understanding of logistic regression and machine learning concepts, and is essential for developing expertise in the field.
Outlines
📘 Introduction to Multi-Class Logistic Regression
This paragraph introduces the concept of multi-class classification using logistic regression, contrasting it with binary classification. The tutorial focuses on a specific problem: recognizing handwritten digits (0 to 9). The speaker outlines the process of using a training set of handwritten digit characters to build a logistic regression model. The training set is loaded with scikit-learn's `load_digits` function, which returns 1797 8x8 images. The speaker demonstrates how to load the dataset and explore its contents, including the `data` and `images` attributes and how they correspond to each other. The `target` attribute is also discussed, showing how it labels each image with the correct digit.
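The loading-and-exploring steps described above can be sketched as follows (a minimal reconstruction, not the video's exact code; the `Agg` backend line is an addition for headless runs and can be dropped in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; not needed in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.data.shape)     # (1797, 64) - 1797 flattened 8x8 images
print(digits.target[:5])     # [0 1 2 3 4] - labels for the first samples
print(digits.target_names)   # [0 1 2 3 4 5 6 7 8 9]

# visualize the first sample as an image, as shown in the video
plt.gray()
plt.matshow(digits.images[0])
```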
📚 Data Preparation and Model Training
The second paragraph details the process of data preparation for model training. The speaker explains how to use the `train_test_split` function from `sklearn.model_selection` to divide the dataset into training and test samples. The purpose of this split is to prevent overfitting by ensuring the model is tested against data it hasn't seen before. The speaker specifies that 20% of the samples should be used as the test set, with the remaining 80% used for training. After splitting the data, the speaker proceeds to create a logistic regression model using the training data and target variable. The model's accuracy is then assessed using the test data, and the speaker finds that the model performs well with an accuracy of 96.67%. The paragraph concludes with the speaker demonstrating how to make predictions using the trained model.
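The split-train-score sequence can be sketched like this; `random_state` and the raised `max_iter` are additions for a reproducible split and solver convergence, not parameters shown in the video:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()

# 80/20 split as in the video; random_state fixes the shuffle (my addition)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1)

print(len(X_train), len(X_test))  # 1437 360

# raise max_iter so the default lbfgs solver converges on this data
model = LogisticRegression(max_iter=10_000)
model.fit(X_train, y_train)

# accuracy on held-out data; typically around 0.96 for this dataset
print(model.score(X_test, y_test))

# predictions take the numeric .data rows, not the 2D .images arrays
print(model.predict(digits.data[0:5]))
```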
🔍 Evaluating Model Performance with a Confusion Matrix
In this paragraph, the speaker discusses evaluating the logistic regression model's performance using a confusion matrix. The confusion matrix is a tool that visualizes the performance of a classification model by showing the true versus predicted classifications. The speaker explains how to obtain predicted values from the test set and then create a confusion matrix using these predictions and the true values. The confusion matrix is then visualized using a heatmap, which provides a clear picture of where the model is performing well and where it is making mistakes. The speaker emphasizes that the counts on the matrix's diagonal are correct predictions, while non-zero values off the diagonal mark the instances where the model is getting it wrong.
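A sketch of the confusion-matrix step; the split parameters (`random_state`, `max_iter`) are carried-over assumptions from the 80/20 split above rather than the video's exact code:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1)

model = LogisticRegression(max_iter=10_000).fit(X_train, y_train)

# rows index the truth, columns index the prediction
y_predicted = model.predict(X_test)
cm = confusion_matrix(y_test, y_predicted)

print(cm.shape)      # (10, 10) - one row/column per digit class
print(np.trace(cm))  # correct predictions sit on the diagonal
```

To render it as the color-coded grid shown in the video, pass `cm` to seaborn's heatmap, e.g. `import seaborn as sn; sn.heatmap(cm, annot=True)` inside a matplotlib figure.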
🌼 Exercise: Iris Flower Dataset and Logistic Regression
The final paragraph presents an exercise for the viewer. The exercise involves using the Iris flower dataset, which contains 150 samples with four features: petal width, petal length, sepal width, and sepal length. The task is to load the dataset, divide it into training and test samples, build a logistic regression model, and determine the model's accuracy. The speaker also encourages making a few predictions using the model. The paragraph concludes with the speaker providing a link to a Jupyter notebook containing the exercise and urging the viewer to practice to gain expertise in machine learning.
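One possible solution sketch for the exercise (the video deliberately leaves this to the viewer; `random_state` and `max_iter` are additions, not prescribed by the tutorial):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()  # 150 samples, 4 features, 3 classes

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1)

model = LogisticRegression(max_iter=1_000)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # accuracy on the held-out 20%

# map a few numeric predictions back to flower names
print(iris.target_names[model.predict(X_test[:3])])
```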
Keywords
💡Logistic Regression
💡Binary Classification
💡Multi-Class Classification
💡Handwritten Digit Recognition
💡Jupyter Notebook
💡Matplotlib
💡Scikit-learn
💡Train-Test Split
💡Model Accuracy
💡Confusion Matrix
💡Iris Flower Dataset
Highlights
The tutorial is a continuation of logistic regression, focusing on multi-class classification.
Binary classification is contrasted with multi-class classification, where outcomes can be one of three or more.
The concrete problem for the tutorial is recognizing handwritten digits (0 to 9).
A training set with numerous handwritten digit characters is used to build the logistic regression model.
The dataset used is the `load_digits` dataset from scikit-learn, consisting of 1797 8x8 images.
Each image is represented as a one-dimensional array with 64 elements.
Matplotlib is used to visualize the numeric data as actual images.
The target variable indicates the class label for each image, with labels running from 0 to 9.
The data is split into training and test sets using the train_test_split function from scikit-learn.
The model is trained using the training data with the fit method.
The model's accuracy is evaluated using the test data and the score method.
A confusion matrix is introduced as a tool to visualize the model's performance.
The tutorial concludes with an exercise involving the Iris flower dataset and logistic regression.
The Iris dataset features four attributes: petal width, petal length, sepal width, and sepal length.
The exercise requires loading the dataset, splitting it into test and training sets, and predicting iris flower types.
The tutorial emphasizes the importance of not just watching but practicing with the provided code and exercises.
The Jupyter notebook and exercise link are provided for further practice.
Transcripts
this is part two of logistic regression
tutorial if you haven't
watched the first part then you should
watch that first
in the previous tutorial we discussed
about binary classification where the
output classes are binary in nature they
are either yes or no
in this one we are going to discuss
multi-class classification for example
when you are trying to predict which
party a person is going to vote for
the possible outcomes are one of these
three
the concrete problem that we are going
to solve today is to recognize the
handwritten
digit for example here this one maps to
one of the output categories which is
nothing but
digits 0 to 9. similarly here four
maps to this particular output category
so we will
uh use a training set with lots of handwritten
digit characters and then we'll build a
model using logistic regression
and at the end of the tutorial you will
have an interesting exercise to work on
so let's uh jump straight into writing
the code
as usual i am going to use my jupyter
notebook as
an ide and here i have imported
matplotlib
and also scikit-learn's datasets module
so sklearn.datasets has some predefined
ready-made datasets that you can use to
learn machine learning
from this i am using load digits data
set
so if you read the documentation all it
is
is 1797
handwritten uh digits uh of size
eight by eight okay so it looks
something like this
and what we are going to do is given
these digits we are going to
identify that what what digit that is
all right so let me just run it
so this has run fine i am going to now
call load digits method to
load my training set basically and i
want to explore
what this training set contains so it
contains
couple of things it has data
which is your real data so let's print
few elements
so as it's written in the
documentation there are 1797 sample
so i'm just going to print the first one
and it's an array okay as such
it is an eight by eight uh
image but the image is represented as
a one dimensional array so if you count
these elements it will be
uh 64 which is eight by eight and if you
want to
see this particular element then you can
use uh
matplotlib so i'm going to do plot
plt dot gray and plt
has a method called mat show
and what you can do is you can print the
corresponding image
so data has uh the numeric data and
images will have the actual images
so you can see that our data 0
and image 0 they kind of relate to each
other
and the only difference between the two
is that you have numeric
numeric data here versus you have an
actual image so if you want to print
let's say first five sample
then you can just print it like this
and you will see that c 0 1 2 3
four okay and corresponding
numbers will be in this data array so
that looks pretty straightforward
now what we're going to do is use this
uh to train our model
now before we do that let's uh take a
look at target and target names
okay so our
target so if i print
digit.target 0
let me print zero to five so you see
like
zero to five is literally in the
sequence the first element is zero
one two three and that's what this is
printing here
it is saying that this image
is zero the last image this is
four so this is our complete training
set
which has our image as well as the
target variable you know like it says
what it is so we can use
our data data and target to train our
model
now before training our model the usual
thing that we do is
we import from model selection
we import train
test split
and we try to divide our data set
into our training and test
samples so the way you do it is you say
x train x test
i don't exactly remember the order of
the argument so i'm going to
what i'm going to do it okay let me do
this so train test split
uh digits.data because that's your data
set
then you have digits.target because
that's your target variable
okay and if you hit shift tab it will
show you all the
nice documentation of that api
so here it says this is the order in
which
it returns the output
all right so now what i just did by
executing this
command is i had
input account output variable from my
training set and i divided them into
test and train sets now the reason that
we do this typically is
we don't want to uh overfit our model we
don't want to
make our model such that we just uh bias
it against the
training data that's why the data that
the model is trained against
should be different than the data that
uh
the model is tested against okay so
that's why we
split these two so if you look at
okay i have to supply
the size so i'm going to probably supply
test size
test
size so i want 20 percent of my
samples to be test size and 80 percent
to be
the training okay so if i look at
length of x train it is this
and if i look at length of x test it is
this so this is roughly 80 percent of
all available
samples all right so i have a training
and test
data set split now i can
create my logistic regression
model so from this i want to import
logistic regression
and create a model object
so that you can train it later and you
all know the way you train it
is by calling a fit method
and fit method you will call it against
x test
train sorry and y train
when you run that the model is getting
trained using this
x train and y train data set so again to
repeat
x train has the hand written characters
and y train will have the corresponding
output it will say okay for this image
it is 4 etc
now since my model is ready the first
thing i always do is
i calculate the score so the score tells
you
uh how accurate is your model and the
way you do that
is you have to supply x test and y test
so
using the x test it will calculate the
y predicted value and it will compare
those y predicted value against the
real value which is y test turns out
that my model is doing pretty good
the accuracy is 96.67 percent almost
which is really good so now i'm going to
make my actual prediction and you know
that
you have to call predict method for that
now let's see so before i call this
method
what i want to do is i want to
pick up a random sample so i will say
plt dot mat show
digits dot images
let's say i'm just picking up a random
sample okay
hmm this is pretty hard even i don't
know what this number is actually
let's see so this number
is actually digits dot target
67 so you have to access the same
index in your target
okay so this is six okay so let's see
what our
model will predict for this guy so i
will say model
dot predict okay model.predict what
okay what do i want to predict i want to
predict
now see i'm not going to supply images
here
because image is all binary data
my model likes numeric data more so
i will use the same index 67 but i am
using data
instead of images
okay this is the error you get when
you're not supplying multi-dimensional
arrays i'm just going to supply
multi-dimensional array just for the
sake of it
and you can see that it is predicting
the target variable
all right okay let's just
okay let me just create a new cell here
and let me predict okay what do i want
to predict
okay i want to predict zero to five now
you all know zero to five is literally
zero to five so
zero is zero one is 1 and so on
when executed see my model is doing
pretty good
so my score is 0.96
how do i know where it didn't do well
okay
because all the samples i tried it seems
to be doing pretty well
so i want to know where exactly it fell
and you know i want to get
overall feeling of my model's accuracy
and one of the ways of doing that is
confusion matrix
so i will show you what confusion matrix
is really
for that i have to import
from sklearn metrics i need to import
confusion matrix okay and then
before i do that i need to uh
get the predicted values so i will say
predict
x test when i run that i get all the
predicted values for this
x test okay and then i create a
confusion matrix
and in the confusion matrix what you
supply
is y test which is the truth
and then y predicted which is what your
model predicted
and then you get confusion matrix back
when you run that you get this two
dimensional array and you are wondering
what the heck this is so this is better
visualized
in matplotlib or seaborn right
so i will use that library
for the visualization here i'm just
going to copy paste the code
for confusion matrix visualization here
i am using
seaborn library which is similar to
matplotlib it's used for
visualization and i'm calling a heat map
here
with the confusion matrix cm variable
that we created here
and when you run that
this is the confusion matrix that you
got now
the way this works is see here you see
37 number
what it means is 37 time
the truth was zero and my model
predicted it to be zero
this two means two times
my truth was eight meaning i
fed my model the image of eight but my
model
said no it is one so these are the
instances
where it's not doing good so you can see
that in
in anywhere in this area in this area
when you don't see
zero it means your model is not working
right
so here for example again two times
my images were of digit
four but my model predicted it to be
one so that's what this is so confusion
matrix is just a nice way of
visualizing uh how well your model is
doing
all right now it's the time for exercise
today's exercise
is going to be uh using
sklearn datasets iris flower data set
which has following four features so
if you don't know about iris iris is a
type of flower
and the flower has two types of
leaves you know one leaf is called
sepal the other one
is called petal
and they have like a height and width
and based on these height and widths you
can
you can predict what kind of iris flower
it is
okay so our data set will have three
kind of flowers
these are the names of three different
iris flowers
and the features that we have are these
four
which is basically petal width and
height and sepal width and height
and you will use uh this
data set the iris data set and you will
load all those 150 samples
then divide them into test and training
samples
and then build a logistic regression
model
and tell me the accuracy
that you can come up with and then you
can just do a few predictions
uh using that model all right that's all
i had for this tutorial
i have the link of this jupyter notebook
down below
and you can find the exercise also so
make sure to refer to those
useful links and please please do some
practice
yourself just by watching this video you
are not going to become expert
alright thanks for watching