Machine Learning Tutorial Python - 13: K Means Clustering Algorithm
Summary
TLDR: The video introduces machine learning, focusing on unsupervised learning, particularly the K-means clustering algorithm. It explains how K-means helps identify clusters within a dataset without predefined labels. The steps of choosing K, initializing random centroids, calculating distances, and refining clusters are demonstrated. The video covers the Elbow Method to determine the optimal number of clusters and explores coding this in Python. Using a dataset with age and income, the tutorial shows how clustering reveals hidden group characteristics. The video concludes with an exercise using the iris dataset and elbow plot to find the optimal K.
Takeaways
- Machine learning algorithms are categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning.
- Unsupervised learning focuses on identifying underlying structures or clusters in the data without knowing the target variable.
- K-means is a popular clustering algorithm used to divide a dataset into clusters based on proximity to centroids.
- The parameter 'k' in K-means represents the number of clusters and needs to be specified before running the algorithm.
- The process involves initializing random centroids, calculating distances between points and centroids, and adjusting clusters iteratively.
- The centroid positions are recalculated until no data points change clusters, leading to final, stable clusters.
- The elbow method helps determine the optimal number of clusters by plotting the sum of squared errors against different values of 'k'.
- Scaling features like age and income using MinMaxScaler can improve clustering accuracy by normalizing the feature range.
- Visualizing clusters with scatter plots helps understand the clustering results better, especially after scaling.
- For real-world datasets with many features, the elbow method is used to avoid manual cluster visualization and to identify the best number of clusters efficiently.
Q & A
What are the three main categories of machine learning algorithms?
-The three main categories of machine learning algorithms are supervised learning, unsupervised learning, and reinforcement learning.
How is supervised learning different from unsupervised learning?
-In supervised learning, the dataset contains a target variable or class label, which helps in training the model to make predictions. In unsupervised learning, there is no target variable, and the goal is to identify patterns or structures within the data, such as clusters.
What is the K-means algorithm used for?
-K-means is a popular clustering algorithm used in unsupervised learning to group data points into clusters based on their similarities. It starts by randomly selecting centroids and then iteratively refines the clusters by minimizing the distance between the data points and their respective centroids.
How do you determine the initial value of K in K-means?
-The value of K, which represents the number of clusters, is a free parameter that must be specified before running the K-means algorithm. The choice of K can be based on visual inspection of the data or using methods like the elbow method to find the optimal value.
What is the role of centroids in K-means clustering?
-Centroids represent the center of each cluster. The algorithm assigns each data point to the nearest centroid, and these centroids are adjusted iteratively to minimize the distance between the data points and the centroid of their assigned cluster.
What is the elbow method, and how is it used in K-means?
-The elbow method helps in selecting the optimal value of K by plotting the sum of squared errors (SSE) for different values of K. The goal is to find a point on the curve where the SSE starts to decrease at a slower rate, forming an 'elbow.' This point indicates a good value for K.
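The elbow computation described in this answer can be sketched as follows. This is a minimal illustration, not the tutorial's own notebook: the data is synthetic (three invented blobs), and `n_init=10` and `random_state=0` are choices made here only to keep the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three obvious groups (made up for illustration).
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in blob_centers])

sse = []
k_range = range(1, 10)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    # inertia_ is the sum of squared distances of samples to their
    # nearest cluster center, i.e. the SSE for this value of k.
    sse.append(km.inertia_)

for k, err in zip(k_range, sse):
    print(k, round(err, 2))
```

Plotting `sse` against `k_range` gives the elbow chart; on data like this the curve should fall steeply up to k = 3 and flatten afterwards.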
Why is feature scaling important in K-means?
-Feature scaling is important because K-means uses distance calculations to form clusters. If the features have different scales (e.g., one in thousands and another in tens), the larger scale feature will dominate the distance calculation, leading to incorrect clustering. Scaling ensures that all features contribute equally.
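A small numeric sketch of this effect, using invented (age, income) values rather than the tutorial's data. It applies the min-max formula `(x - min) / (max - min)` directly with NumPy; scikit-learn's `MinMaxScaler` applies the same formula per column with its default settings.

```python
import numpy as np

# Hypothetical (age, income) rows; the numbers are invented for illustration.
people = np.array([[27.0, 70000.0],
                   [29.0, 80000.0],
                   [42.0, 72000.0],
                   [45.0, 160000.0]])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Raw units: the income gap swamps the age gap, so person 0 looks closer
# to person 2 (15 years older) than to person 1 (2 years older).
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# Min-max scaling maps every column onto the range [0, 1].
mins, maxs = people.min(axis=0), people.max(axis=0)
scaled = (people - mins) / (maxs - mins)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```

After scaling, both features contribute on the same footing and person 1 becomes the nearer neighbour of person 0.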
What is inertia in the context of K-means clustering?
-Inertia refers to the sum of squared distances between each data point and the nearest centroid. It is used as a measure of how well the clusters are formed, with lower inertia indicating better clustering. In the elbow method, inertia is plotted to determine the optimal number of clusters.
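To make the definition concrete, here is a hand computation of that sum on four made-up points. The helper name `sum_squared_error` is hypothetical; in scikit-learn the same quantity is exposed on a fitted model as the `inertia_` attribute.

```python
import numpy as np

def sum_squared_error(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

points = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [4.0, 1.0]])

# Every point sits exactly 1 unit from its centroid, so SSE = 4 * 1**2.
print(sum_squared_error(points, labels, centroids))  # 4.0
```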
How does the K-means algorithm stop, and what does the final result represent?
-The K-means algorithm stops when no data points change clusters between iterations. At this point, the clusters are considered stable, and the centroids represent the center of each cluster. The final result is the partitioning of the data into distinct clusters.
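The assign, update, and stop loop described above can be sketched in plain NumPy. This is an illustrative version under stated assumptions, not scikit-learn's implementation: the function name `kmeans` is hypothetical, and the initial centroids are drawn from the data points themselves, whereas the video places them anywhere in the plane.

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means: assign to nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no point changed cluster, so the clusters are stable
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated toy groups (made-up numbers).
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

The cluster numbering depends on the random start, so only the grouping is meaningful: the first three points should share one label and the last three the other.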
How would you apply K-means clustering to a real-world dataset like the age and income dataset mentioned in the script?
-To apply K-means clustering to the age and income dataset, the first step is to preprocess the data, including scaling the features (age and income). Next, the K-means algorithm is run with an initial value of K (e.g., 3 for three clusters). The clusters are then refined iteratively by adjusting centroids and reassigning data points until the clusters stabilize. Finally, an elbow plot can be used to determine the optimal K.
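A compact sketch of that workflow with pandas and scikit-learn. The DataFrame here is a hypothetical stand-in: the column names `Name`, `Age`, `Income` and all values are invented, and `n_init=10` and `random_state=0` are set only for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the tutorial's data (names and values invented).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F", "G", "H", "I"],
    "Age": [26, 27, 29, 38, 40, 42, 36, 38, 40],
    "Income": [45000, 50000, 60000, 60000, 70000, 80000,
               150000, 155000, 160000],
})

# Leave out the string column and scale the numeric features to [0, 1].
features = ["Age", "Income"]
scaled = MinMaxScaler().fit_transform(df[features])

# fit_predict runs K-means and returns one cluster label per row.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(scaled)

print(df)
print(km.cluster_centers_)  # centroids, expressed in the scaled units
```

Scaling into a separate array rather than overwriting the columns keeps the original ages and incomes readable next to the cluster labels.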
Outlines
Introduction to Machine Learning Categories
The script introduces the three main categories of machine learning: supervised, unsupervised, and reinforcement learning. It explains that supervised learning involves datasets with target variables, while unsupervised learning works without labeled data, focusing on identifying underlying structures or clusters. The concept of clustering and the popular k-means algorithm are introduced, with a preview of how the tutorial will be divided into theory, coding, and an exercise.
Identifying Clusters in Unsupervised Learning
The script explores the process of identifying clusters in a dataset with no target labels using k-means clustering. The algorithm requires specifying a parameter 'k', representing the number of clusters. The example demonstrates how random centroids are chosen and data points are assigned to clusters based on proximity. The centroids are adjusted iteratively until the clusters stabilize, ensuring that no points change their clusters.
Using the Elbow Method to Determine Optimal Clusters
The elbow method is introduced as a technique to find the optimal number of clusters ('k'). By plotting the sum of squared errors (SSE) for different values of k, the elbow point, where the SSE stops dropping sharply and begins to level off, is identified as the optimal number of clusters. The concept is demonstrated with an example dataset, and it's explained that the choice of k can vary with how the dataset is interpreted.
Data Preparation and Initial Clustering in Python
The script transitions to the coding part, where a dataset containing 'age' and 'income' is loaded into a pandas DataFrame. A scatter plot is generated to visualize potential clusters, and k-means is applied to group the data into three clusters. The resulting clusters are visualized, but a problem arises due to unscaled features, leading to inaccurate cluster formations.
Scaling Features with Min-Max Scaler
To address the issue of unscaled features, the Min-Max Scaler is used to scale the 'age' and 'income' features between 0 and 1. This improves the accuracy of the clustering process, resulting in better-defined clusters. The script shows the importance of preprocessing data by scaling features before applying k-means, ensuring that the distance between data points is correctly measured.
Visualizing Centroids and Using the Elbow Method
The script demonstrates how to visualize cluster centroids on a scatter plot, providing a clearer view of the k-means clustering result. Additionally, the elbow plot method is revisited to help determine the best value of 'k'. By plotting the sum of squared errors for different values of k, the elbow point indicates that three clusters are optimal for the given dataset.
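The centroid overlay can be sketched like this. It is a minimal example on invented, already-scaled points, not the tutorial's dataset; the marker style and colors are arbitrary choices, and the `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Invented, already-scaled 2-D points in three groups.
rng = np.random.default_rng(1)
group_centers = np.array([[0.1, 0.1], [0.9, 0.2], [0.8, 0.9]])
X = np.vstack([c + 0.03 * rng.standard_normal((10, 2)) for c in group_centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

fig, ax = plt.subplots()
for cluster_id, color in zip(range(3), ["green", "red", "black"]):
    members = X[labels == cluster_id]
    ax.scatter(members[:, 0], members[:, 1], color=color,
               label=f"cluster {cluster_id}")

# cluster_centers_ has one row per cluster: column 0 is x, column 1 is y.
centers = km.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], color="purple", marker="*",
           s=200, label="centroid")
ax.set_xlabel("feature 1 (scaled)")
ax.set_ylabel("feature 2 (scaled)")
ax.legend()
```

In a notebook the backend line can be dropped and the figure displays inline.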
Exercise: Clustering the Iris Dataset
For the exercise, the script introduces the Iris flower dataset, a common dataset for practicing machine learning. The task is to cluster the data using petal length and width features, while ignoring the class labels. Participants are encouraged to apply the elbow method to determine the optimal number of clusters and share their results in the comments. A link to the Jupyter notebook is provided for reference.
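One possible starting point for the exercise, offered as a sketch rather than a reference solution. It loads the data with scikit-learn's `load_iris`, keeps only the two petal columns, and compares the clusters with the known species afterwards. The value `k = 3` is a placeholder, since the exercise asks you to choose k from your own elbow plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
# Columns are sepal length, sepal width, petal length, petal width (cm);
# keep only the two petal columns.
petals = iris.data[:, 2:4]

k = 3  # placeholder: pick k from an elbow plot as the exercise asks
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(petals)

# Optional sanity check against the species labels, which the clustering
# itself never sees: count how each species is spread over the clusters.
for species_id, species in enumerate(iris.target_names):
    counts = np.bincount(labels[iris.target == species_id], minlength=k)
    print(species, counts)
```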
Conclusion and Engagement
The final paragraph encourages viewers to give a thumbs up if they enjoyed the tutorial and to share it with friends. The tutorial aims to help viewers understand k-means clustering and how to implement it in Python, with practical applications like the exercise on the Iris dataset.
Keywords
Supervised Learning
Unsupervised Learning
Reinforcement Learning
K-Means Clustering
Centroid
Elbow Method
Sum of Squared Error (SSE)
Min-Max Scaling
Fit and Predict
Cluster
Highlights
Machine learning algorithms are categorized into three main categories: supervised, unsupervised, and reinforcement learning.
In unsupervised learning, we try to identify the underlying structure in the data or find clusters in the dataset.
K-means is a popular clustering algorithm, where the user must define the number of clusters (k) before running the algorithm.
The algorithm begins by randomly placing k points (centroids) in the data space, and assigns each data point to the nearest centroid.
The centroid positions are updated iteratively by calculating the 'center of gravity' of the clusters until no data points change their clusters.
A critical aspect of K-means is determining the optimal value of k, which can be done using the elbow method.
The elbow method involves computing the sum of squared errors (SSE) for different values of k and plotting them to identify the 'elbow' point, indicating the best k value.
In real-world datasets, where features are not limited to two dimensions, visualizing clusters becomes challenging, making the elbow method a useful tool.
Preprocessing and feature scaling using techniques like Min-Max scaling can improve the results of K-means clustering.
When features like income and age are scaled, clusters become more accurate, as seen in the example where three clusters are identified.
K-means returns centroids for each cluster, which can be visualized to better understand the grouping of data points.
To illustrate the elbow method, a range of k values from 1 to 10 is tested, and the SSE is calculated for each value.
Plotting the SSE for each k value reveals that k=3 is optimal for the sample dataset, based on the elbow method.
An exercise is provided for users to apply K-means clustering on the Iris dataset, ignoring the target label and using only petal length and width.
The tutorial emphasizes practical applications by guiding users through theory, coding, and an exercise section.
Users are encouraged to experiment with different datasets, apply the elbow method, and share their results.
Transcripts
machine learning algorithms are
categorized into three main categories
supervised unsupervised and
reinforcement learning
up till now we have looked into
supervised learning where
in the given data set you have your
class label or a target variable present
in unsupervised learning all you have is
set of features you don't know about
your
target variable or a class label using
this data set we try to identify the
underlying structure in that data
or we sometimes try to find the clusters
in that data
and we can make useful predictions out
of it
k-means is a very popular clustering
algorithm and that's what we are going
to
look into today as usual the tutorial
will be
in three parts the first part is theory
then coding and then exercise
let's say you have a data set like this
where x and y
axis represent the two different
features
and you want to identify clusters in
this data set
now when the data set is given to you
you don't have any information on target
variables so
you don't know what you're looking for
all you're trying to do is identify
some structure into it and one way of
looking into this
is these two clusters just by visual
examination we can say
that this data set has these two
clusters and
k-means uh helps you identify
uh these clusters now k in k means is a
free parameter
wherein before you start the algorithm
you have to tell the algorithm what is
the value of
k that you are looking for here our k
is equal to two
so let's say you have this data set
you start with k is equal to two
and the first step is to identify
uh two random points which you consider
as the center of those two clusters
we call them centroids as well so you
just put
two random points here if your k was
let's say three then you will put
three random points okay and these could
be placed
anywhere in this 2d space doesn't matter
next step is to identify the distance
of each of these data points
from these centroids so for example this
data point is more near to this centroid
hence we'll say it belongs to red
cluster
whereas this data point is more near to
green so we'll say this belongs to green
cluster
the simple mathematical way to identify
the distance
is to draw this kind of line connecting
those two centroids and then draw
a perpendicular line
anything on the left hand side is red
cluster on right hand side is green
cluster
so there you go you already have your
two
imperfect clunky clusters and now we
try to improve these clusters okay so
you started
you only got your two clusters now we'll
make them
better and better at every stage and the
way you do that
is you will try to adjust
the centroids for these two
clusters for example
for this red cluster which is these
four data points
you will try to find the center of
gravity almost
and you'll put the red center there
and you do the same thing for green one
so you get this when you make the
adjustment
and now you repeat the same process
again again you recompute
the distance of each of these points
from these centroids
and then if the point is more near to
red you put it
in a red cluster otherwise you put
it in a green cluster
okay so you repeat the same method and
see now these points got changed from
green to red so
they're more near to red that's why
they're in red cluster
and you keep on repeating this process
you just
recalculate your centroids then
recalculate the distance of
individual data points from these
centroids and readjust the clusters
until the point that none of the data
points
change the cluster so here right now see
there is only one green
which is changing its cluster so now
it's in red
but after this we are done even if you
try to recompute everything
uh none of these data points will change
their position
hence we can say that this is final so
these are now
my final clusters now the most important
point here is you need to supply
k uh to your algorithm but what is a
good
number on k because here we have
two dimensional space in reality you
will have so many
features and it is hard to visualize
that data on a scatter plot
so which k should you start with well
there is a technique called elbow method
okay and we'll look into it but just to
look at our
data set we started with two
clusters but someone might say no these
are actually four cluster
third person might say oh they are
actually six clusters
so you can see like different uh people
might
interpret these things in a different
way and your job is to find out the best
possible k number okay and that
technique is
called elbow method and the way that
method works is you start with some k
okay so let's say we start with k is
equal to 2 and we
try to compute sum of square error
what it means is for each of the
clusters
you try to compute the distance of
individual data points from the centroid
you square it and then you sum it up so
for
this cluster we got sum of square
error one similarly for the second
cluster you will get
uh the error number two and you do that
for all your cluster
and in the end you get the total sum of
squared errors
now we do square just to handle negative
values there is
nothing more than that okay so now we
computed sse for k equal to 2 you
repeat the same process for k equal to 3
4 and so on
okay and once you have that
number you draw a plot like this
here i have k going from 1 to 11
and then on the y axis i have sum of
squared error
you'll realize that as you increase
number of uh clusters
it will decrease the error now it's kind
of
intuitive if you think about it at some
point
you can consider each of your data points as
its own
individual cluster where your sum of
square error becomes almost zero
okay so let's assume we have only 11
data points
at 11 value of k the error will become
zero
okay so error will keep on reducing
and the general guideline is to find out
an elbow
so the elbow is on this chart
this point is sort of like an elbow
okay so
here is a good cluster number okay so
for example
for whatever the data set this chart is
representing
uh a good k number would be four
all right so that was an elbow technique
let's
uh get into python coding now all right
so the problem we are going to
solve today is cluster uh this
particular data set
where you have age and income of
different people now
by clustering these uh data points
into various groups what you're trying
to find out is some characteristics of
these groups
maybe the group belongs to a particular
region in u.s where the salaries
are higher or the salaries are lower
or maybe that group belongs
to a certain profession
where the salaries are higher versus
less okay
so you try to identify some
characteristics of
these groups so right now we have just
name age and income and
first thing i'm going to do is import
that data set into pandas data frame so
here you can see that i imported
essential libraries and then i have my
data frame ready
with that and uh since the data set is
simple enough
i will first try to plot it on a scatter
plot
okay so when you plot it on a scatter
plot
of course i don't want to include name i
just want to plot
the age against the income so
df dot age df
income in dollar
i'll just use the same convention you
can use dot also
but since there's a bracket here i'll
use the same convention
okay when you plot this on scatter chart
you can kind of see
three clusters one two and three
so for this particular case
choosing k is pretty straightforward
so i will use
k means so k means is something imported
here okay and
of course you need to specify your k
which is n
underscore clusters and by the way in
jupyter notebook when you
type something and when you hit tab it
will auto complete
okay so it creates
this k means object for you
and it has all these default parameters
you can
tweak all these parameters later but i'm
just
trusting on the default parameters
the second step is fit and predict
so in previous supervised learning
algorithms we used to do fit and then
calculate the score here i'm just
directly doing
fit and predict so fit and predict what
okay i'm going to fit and predict
the data frame excluding the name column
because name column is string and it's
not going to be useful
in our numeric computation so i want to
ignore it
all right so you do fit
and predict and what you get back is
y predicted
so now what this statement did is it ran
k-means algorithm on age and income
which is this scatter plot and it
computed the cluster
as per our criteria where we told
algorithm to identify
three clusters somehow okay and it did
it
it just assigned them different labels
so you can see
three clusters 0 1 and 2. now
visualizing this array is not very
very much fun so what we want to do is
we want to
plot it again on on a scatter plot so
that we can see
what kind of clustering result did it
produced
okay so i am in my data frame i am going
to append
uh this particular column so that
my data frame looks like this so now
this is a little better where i can see
these two guys belongs to same group
these two belongs to same group
and so on but it is still not as good as
scatter plot
okay so let's
do this plot
dot scatter plot
all right now uh what we need to do
is we need to separate these three
clusters into
three different data frames so let me do
that
df1 is equal to df
df dot cluster
cluster is equal to zero
okay so what this is doing is it's
returning all the rows from dataframe
where cluster is zero
and the second one will be
this and the third one will be
this so now we have three different data
frames each belonging to one cluster
and i want to plot these three data
frames
onto
one scatter plot
okay now just to save some time let me
just
copy paste the code here
okay i will come at this little later
but see three different data frames
and we are plotting these uh data frames
into different color
okay so cluster zero is green then red
and black
let's see how that looks
okay so df
oh i made a mistake here
i had a typo
good all right so i see a scatter plot
here but there's a little problem
so this red cluster looks okay but there
is a problem with these two clusters you
know they are not
grouped correctly so this problem
happened because our scaling is not
right
our y-axis is scaled from let's say 40,000
to 160,000 and the range of
x-axis is pretty narrow see it's like
hardly
20 versus here is 120,000
so when you don't scale your features
properly
you might get into this problem
that's why we need to do some
pre-processing
and use min max scaler to scale
these two features and then only we can
run our algorithm
all right so we are going to use min
max scaler so the way you do it
is you will say scaler is min max scaler
and this is something if you already
noticed
we imported here okay
all right so scaler is this
and
scaler dot
fit df
so now i want to fit first the
income
all right so my min max scaler
will try to
make the scale 0 to 1 so after i'm done
with my scaling
i will have a scale of 0 to 1 on y as
well as x
axis all right so df
let me just uh copy paste this guy here
is equal to scaler dot transform
okay so now scaler will
um scale the income feature
all right so df this
okay let's see how that did it
so you can see that the income is
scaled right it's like say
0.38 and so on so it is in a range of
zero to one you will not see any value
outside zero to one range
we want to do the same thing for our age
also
okay so let's do that
scaler dot fit df
dot age df
dot age is equal to scaler dot
transform df dot
age and then we
print our df
and you can see the age is also scaled
okay i have this extra column because i
made a mistake previously
but you can ignore that you can ignore
cluster also
so we have age and income features
properly scaled now okay and even if you
plot these
on to scatter plot they will look
structure wise at least they will look
like this
okay all right so the next step
is to use k-means algorithm once again
to train our scaled data set
so it's gonna be fun now let's see what
scaling can give us
and as usual y predicted is equal to km
dot fit and predict
so again i started with three clusters
and i am using
um i'm just fitting my scaled data
age income
all right and let's see my
y predicted so it predicted some values
which i don't yet know how good they are
so i will just do cluster
is equal to y predicted
i will also just drop
the column that we
mistyped
and then let's look at df
okay in place is equal to
true okay so now this is my
new clustering result uh
let's plot this on to
our scatter plot
i'm just going to remove this for now
now you can see that i have a pretty
good cluster see black
green and red they look very nicely
formed
uh one of the things we studied in the
theory section was centroids
so if you look at km
which is your trained k-means
model that has
a variable called cluster centers
and these centers are basically your
centroids okay
so this is x this is y so this is the
first centroid of your first cluster
second centroid and third centroid
and if you can plot this into a scatter
plot
uh it can give a nice visualization to
us right so
plt dot scatter
so first let's plot x
axis okay so x axis
for this will be it will be what
okay so using this syntax you can say i
want to
go through all the rows which is three
rows here
and then the zero means first column
which is this
okay and your y
is your second column which is index one and
just to differentiate them
with regular data points
i will use some special marker and color
so you can see that these are the
centers
of my clusters
all right let's look into now elbow plot
method see this data set was simple but
when you're trying to solve a real life
problem you will come across data set
which will have like 20 features
it will be hard to plot it on scatter
plot and it will just get messy
and you will be like what do i do now
well you use your elbow plot method
so in elbow plot um
as we saw in theory we uh
go through a number of k values okay so let's
say we'll go
from k equal to 1 to 10 in our case okay
and then we try to calculate sse which
is sum of square
error and then plot them and try to find
this elbow
so let's define our
k range let's say i want to do
1 to 10
this will be 1 to 9 but whatever okay and
then
sum of squared error
is an array so for k is equal to 1
you'll find
sse k equal to 2 you will find sse you
will store all of that
into this array and then use matplotlib
to plot the result
okay so for k in k
range so i'm just going through one to
nine
and then each iteration i
create a new model with clusters equal
to
k and then
i call fit okay
and what what do i try to fit okay i try
to fit my
data frame but i use this syntax because
my data frame has name column i don't
want to use name column
all right you'll be like what the heck
this guy is doing
all the time using this crazy syntax but
that's to avoid
name if you want you can just create a
new data frame
just drop name column that is fine too
and all right so now
what is my sum of square error how do i
get that
when you call km dot fit after that
on your k means there is
an attribute called inertia_ that will
give you
the sum of square error and that error
we want to just append
it to our array that we have
all right that was pretty fast because
our data set is very small
okay let's see what is sse so sse
you can see that sum of squared error
was very high initially then it kept on
reducing
and now let's plot this guy
into nice chart
okay when you do that you get
our elbow plot remember elbow plot
elbow all right where is my elbow where
is my elbow
okay here is my elbow you can see that k
is equal to 3
for my elbow and that's what happened
see i have
three clusters for exercise we are going
to use
our iris flower data set from sklearn
library
and what you have to do is use petal
length and width features
just drop sepal length and width
because
it makes your clustering little bit
difficult so just drop these two
features for simplicity
use the petal length and width features
and try to form
clusters in that data set now that data
set has a class label
in the target variable but you should
just ignore it
okay you can use that uh just to confirm
your results
and in the end you will draw an elbow
plot
to find out the optimal value of k
alright so just
do the exercise post your results into
the video comments below also
i have provided a link of jupyter
notebook used in this tutorial
in the video description so look at it
when you go towards the end you will
find the exercise sections
also don't forget to give it a thumbs up
if you like the content of this tutorial
you can also share it with your friends