Machine Learning Tutorial Python - 13: K Means Clustering Algorithm

codebasics
4 Feb 2019 · 25:15

Summary

TL;DR: The video introduces machine learning, focusing on unsupervised learning, particularly the K-means clustering algorithm. It explains how K-means helps identify clusters within a dataset without predefined labels. The steps of choosing K, initializing random centroids, calculating distances, and refining clusters are demonstrated. The video covers the Elbow Method to determine the optimal number of clusters and explores coding this in Python. Using a dataset with age and income, the tutorial shows how clustering reveals hidden group characteristics. The video concludes with an exercise using the iris dataset and an elbow plot to find the optimal K.

Takeaways

  • 📊 Machine learning algorithms are categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning.
  • 🔍 Unsupervised learning focuses on identifying underlying structures or clusters in the data without knowing the target variable.
  • 📉 K-means is a popular clustering algorithm used to divide a dataset into clusters based on proximity to centroids.
  • 🔢 The parameter 'k' in K-means represents the number of clusters and needs to be specified before running the algorithm.
  • 🎯 The process involves initializing random centroids, calculating distances between points and centroids, and adjusting clusters iteratively.
  • 🧲 The centroid positions are recalculated until no data points change clusters, leading to final, stable clusters.
  • 💡 The elbow method helps determine the optimal number of clusters by plotting the sum of squared errors against different values of 'k'.
  • 📏 Scaling features like age and income using MinMaxScaler can improve clustering accuracy by normalizing the feature range.
  • 📊 Visualizing clusters with scatter plots helps understand the clustering results better, especially after scaling.
  • 🔬 For real-world datasets with many features, the elbow method avoids manual cluster visualization and identifies the best number of clusters efficiently.

Q & A

  • What are the three main categories of machine learning algorithms?

    -The three main categories of machine learning algorithms are supervised learning, unsupervised learning, and reinforcement learning.

  • How is supervised learning different from unsupervised learning?

    -In supervised learning, the dataset contains a target variable or class label, which helps in training the model to make predictions. In unsupervised learning, there is no target variable, and the goal is to identify patterns or structures within the data, such as clusters.

  • What is the K-means algorithm used for?

    -K-means is a popular clustering algorithm used in unsupervised learning to group data points into clusters based on their similarities. It starts by randomly selecting centroids and then iteratively refines the clusters by minimizing the distance between the data points and their respective centroids.
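
    To make that loop concrete, here is a small from-scratch sketch of one such iteration in NumPy. This is illustrative only; the video itself uses scikit-learn's KMeans, and the function name kmeans_step is my own.

        import numpy as np

        def kmeans_step(X, centroids):
            """One iteration: assign each point to its nearest centroid,
            then move each centroid to the mean of its assigned points."""
            # Pairwise distances from every point to every centroid, shape (n_points, k)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its cluster
            # (empty clusters are ignored here for brevity)
            new_centroids = np.array([X[labels == j].mean(axis=0)
                                      for j in range(len(centroids))])
            return labels, new_centroids

        # Repeat kmeans_step until no point changes its label:
        # that is the stopping rule described above.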

  • How do you determine the initial value of K in K-means?

    -The value of K, which represents the number of clusters, is a free parameter that must be specified before running the K-means algorithm. The choice of K can be based on visual inspection of the data or using methods like the elbow method to find the optimal value.

  • What is the role of centroids in K-means clustering?

    -Centroids represent the center of each cluster. The algorithm assigns each data point to the nearest centroid, and these centroids are adjusted iteratively to minimize the distance between the data points and the centroid of their assigned cluster.

  • What is the elbow method, and how is it used in K-means?

    -The elbow method helps in selecting the optimal value of K by plotting the sum of squared errors (SSE) for different values of K. The goal is to find a point on the curve where the SSE starts to decrease at a slower rate, forming an 'elbow.' This point indicates a good value for K.

  • Why is feature scaling important in K-means?

    -Feature scaling is important because K-means uses distance calculations to form clusters. If the features have different scales (e.g., one in thousands and another in tens), the larger scale feature will dominate the distance calculation, leading to incorrect clustering. Scaling ensures that all features contribute equally.
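
    A quick numeric illustration of the point (the numbers are made up): with raw units, a modest income gap swamps even a large age gap in the Euclidean distance, and min-max scaling restores the balance.

        import numpy as np

        a = np.array([25, 40_000])   # (age, income)
        b = np.array([60, 41_000])
        print(np.linalg.norm(a - b))  # ~1000.6: dominated entirely by income

        # After min-max scaling both features to [0, 1] (assuming age spans
        # 25-60 and income spans 40k-165k), the same pair looks like:
        a_s = np.array([0.0, 0.0])
        b_s = np.array([1.0, 0.008])
        print(np.linalg.norm(a_s - b_s))  # ~1.0: the age gap now dominates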

  • What is inertia in the context of K-means clustering?

    -Inertia refers to the sum of squared distances between each data point and the nearest centroid. It is used as a measure of how well the clusters are formed, with lower inertia indicating better clustering. In the elbow method, inertia is plotted to determine the optimal number of clusters.

  • How does the K-means algorithm stop, and what does the final result represent?

    -The K-means algorithm stops when no data points change clusters between iterations. At this point, the clusters are considered stable, and the centroids represent the center of each cluster. The final result is the partitioning of the data into distinct clusters.

  • How would you apply K-means clustering to a real-world dataset like the age and income dataset mentioned in the script?

    -To apply K-means clustering to the age and income dataset, the first step is to preprocess the data, including scaling the features (age and income). Next, the K-means algorithm is run with an initial value of K (e.g., 3 for three clusters). The clusters are then refined iteratively by adjusting centroids and reassigning data points until the clusters stabilize. Finally, an elbow plot can be used to determine the optimal K.
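
    Condensed into code, that workflow looks roughly like this; the file name income.csv and the column names Name, Age, and Income($) are assumptions based on the video's narration, not a verbatim copy of its notebook.

        import pandas as pd
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import MinMaxScaler

        df = pd.read_csv("income.csv")            # columns: Name, Age, Income($)

        # Scale both features to [0, 1] so neither dominates the distance
        scaler = MinMaxScaler()
        df[['Age', 'Income($)']] = scaler.fit_transform(df[['Age', 'Income($)']])

        # Cluster on the scaled features, ignoring the Name column
        km = KMeans(n_clusters=3)
        df['cluster'] = km.fit_predict(df[['Age', 'Income($)']])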

Outlines

00:00

🤖 Introduction to Machine Learning Categories

The script introduces the three main categories of machine learning: supervised, unsupervised, and reinforcement learning. It explains that supervised learning involves datasets with target variables, while unsupervised learning works without labeled data, focusing on identifying underlying structures or clusters. The concept of clustering and the popular k-means algorithm are introduced, with a preview of how the tutorial will be divided into theory, coding, and an exercise.

05:01

📊 Identifying Clusters in Unsupervised Learning

The script explores the process of identifying clusters in a dataset with no target labels using k-means clustering. The algorithm requires specifying a parameter 'k', representing the number of clusters. The example demonstrates how random centroids are chosen and data points are assigned to clusters based on proximity. The centroids are adjusted iteratively until the clusters stabilize, ensuring that no points change their clusters.

10:05

🎯 Using the Elbow Method to Determine Optimal Clusters

The elbow method is introduced as a technique to find the optimal number of clusters ('k'). By plotting the sum of squared errors (SSE) for different values of k, the elbow point, beyond which SSE decreases only slowly, is identified as a good choice for the number of clusters. The concept is demonstrated with an example dataset, and it's explained that the choice of k can vary based on how the dataset is interpreted.

15:06

๐Ÿผ Data Preparation and Initial Clustering in Python

The script transitions to the coding part, where a dataset containing 'age' and 'income' is loaded into a pandas DataFrame. A scatter plot is generated to visualize potential clusters, and k-means is applied to group the data into three clusters. The resulting clusters are visualized, but a problem arises due to unscaled features, leading to inaccurate cluster formations.

20:08

🛠 Scaling Features with Min-Max Scaler

To address the issue of unscaled features, the Min-Max Scaler is used to scale the 'age' and 'income' features between 0 and 1. This improves the accuracy of the clustering process, resulting in better-defined clusters. The script shows the importance of preprocessing data by scaling features before applying k-means, ensuring that the distance between data points is correctly measured.

25:09

📈 Visualizing Centroids and Using the Elbow Method

The script demonstrates how to visualize cluster centroids on a scatter plot, providing a clearer view of the k-means clustering result. Additionally, the elbow plot method is revisited to help determine the best value of 'k'. By plotting the sum of squared errors for different values of k, the elbow point indicates that three clusters are optimal for the given dataset.

🌸 Exercise: Clustering the Iris Dataset

For the exercise, the script introduces the Iris flower dataset, a common dataset for practicing machine learning. The task is to cluster the data using petal length and width features, while ignoring the class labels. Participants are encouraged to apply the elbow method to determine the optimal number of clusters and share their results in the comments. A link to the Jupyter notebook is provided for reference.

๐Ÿ‘ Conclusion and Engagement

The final paragraph encourages viewers to give a thumbs up if they enjoyed the tutorial and to share it with friends. The tutorial aims to help viewers understand k-means clustering and how to implement it in Python, with practical applications like the exercise on the Iris dataset.

Keywords

💡Supervised Learning

Supervised learning is a type of machine learning where the model is trained on a labeled dataset, meaning that each input is paired with an output label or target variable. In the video, it is mentioned as one of the three main categories of machine learning algorithms, where the model learns to make predictions based on the provided labels. This concept is fundamental in understanding how algorithms can learn from data to make informed predictions.

💡Unsupervised Learning

Unsupervised learning refers to a type of machine learning where the algorithm is provided with input data that has no corresponding output labels. The goal is to identify patterns, structures, or clusters within the data. In the video, unsupervised learning is discussed in the context of clustering algorithms, like K-Means, where the objective is to find hidden structures within the data without any predefined labels.

💡Reinforcement Learning

Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing certain actions in an environment to maximize some notion of cumulative reward. Although briefly mentioned in the video as one of the main categories of machine learning, reinforcement learning differs from supervised and unsupervised learning in that it is based on a system of rewards and punishments, rather than labeled data.

💡K-Means Clustering

K-Means is a popular clustering algorithm in unsupervised learning used to partition data into K distinct clusters. Each cluster is defined by a centroid, and the algorithm iteratively refines these centroids to minimize the variance within clusters. The video explains how K-Means works, including the initial random placement of centroids, the assignment of data points to the nearest centroid, and the iterative process of adjusting centroids until the clusters stabilize.

💡Centroid

A centroid is the center point of a cluster in K-Means clustering. It represents the average position of all the data points within the cluster. In the video, centroids are discussed as the points around which clusters are formed and how the algorithm iteratively adjusts these centroids to refine the clusters. The concept is crucial for understanding how the K-Means algorithm operates.

💡Elbow Method

The Elbow Method is a technique used to determine the optimal number of clusters in a K-Means clustering algorithm. It involves plotting the sum of squared errors (SSE) for different values of K and identifying the 'elbow point' where the rate of decrease in SSE sharply slows. The video uses this method to illustrate how to choose the best K value, demonstrating its importance in achieving effective clustering.

💡Sum of Squared Error (SSE)

Sum of Squared Error (SSE) is a metric used in clustering to measure the total variance within clusters. It is calculated by summing the squared distances between each data point and its corresponding centroid. In the video, SSE is used to evaluate the quality of clustering, with lower SSE values indicating tighter clusters. The Elbow Method relies on plotting SSE against different K values to find the optimal number of clusters.
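
In symbols, with x_i denoting a data point, c(i) the cluster it is assigned to, and \mu_{c(i)} that cluster's centroid, the quantity plotted in the elbow method is:

    SSE = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2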

💡Min-Max Scaling

Min-Max Scaling is a preprocessing technique that scales data features to a fixed range, typically [0, 1]. It is used to ensure that different features have a comparable scale, which is crucial for algorithms like K-Means that are sensitive to the magnitude of the data. The video demonstrates how scaling can impact clustering results, showing that unscaled features can lead to inaccurate clusters.

💡Fit and Predict

Fit and Predict are common steps in applying machine learning models. 'Fit' refers to training the model on a dataset, while 'Predict' involves using the trained model to make predictions on new data. In the video, these steps are used with the K-Means algorithm to cluster data after scaling, illustrating the practical application of these methods in machine learning workflows.
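
For clustering, scikit-learn also exposes the two steps as a single combined call; a minimal usage sketch, where X stands for any feature matrix:

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=3)
    labels = km.fit_predict(X)   # trains the model and returns a cluster label per row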

💡Cluster

A cluster in the context of K-Means and unsupervised learning is a group of data points that are more similar to each other than to those in other clusters. The video explains how the K-Means algorithm groups data points into clusters and how these clusters can be used to make predictions or identify patterns within the data. Clustering is a key concept in understanding how unsupervised learning can organize and interpret complex datasets.

Highlights

Machine learning algorithms are categorized into three main categories: supervised, unsupervised, and reinforcement learning.

In unsupervised learning, we try to identify the underlying structure in the data or find clusters in the dataset.

K-means is a popular clustering algorithm, where the user must define the number of clusters (k) before running the algorithm.

The algorithm begins by randomly placing k points (centroids) in the data space, and assigns each data point to the nearest centroid.

The centroid positions are updated iteratively by calculating the 'center of gravity' of the clusters until no data points change their clusters.

A critical aspect of K-means is determining the optimal value of k, which can be done using the elbow method.

The elbow method involves computing the sum of squared errors (SSE) for different values of k and plotting them to identify the 'elbow' point, indicating the best k value.

In real-world datasets, where features are not limited to two dimensions, visualizing clusters becomes challenging, making the elbow method a useful tool.

Preprocessing and feature scaling using techniques like Min-Max scaling can improve the results of K-means clustering.

When features like income and age are scaled, clusters become more accurate, as seen in the example where three clusters are identified.

K-means returns centroids for each cluster, which can be visualized to better understand the grouping of data points.

To illustrate the elbow method, a range of k values from 1 to 10 is tested, and the SSE is calculated for each value.

Plotting the SSE for each k value reveals that k=3 is optimal for the sample dataset, based on the elbow method.

An exercise is provided for users to apply K-means clustering on the Iris dataset, ignoring the target label and using only petal length and width.

The tutorial emphasizes practical applications by guiding users through theory, coding, and an exercise section.

Users are encouraged to experiment with different datasets, apply the elbow method, and share their results.

Transcripts

00:00

Machine learning algorithms are categorized into three main categories: supervised, unsupervised, and reinforcement learning. Up till now we have looked into supervised learning, where the given dataset has a class label or target variable present. In unsupervised learning, all you have is a set of features; you don't know your target variable or class label. Using this dataset, we try to identify the underlying structure in the data, or we sometimes try to find the clusters in the data, and we can make useful predictions out of it. K-means is a very popular clustering algorithm, and that's what we are going to look into today. As usual, the tutorial will be in three parts: first theory, then coding, and then an exercise.

00:50

Let's say you have a dataset like this, where the x and y axes represent two different features, and you want to identify clusters in it. When the dataset is given to you, you don't have any information on target variables, so you don't know what you're looking for; all you're trying to do is identify some structure in it. One way of looking at this is as two clusters: just by visual examination we can say that this dataset has these two clusters, and k-means helps you identify them. Now, the k in k-means is a free parameter: before you start the algorithm, you have to tell it what value of k you are looking for. Here our k is equal to two.

So let's say you have this dataset and you start with k equal to two. The first step is to pick two random points, which you consider the centers of those two clusters; we call them centroids. You just put two random points here (if your k were, say, three, then you would put three random points), and these can be placed anywhere in this 2D plane; it doesn't matter.

02:08

The next step is to compute the distance of each of the data points from these centroids. For example, this data point is nearer to this centroid, hence we'll say it belongs to the red cluster, whereas this data point is nearer to the green one, so we'll say it belongs to the green cluster. A simple mathematical way to identify this is to draw a line connecting the two centroids and then draw a perpendicular through its middle: anything on the left-hand side is the red cluster, anything on the right-hand side is the green cluster. So there you go: you already have your two imperfect, clunky clusters.

02:52

Now we try to improve these clusters; we'll make them better and better at every stage. The way you do that is by adjusting the centroids of the two clusters. For example, for this red cluster, which is these four data points, you find something like its center of gravity and put the red center there, and you do the same for the green one. Once you make that adjustment, you repeat the same process: you recompute the distance of each point from the centroids, and if a point is nearer to red you put it in the red cluster, otherwise you put it in the green cluster. Repeating this, see how these points changed from green to red; they're nearer to red, so that's why they're in the red cluster. You keep repeating this process (recalculate the centroids, recalculate the distance of the individual data points from those centroids, and readjust the clusters) until none of the data points change their cluster. Right now there is only one green point changing its cluster, so now it's red; after this we are done, because even if you recompute everything, none of the data points will change their position. Hence we can say this is final: these are now my final clusters.

04:36

The most important point here is that you need to supply k to your algorithm. But what is a good number for k? Here we have a two-dimensional space, but in reality you will have many features, and it is hard to visualize that data on a scatter plot. So which k should you start with? Well, there is a technique called the elbow method, and we'll look into it. Just looking at our dataset: we started with two clusters, but someone might say no, these are actually four clusters, and a third person might say they are actually six. Different people might interpret the data in different ways, and your job is to find the best possible k. That technique is called the elbow method.

05:28

The way the method works is that you start with some k, let's say k equal to 2, and try to compute the sum of squared error. What that means is: for each of the clusters, you compute the distance of the individual data points from the centroid, square it, and sum it up. So for this cluster we get sum of squared error number one; similarly, for the second cluster you get error number two. You do that for all your clusters, and in the end you get the total sum of squared errors. (We square just to handle negative values; there is nothing more to it.) Now that we have computed the SSE for k equal to 2, you repeat the same process for k equal to 3, 4, and so on. Once you have those numbers, you draw a plot like this: here I have k going from 1 to 11, and on the y axis I have the sum of squared error.

06:36

You'll realize that as you increase the number of clusters, the error decreases. It's kind of intuitive if you think about it: at some point every data point becomes its own individual cluster, and the sum of squared error becomes almost zero. So if we have only 11 data points, at k equal to 11 the error becomes zero. The error will keep on reducing, and the general guideline is to find an elbow on this chart: this point is sort of like an elbow, and that is a good cluster number. For whatever dataset this chart represents, a good k would be four. All right, that was the elbow technique; let's get into Python coding now.

07:33

All right, so the problem we are going to solve today is to cluster this particular dataset, where you have the age and income of different people. By clustering these data points into various groups, what you're trying to find out is some characteristics of those groups: maybe a group belongs to a particular region in the US where the salaries are higher or lower, or maybe it belongs to a certain profession where the salaries are higher versus lower. So you try to identify some characteristics of these groups. Right now we have just name, age, and income, and the first thing I'm going to do is import that dataset into a pandas DataFrame. Here you can see that I imported the essential libraries and have my DataFrame ready.

08:30

Since the dataset is simple enough, I will first try to plot it on a scatter plot. Of course, I don't want to include the name; I just want to plot the age against the income, so df.Age against df['Income($)']. I'll use the bracket convention (you can use the dot notation too, but since there's a bracket in the column name, I'll stay consistent). When you plot this on a scatter chart, you can kind of see three clusters: one, two, and three. So for this particular case, choosing k is pretty straightforward.

09:23

I will use KMeans, which is imported here, and of course you need to specify your k, which is the n_clusters argument. (By the way, in a Jupyter notebook, when you type something and hit Tab, it will autocomplete.) This creates the KMeans object for you with all its default parameters; you can tweak them later, but I'm just trusting the defaults.
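
A hedged reconstruction of the notebook steps narrated so far (the file name income.csv and the exact column names are assumptions):

    import pandas as pd
    from matplotlib import pyplot as plt
    from sklearn.cluster import KMeans

    df = pd.read_csv("income.csv")

    # Eyeball the data first: three groups are visible on this plot
    plt.scatter(df.Age, df['Income($)'])
    plt.xlabel('Age')
    plt.ylabel('Income($)')
    plt.show()

    km = KMeans(n_clusters=3)   # k chosen by visual inspection here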

10:05

The second step is fit and predict. In the previous supervised learning algorithms we used to fit and then calculate the score; here I'm directly doing fit and predict. Fit and predict what? The DataFrame, excluding the name column, because the name column is a string and won't be useful in our numeric computation, so I want to ignore it. You do fit and predict, and what you get back is y_predicted. What this statement did is run the k-means algorithm on age and income (this scatter plot) and compute the clusters as per our criteria, where we told the algorithm to identify three clusters somehow. And it did: it assigned them different labels, so you can see three clusters, 0, 1, and 2.

11:20

Now, visualizing this array is not much fun, so we want to plot it on a scatter plot again to see what kind of clustering result it produced. In my DataFrame I am going to append this column, so that my DataFrame looks like this. This is a little better: I can see these two rows belong to the same group, these two belong to the same group, and so on. But it is still not as good as a scatter plot.

11:59

So let's do plt.scatter. What we need to do is separate these three clusters into three different DataFrames. Let me do that: df1 = df[df.cluster == 0], which returns all the rows from the DataFrame where cluster is zero, and the second and third DataFrames likewise. So now we have three different DataFrames, each belonging to one cluster, and I want to plot all three onto one scatter plot. Just to save some time, let me copy-paste the code here (I will come back to it a little later), but see: three different DataFrames, and we plot each in a different color. Cluster zero is green, then red and black. Let's see how that looks. Oh, I made a mistake here, I had a typo. Good.

13:46

All right, so I see a scatter plot here, but there's a little problem. The red cluster looks okay, but these two clusters are not grouped correctly. This happened because our scaling is not right: the y axis goes from, let's say, 40,000 to 160,000, while the range of the x axis is pretty narrow, hardly 20, versus 120,000 here. When you don't scale your features properly, you can run into this problem. That's why we need to do some preprocessing and use MinMaxScaler to scale these two features, and only then run our algorithm.
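
Put together, the first (unscaled) clustering pass described above looks roughly like this, continuing from the sketch earlier:

    # Fit on the numeric columns only; Name is a string and is excluded
    y_predicted = km.fit_predict(df[['Age', 'Income($)']])
    df['cluster'] = y_predicted

    # One DataFrame per cluster, plotted in its own color
    df1 = df[df.cluster == 0]
    df2 = df[df.cluster == 1]
    df3 = df[df.cluster == 2]
    plt.scatter(df1.Age, df1['Income($)'], color='green')
    plt.scatter(df2.Age, df2['Income($)'], color='red')
    plt.scatter(df3.Age, df3['Income($)'], color='black')
    plt.xlabel('Age')
    plt.ylabel('Income($)')
    plt.show()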

14:34

All right, so we are going to use MinMaxScaler. The way you do it is: scaler = MinMaxScaler(), and, if you already noticed, this is something we imported at the top. First I want to fit the income column. The MinMaxScaler will bring the values into the 0-to-1 range, so after I'm done with the scaling I will have a 0-to-1 scale on the y as well as the x axis. So I call scaler.fit on the income column, and then assign df['Income($)'] = scaler.transform(...); now the scaler has scaled the income feature.

16:14

Let's see how that did. You can see that the income is scaled, with values like 0.38 and so on; it is in a range of zero to one, and you will not see any value outside that range. We want to do the same thing for the age: scaler.fit on the age column, then df.Age = scaler.transform(...). Then we print our df and you can see the age is also scaled. (I have an extra column here because I made a mistake previously, but you can ignore that, and ignore the cluster column too.) So we have the age and income features properly scaled now, and if you plot them on a scatter plot, structure-wise at least, they will look like this.
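
The scaling step, sketched from the narration:

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()

    scaler.fit(df[['Income($)']])
    df['Income($)'] = scaler.transform(df[['Income($)']])

    scaler.fit(df[['Age']])
    df['Age'] = scaler.transform(df[['Age']])

    df.head()   # both features now lie in the [0, 1] range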

17:19

All right, so the next step is to use the k-means algorithm once again to train on our scaled dataset. It's going to be fun; let's see what scaling can give us. As usual, y_predicted = km.fit_predict(...): again I started with three clusters, and I'm just fitting my scaled age and income data. Let's see my y_predicted: it predicted some values, and we don't yet know how good they are. So I will set df['cluster'] = y_predicted, and I will also drop the column that we typo'd earlier, with inplace=True; then let's look at df. So now this is my new clustering result; let's plot it on our scatter plot. I'm just going to remove this for now. Now you can see that I have pretty good clusters: black, green, and red all look very nicely formed.

19:15

One of the things we studied in the theory section was centroids. If you look at km, which is your trained k-means model, it has a variable called cluster_centers_, and these centers are basically your centroids. This is x, this is y: the first row is the centroid of your first cluster, then the second centroid and the third. If you plot these onto the scatter plot, it gives a nice visualization. So, plt.scatter: for the x axis, using this indexing syntax I go through all the rows (three rows here), and the zero means the first column; the y is the second column. Just to differentiate them from the regular data points, I will use a special marker and color. And there you can see: these are the centers of my clusters.
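
Re-clustering on the scaled features and overlaying the centroids, again as a sketch of what the narration describes:

    km = KMeans(n_clusters=3)
    df['cluster'] = km.fit_predict(df[['Age', 'Income($)']])

    df1 = df[df.cluster == 0]
    df2 = df[df.cluster == 1]
    df3 = df[df.cluster == 2]
    plt.scatter(df1.Age, df1['Income($)'], color='green')
    plt.scatter(df2.Age, df2['Income($)'], color='red')
    plt.scatter(df3.Age, df3['Income($)'], color='black')

    # cluster_centers_ is a (3, 2) array: column 0 is x (Age), column 1 is y (Income)
    plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
                color='purple', marker='*', label='centroid')
    plt.legend()
    plt.show()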

20:52

All right, let's now look into the elbow plot method. This dataset was simple, but when you're trying to solve a real-life problem you will come across datasets with, say, 20 features. It will be hard to plot them on a scatter plot, it will just get messy, and you'll be wondering what to do. Well, you use the elbow plot method. In the elbow plot, as we saw in the theory, we go through a number of k values, say from k equal to 1 to 10 in our case; then we calculate the SSE, the sum of squared error, plot the values, and try to find the elbow.

21:33

So let's define our k range, say range(1, 10) (this will actually be 1 to 9, but whatever), and the sum of squared errors is an array: for k equal to 1 you find the SSE, for k equal to 2 you find the SSE, you store all of that in the array, and then use matplotlib to plot the result. So, for k in k_rng (I'm just going through one to nine), in each iteration I create a new model with n_clusters equal to k and then call fit. And what do I fit? My DataFrame, but with this indexing syntax, because my DataFrame has a name column and I don't want to use it. You might be thinking, what the heck is this guy doing, using this crazy syntax all the time; that's just to avoid the name column. If you want, you can create a new DataFrame and drop the name column; that's fine too.

22:53

So what is my sum of squared error, and how do I get it? After you call km.fit, your k-means model has an attribute called inertia_ that gives you the sum of squared error, and we just append that error to our array. That ran pretty fast, because our dataset is very small. Let's see the SSE: the sum of squared error was very high initially, and then it kept on reducing. Now let's plot this into a nice chart. When you do that, you get our elbow plot. Remember the elbow plot; where is my elbow? Here it is: you can see that k equal to 3 is my elbow, and that's what happened, see, I have three clusters.
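
The elbow loop, reconstructed from the narration: fit a fresh model for each k and read the SSE off the inertia_ attribute.

    sse = []
    k_rng = range(1, 10)
    for k in k_rng:
        km = KMeans(n_clusters=k)
        km.fit(df[['Age', 'Income($)']])
        sse.append(km.inertia_)   # sum of squared distances to the nearest centroid

    plt.xlabel('K')
    plt.ylabel('Sum of squared error')
    plt.plot(k_rng, sse)
    plt.show()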

24:04

For the exercise, we are going to use the iris flower dataset from the sklearn library. What you have to do is use the petal length and width features; just drop sepal length and width, because they make your clustering a little difficult, so drop those two features for simplicity. Use the petal length and width features and try to form clusters in that dataset. That dataset has a class label in the target variable, but you should just ignore it; you can use it only to confirm your results. In the end, draw an elbow plot to find the optimal value of k. So do the exercise and post your results in the video comments below. I have also provided a link to the Jupyter notebook used in this tutorial in the video description, so take a look; towards the end you will find the exercise section. Also, don't forget to give the video a thumbs up if you like the content of this tutorial, and you can share it with your friends too.
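
One possible starting point for the exercise (my own setup, not the author's solution):

    from sklearn.datasets import load_iris
    import pandas as pd

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)

    # Keep only petal length/width, as the video suggests
    df = df[['petal length (cm)', 'petal width (cm)']]

    # From here: run KMeans, draw the elbow plot as above, and compare
    # the resulting labels against iris.target only to sanity-check.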

Related Tags
K-means, Clustering, Machine Learning, Unsupervised Learning, Python Coding, Data Science, Elbow Method, Algorithms, Data Clusters, AI Tutorial