StatQuest: K-means clustering

StatQuest with Josh Starmer
23 May 2018, 08:30

Summary

TL;DR: In this episode of StatQuest, Josh Starmer introduces k-means clustering, a method for organizing data points into distinct groups. The tutorial covers selecting the number of clusters (K), randomly choosing initial points, calculating distances to assign points to the nearest cluster, and iteratively updating cluster means until convergence. It also explores determining the optimal K value using an 'elbow plot' and discusses the difference between k-means and hierarchical clustering. The video concludes with examples of applying k-means to data on a number line, in two dimensions, and on a heatmap, illustrating the versatility of this clustering technique.

Takeaways

  • 📚 The video is an educational tutorial on k-means clustering, a method used to group data points into clusters.
  • 🔍 It explains how to identify the optimal number of clusters, 'K', for the data set, which is a crucial step in k-means clustering.
  • 🎯 The process begins by randomly selecting 'K' distinct data points as initial clusters and then iteratively refining these clusters.
  • 📏 The script describes how to calculate the distance of each data point from the centroids of the clusters to assign them to the nearest cluster.
  • ⚖️ It discusses the importance of calculating the mean of each cluster to update the centroids and reassign points based on these new values.
  • 🔄 The iterative process continues until the clusters no longer change, indicating that the clustering has converged to a stable solution (a minimal code sketch of this loop appears just after this list).
  • 📊 The quality of clustering can be assessed by measuring the total variation within each cluster, with lower variation indicating better clustering.
  • 📈 The 'elbow method' is introduced as a way to determine the best value for 'K' by observing the point where the reduction in variance per additional cluster diminishes.
  • 🤖 The video highlights that k-means clustering is different from hierarchical clustering in that it aims to partition data into a specified number of clusters.
  • 📈 The script also covers how to apply k-means clustering to data that is not on a number line, using Euclidean distance in two or more dimensions.
  • 🌡️ It mentions that k-means can be used with data represented as a heatmap, by treating the data points in a two-dimensional space and applying the same clustering principles.
  • 🎶 The video ends with a call to action for viewers to subscribe for more content and support the channel by liking the video and considering purchasing the creator's songs.
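
For readers who want to see the whole loop in one place, here is a minimal NumPy sketch of the algorithm as described above. It is written for this summary rather than taken from the video: kmeans is a hypothetical helper, the data are assumed to sit in an (n_points, n_dims) array, and empty clusters are not handled.

    import numpy as np

    def kmeans(data, k, n_init=10, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        best = (None, None, np.inf)  # (labels, centers, total within-cluster variation)
        for _ in range(n_init):  # several random starts; keep the best result
            # Step 2: pick k distinct data points as the initial clusters.
            centers = data[rng.choice(len(data), size=k, replace=False)]
            for _ in range(max_iter):
                # Steps 3-4: assign every point to its nearest center (Euclidean).
                dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
                labels = dists.argmin(axis=1)
                # Step 5: recompute each cluster mean; stop once nothing moves.
                new_centers = np.array([data[labels == j].mean(axis=0)
                                        for j in range(k)])
                if np.allclose(new_centers, centers):
                    break
                centers = new_centers
            # Total within-cluster variation (sum of squared distances to centers).
            variation = ((data - centers[labels]) ** 2).sum()
            if variation < best[2]:
                best = (labels, centers, variation)
        return best

On three well-separated blobs this sketch should recover the obvious grouping:

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.5, size=(20, 2)) for m in (0.0, 4.0, 8.0)])
    labels, centers, variation = kmeans(X, k=3)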

Q & A

  • What is the main topic of the video?

    -The main topic of the video is k-means clustering, a method used to group data into clusters based on their similarities.

  • Who is the presenter of the video?

    -The presenter of the video is Josh Starmer.

  • What is the purpose of using k-means clustering in the video?

    -The purpose of using k-means clustering in the video is to identify clusters in data without relying on visual inspection, such as distinguishing between different types of tumors or cell types.

  • What is the 'K' in k-means clustering?

    -The 'K' in k-means clustering represents the number of clusters that the algorithm is instructed to identify in the data.

  • How does k-means clustering start the clustering process?

    -K-means clustering starts by randomly selecting 'K' distinct data points as the initial clusters.
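
    As a small illustration of this step (assuming NumPy; the sample values are invented for this summary), picking K = 3 distinct points without replacement might look like:

        import numpy as np

        rng = np.random.default_rng(42)
        data = np.array([1.0, 1.5, 5.0, 5.2, 9.0, 9.3])
        # choose K = 3 distinct data points as the initial clusters
        initial_centers = data[rng.choice(len(data), size=3, replace=False)]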

  • What is the method used to assign data points to clusters in k-means clustering?

    -Data points are assigned to the nearest cluster based on the calculated distance between the point and the centroids of the clusters.
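
    A one-line NumPy version of this assignment step, sketched with made-up values on a number line (not code from the video):

        import numpy as np

        points = np.array([1.0, 1.5, 5.0, 5.2, 9.0])  # data on a number line
        centers = np.array([1.2, 5.1, 8.8])           # current cluster means
        # distance from every point to every center; argmin picks the nearest
        nearest = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
        print(nearest)  # [0 0 1 1 2]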

  • How does the k-means algorithm determine the best value for K?

    -The best value for K can be determined by trying different values and comparing the total variation within the clusters, often using an elbow plot to find the point beyond which adding more clusters no longer reduces the variation by much.

  • What is an elbow plot in the context of k-means clustering?

    -An elbow plot is a graphical representation that shows the reduction in variance as the number of clusters increases, helping to identify the optimal number of clusters by finding the 'elbow' point where the decrease in variance slows down.
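
    One way to produce such a plot, assuming scikit-learn and matplotlib are available (the video itself shows no code), is to fit KMeans for a range of K values and plot its inertia_ attribute, which is the total within-cluster variation:

        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(m, 0.5, size=(30, 2)) for m in (0.0, 4.0, 8.0)])

        ks = range(1, 9)
        inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                    for k in ks]
        plt.plot(list(ks), inertias, marker="o")
        plt.xlabel("K (number of clusters)")
        plt.ylabel("Total within-cluster variation")
        plt.show()  # pick K at the 'elbow', where the curve stops dropping steeply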

  • How does k-means clustering differ from hierarchical clustering?

    -K-means clustering aims to partition the data into a specified number of clusters, whereas hierarchical clustering builds a tree of clusters and does not require a predetermined number of clusters.

  • Can k-means clustering be applied to data that is not plotted on a number line?

    -Yes, k-means clustering can be applied to data in two or more dimensions using the Euclidean distance to measure the distance between points.

  • How is the Euclidean distance calculated for data with more than two dimensions?

    -The Euclidean distance for data with more than two dimensions is calculated as the square root of the sum of the squared differences between corresponding coordinates of points in the multidimensional space.
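
    A short sketch of this formula (a hypothetical euclidean helper, assuming NumPy) shows that the same expression covers any number of dimensions:

        import numpy as np

        def euclidean(a, b):
            # square root of the sum of squared coordinate differences;
            # works unchanged for any number of dimensions
            diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
            return np.sqrt((diff ** 2).sum())

        euclidean([0, 0], [3, 4])               # 2-D: sqrt(3**2 + 4**2) = 5.0
        euclidean([0, 0, 0, 0], [1, 2, 2, 4])   # 4-D: sqrt(1 + 4 + 4 + 16) = 5.0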

  • What is the significance of calculating the mean of each cluster in k-means clustering?

    -Calculating the mean of each cluster is important because it helps to reassign the data points to the nearest updated cluster center, which is essential for the iterative process of refining the clusters.

Outlines

00:00

📊 Introduction to K-Means Clustering

Josh Starmer introduces the concept of k-means clustering, a method for organizing data points into clusters based on their similarity. The video aims to demonstrate how to apply this technique to data that can be plotted on a line, an XY graph, or a heatmap. The process begins by selecting the number of clusters (K), choosing initial cluster centers, and iteratively refining the clusters by reassigning points to the nearest mean until no changes occur. The video also hints at a more sophisticated method for determining the optimal value of K.

05:01

🔍 Exploring K-Means Clustering Techniques and Cluster Quality Assessment

This paragraph delves deeper into the k-means clustering process, emphasizing the iterative nature of the algorithm and the importance of selecting the right value for K. It explains how to assess the quality of clustering by measuring the total variation within clusters. The video introduces the concept of an 'elbow plot' to help determine the optimal number of clusters by observing the rate of variance reduction. Additionally, it touches on the differences between k-means and hierarchical clustering, and how k-means can be applied to data in two or more dimensions using Euclidean distance. The summary concludes with a mention of how to handle heatmap data by treating it as a two-dimensional case for clustering.

Keywords

💡StatQuest

StatQuest is the name of the educational channel presented by Josh Starmer, which focuses on explaining statistical concepts in an accessible and engaging manner. In the context of the video, StatQuest serves as the platform through which the concept of k-means clustering is introduced and explained to the audience.

💡K-means Clustering

K-means clustering is a method of vector quantization, used to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. In the video, k-means clustering is the central theme, where Josh explains how to use this technique to group data points into clusters based on their similarities.

💡Cluster

A cluster in the context of k-means clustering refers to a grouping of data points that are similar to each other and dissimilar to points in other groups. The video script describes the process of assigning each data point to the nearest cluster based on calculated distances.

💡XY Graph

An XY graph, also known as a scatter plot, is a type of chart used to display data points on horizontal and vertical axes. In the video, the concept of plotting data points on an XY graph is used to visually represent the clusters that the k-means algorithm would identify.

💡Heat Map

A heat map is a graphical representation of data where individual values are represented as colors. In the script, Josh mentions the application of k-means clustering to a heat map, which could represent complex data with multiple dimensions.

💡K (in k-means)

The 'K' in k-means clustering refers to the number of clusters to be identified in the data set. The script explains that selecting the appropriate value for K is a crucial step in the clustering process, with Josh demonstrating how to choose K by using an elbow plot.

💡Euclidean Distance

Euclidean distance is the 'ordinary' straight-line distance between two points in Euclidean space. In the video, the Euclidean distance is used to measure the distance between data points and the centroids of the clusters during the k-means clustering process.

💡Centroid

The centroid is the central point of a shape, such as a triangle or a circle. In the context of clustering, the centroid is the mean point of all the data points in a cluster. The script describes how the mean (centroid) of each cluster is calculated and used to reassign data points in subsequent iterations of the k-means algorithm.
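
In code, the centroid update is simply a column-wise mean. A minimal NumPy sketch with values invented for illustration:

    import numpy as np

    cluster_points = np.array([[1.0, 2.0],
                               [2.0, 3.0],
                               [3.0, 4.0]])
    centroid = cluster_points.mean(axis=0)  # column-wise mean -> [2.0, 3.0]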

💡Iteration

In the context of the k-means algorithm, an iteration refers to a complete run through the process of assigning data points to clusters, recalculating centroids, and reassigning points based on the new centroids. The script mentions that the algorithm repeats this process until the clusters no longer change.

💡Elbow Plot

An elbow plot is a graphical tool used to determine the optimal number of clusters for k-means clustering. It plots the variance within the clusters against the number of clusters. In the script, Josh explains how to use an elbow plot to identify the point where adding more clusters does not significantly decrease the variance, indicating the optimal K.

💡Hierarchical Clustering

Hierarchical clustering is a clustering technique that builds a hierarchy of clusters. Unlike k-means, which requires the number of clusters to be specified in advance, hierarchical clustering does not require this specification and can provide a dendrogram to visualize the clustering structure. The script contrasts hierarchical clustering with k-means clustering.

Highlights

Introduction to k-means clustering and its application in grouping data on a line or an XY graph and even on a heatmap.

The necessity of choosing the best value for K in k-means clustering to identify the correct number of clusters.

Step-by-step explanation of the k-means clustering process starting from selecting the number of clusters.

The method of randomly selecting initial clusters and the importance of this step in the k-means algorithm.

Calculating the distance between data points and clusters to assign points to the nearest cluster.

The iterative process of recalculating the mean of each cluster and re-clustering based on the new means.

Assessing the quality of clustering by measuring the total variation within each cluster.

The concept of using multiple starting points to improve the reliability of k-means clustering results.

The use of an elbow plot to determine the optimal value for K by observing the reduction in variance.

Comparing k-means clustering with hierarchical clustering and their distinct approaches to data grouping.

Application of k-means clustering in two-dimensional space using Euclidean distance.

Explanation of how to handle data that is not plotted on a number line by using Euclidean distance in higher dimensions.

Approach to clustering heatmap data by renaming samples as x and y and plotting them on an XY graph.

The significance of calculating Euclidean distances for clustering data with more than two dimensions.

Encouragement for viewers to subscribe for more educational content on statistical methods.

Invitation for viewers to support the channel by liking the video and considering purchasing original songs.

Transcripts

play00:01

StatQuest

play00:02

[Music]

play00:04

stat quest stat quest stat quest hello

play00:14

I'm Josh Starmer and welcome to stat

play00:16

quest today we're going to be talking

play00:18

about k-means clustering we're gonna

play00:21

learn how to cluster samples that can be

play00:23

put on a line on an XY graph and even on

play00:27

a heat map and lastly we'll also talk

play00:30

about how to pick the best value for K

play00:33

imagine you had some data that you could

play00:36

plot on a line and you knew you needed

play00:38

to put it into three clusters maybe they

play00:40

are measurements from three different

play00:42

types of tumors or other cell types in

play00:45

this case the data make three relatively

play00:48

obvious clusters but rather than rely on

play00:52

our eye let's see if we can get a

play00:54

computer to identify the same three

play00:57

clusters to do this we'll use k-means

play01:01

clustering we'll start with raw data

play01:04

that we haven't yet clustered step one

play01:08

select the number of clusters you want

play01:10

to identify in your data this is the K

play01:13

in k-means clustering in this case we'll

play01:17

select K equals three that is to say we

play01:21

want to identify three clusters there is

play01:25

a fancier way to select a value for K

play01:27

but we'll talk about that later

play01:30

step two randomly select three distinct

play01:34

data points these are the initial

play01:37

clusters

play01:39

step 3 measure the distance between the

play01:42

first point and the three initial

play01:44

clusters this is the distance from the

play01:48

first point to the blue cluster this is

play01:52

the distance from the first point to the

play01:54

green cluster

play01:56

and this is the distance from the first

play01:58

point to the orange cluster well it's

play02:01

kind of yellow but we'll just call it

play02:03

orange for now step 4 assign the first

play02:07

point to the nearest cluster in this

play02:09

case the nearest cluster is the blue

play02:11

cluster now we do the same thing for the

play02:15

next point we measure the distances and

play02:19

then assign the point to the nearest

play02:22

cluster now we figure out which cluster

play02:25

the third point belongs to we measure

play02:29

the distances and then assign the point

play02:32

to the nearest cluster the rest of these

play02:36

points are closest to the orange cluster

play02:38

so they'll go in that one two now that

play02:42

all the points are in clusters we go on

play02:44

to step 5 calculate the mean of each

play02:49

cluster then we repeat what we just did

play02:52

measure and cluster using the mean

play02:55

values

play02:58

since the clustering did not change at

play03:01

all during the last iteration we're done

play03:04

BAM the k-means clustering is pretty

play03:09

terrible compared to what we did by eye

play03:12

we can assess the quality of the

play03:14

clustering by adding up the variation

play03:16

within each cluster here's the total

play03:20

variation within the clusters since

play03:23

k-means clustering can't see the best

play03:26

clustering its only option is to keep

play03:28

track of these clusters and their total

play03:30

variance and do the whole thing over

play03:32

again with different starting points so

play03:36

here we are again back at the beginning

play03:40

k-means clustering picks three initial

play03:42

clusters and then clusters all the

play03:45

remaining points calculates the mean of

play03:47

each cluster and then re clusters based

play03:50

on the new means it repeats until the

play03:52

clusters no longer change bit bit bit

play03:55

of bit of boop boop boop now that the

play03:59

data are clustered we sum the variation

play04:01

within each cluster

play04:04

and then we do it all again

play04:07

at this point k-means clustering knows

play04:10

that the second clustering is the best

play04:12

clustering so far but it doesn't know if

play04:15

it's the best overall so it will do a

play04:18

few more clusters it does as many as you

play04:20

tell it to do and then come back and

play04:23

return that one if it is still the best

play04:26

question how do you figure out what

play04:29

value to use for K with this data it's

play04:33

obvious that we should set K to three

play04:35

but other times it is not so clear one

play04:40

way to decide is to just try different

play04:42

values for K

play04:44

we'll start with k equals 1 k equals 1

play04:49

is the worst case scenario we can

play04:52

quantify its badness with the total

play04:54

variation now try K equals 2 K equals 2

play05:01

is better and we can quantify how much

play05:03

better by comparing the total variation

play05:05

within the two clusters to K equals 1

play05:09

now try K equals 3 k equals 3 is even

play05:15

better we can quantify how much better

play05:18

by comparing the total variation within

play05:20

the three clusters to k equals 2 now try

play05:24

k equals 4 the total variation within

play05:28

each cluster is less than when K equals

play05:30

3 each time we add a new cluster the

play05:35

total variation within each cluster is

play05:37

smaller than before and when there is

play05:39

only one point per cluster the variation

play05:42

equals 0 however if we plot the

play05:46

reduction in variance per value for K

play05:50

there is a huge reduction in variation

play05:53

with K equals three but after that the

play05:56

variation doesn't go down as quickly

play05:59

this is called an elbow plot and you can

play06:02

pick K by finding the elbow in the plot

play06:06

question how is k-means clustering

play06:09

different from hierarchical clustering

play06:13

k-means clustering specifically tries to

play06:16

put the data into the number of clusters

play06:18

you tell it to hierarchical clustering

play06:21

just tells you pairwise what two things

play06:25

are most similar question what if our

play06:30

data isn't plotted on a number line just

play06:33

like before you pick three random points

play06:36

and we use the Euclidean distance in two

play06:40

dimensions the Euclidean distance is the

play06:43

same thing as the Pythagorean theorem

play06:46

then just like before we assign the

play06:49

point to the nearest cluster and just

play06:53

like before we then calculate the center

play06:55

of each cluster and re cluster BAM

play07:01

although this looks good the computer

play07:03

doesn't know that until it does the

play07:05

clustering a few more times question

play07:09

what if my data is a heatmap well if we

play07:14

just have two samples we can rename them

play07:17

x and y and we can then plot the data in

play07:21

an XY graph then we can cluster just

play07:26

like before note we don't actually need

play07:30

to plot the data in order to cluster it

play07:32

we just need to calculate the distances

play07:34

between things when we have two samples

play07:38

or two axes the Euclidean distance is

play07:41

the square root of x squared plus y

play07:44

squared when we have three samples or

play07:47

three axes the Euclidean distance is the

play07:51

square root of x squared plus y squared

play07:53

plus Z squared and when we have four

play07:57

samples or four axes the Euclidean

play07:59

distance is the square root of x squared

play08:02

plus y squared plus Z squared plus a

play08:04

squared etc etc etc hooray

play08:09

we've made it to the end of another

play08:11

exciting stat quest if you like this

play08:14

stat quest and want to see more please

play08:16

subscribe and if you want to support

play08:18

stat quest well click the like button

play08:21

down below and consider buying one or

play08:23

two of my original songs alright tune in

play08:26

next time for another exciting stat

play08:28

quest


Related Tags
K-means, Clustering, Data Analysis, Machine Learning, Optimal K, Euclidean Distance, Heatmap Data, Statistical Method, Tutorial, Josh Starmer