t-SNE Simply Explained

ritvikmath
5 Sept 2023 · 25:49

Summary

TLDR: This video explores t-distributed stochastic neighbor embedding (t-SNE), a powerful method in machine learning for reducing high-dimensional data to two or three dimensions for visualization. Unlike PCA, t-SNE focuses on preserving local structure by comparing data points pair by pair. The video walks through the algorithm's mechanics, including a Gaussian kernel in the high-dimensional space and a heavier-tailed distribution in the low-dimensional space to combat the curse of dimensionality, and concludes with a comparison to PCA, highlighting t-SNE's non-linearity and higher computational cost.

Takeaways

  • The video introduces t-distributed stochastic neighbor embedding (t-SNE), a method for reducing the dimensionality of high-dimensional datasets.
  • t-SNE is distinct from principal component analysis (PCA) in that it operates on a pair-by-pair basis, focusing on preserving local relationships between data points.
  • The method begins by defining the probability that one data point would choose another as its neighbor, based on their distance in the high-dimensional space.
  • t-SNE calculates these probabilities with a function that resembles the ratio of two Gaussian probability density functions, aiming to maintain local structure in the lower-dimensional representation.
  • The algorithm dynamically selects a standard deviation (σ_i) for each data point so that a certain number of neighboring points fall within a specified number of standard deviations.
  • t-SNE combines the conditional probabilities into a symmetric similarity score p_ij that represents the likelihood of data points i and j being grouped together, based on their closeness in the high-dimensional space.
  • The goal of t-SNE is to find a lower-dimensional representation (y_1, y_2, ..., y_n) that preserves the similarities p_ij computed in the high-dimensional space.
  • The curse of dimensionality is addressed by using a Student's t-distribution with one degree of freedom (a Cauchy distribution) in the lower-dimensional space, which has heavier tails than a Gaussian.
  • The algorithm uses the KL Divergence to measure the mismatch between the original high-dimensional similarities p_ij and the new low-dimensional similarities q_ij, and aims to minimize it.
  • t-SNE uses gradient descent to adjust the positions of the lower-dimensional points y_i to minimize the KL Divergence and find an optimal representation.
  • t-SNE is more computationally expensive than PCA because it works on individual pairwise relationships, and it is particularly effective for datasets whose structure is not linear (a usage sketch follows this list).
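
In practice t-SNE is rarely implemented from scratch; purely as an illustration (the video shows no code), here is a minimal sketch of running an off-the-shelf implementation, scikit-learn's TSNE, on placeholder data. The array shapes, perplexity value, and random data are assumptions made for this example.

    import numpy as np
    from sklearn.manifold import TSNE

    # Placeholder data standing in for a real high-dimensional dataset:
    # 500 points, each a 50-dimensional vector.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 50))

    # Reduce to 2 dimensions for visualization. The perplexity parameter controls
    # the effective neighborhood size used when t-SNE chooses each sigma_i.
    Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    print(Y.shape)  # (500, 2) -- one 2-D point per original high-dimensional point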

Q & A

  • What is the full name of the method discussed in the video?

    -The full name of the method discussed in the video is T-Distributed Stochastic Neighbor Embedding, often abbreviated as t-SNE.

  • What is the primary goal of t-SNE?

    -The primary goal of t-SNE is to reduce the dimensionality of a high-dimensional data set to a much lower-dimensional space, typically two or three dimensions, to enable visualization of the data.

  • How does t-SNE differ from Principal Component Analysis (PCA) in terms of its approach to dimensionality reduction?

    -t-SNE operates on a pair-by-pair basis, focusing on preserving the local relationships between individual data points in the new lower-dimensional space, whereas PCA works on a global level, focusing on directions of maximum variance in the data set.

  • What is the significance of the probability P(J|I) in t-SNE?

    -P(J|I) represents the probability that, given data point I, it would choose data point J as its neighbor in the original high-dimensional space. It is a key component in defining the local relationships between data points.
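
As a sketch of this definition, the conditional probability can be computed from squared Euclidean distances with a Gaussian kernel; the function below assumes sigma_i is already known (how it is chosen is covered in a later question), and the names are illustrative.

    import numpy as np

    def p_j_given_i(X, i, sigma_i):
        """P(J|I): probability that point i would pick each other point as its neighbor."""
        d2 = np.sum((X - X[i]) ** 2, axis=1)       # squared distances ||x_i - x_j||^2
        logits = -d2 / (2.0 * sigma_i ** 2)        # Gaussian kernel in log space
        logits[i] = -np.inf                        # a point never picks itself
        p = np.exp(logits)
        return p / p.sum()                         # normalize over all k != i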

  • How does t-SNE address the 'curse of dimensionality' when mapping high-dimensional data to a lower-dimensional space?

    -t-SNE addresses the curse of dimensionality by using a heavier-tailed distribution, specifically a Student's t-distribution with one degree of freedom (also known as a Cauchy distribution), to maintain higher probabilities for longer distances in the lower-dimensional space.
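
For comparison, here is a sketch of the corresponding low-dimensional similarity q_ij with the Student's t (Cauchy) kernel, which replaces exp(-d^2) by 1/(1 + d^2) so that similarity decays much more slowly with distance (names are illustrative):

    import numpy as np

    def q_matrix(Y):
        """q_ij: pairwise similarities of the low-dimensional points under a Cauchy kernel."""
        d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
        inv = 1.0 / (1.0 + d2)                                      # heavy-tailed kernel
        np.fill_diagonal(inv, 0.0)                                  # exclude i == j
        return inv / inv.sum()                                      # normalize over all pairs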

  • What role does the KL Divergence play in the t-SNE algorithm?

    -The KL Divergence is used as a loss function in t-SNE to measure the difference between the original high-dimensional similarities (p_ij) and the new low-dimensional similarities (q_ij). The goal is to minimize the KL Divergence to ensure that the low-dimensional representation preserves the local structures of the high-dimensional data.
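
The loss itself is short to write down; a sketch assuming P and Q are the full n-by-n similarity matrices produced as above (the epsilon guard is an implementation detail, not part of the definition):

    import numpy as np

    def kl_divergence(P, Q, eps=1e-12):
        """KL(P || Q) summed over all pairs: the objective t-SNE minimizes."""
        P = np.maximum(P, eps)   # guard against log(0) and division by zero
        Q = np.maximum(Q, eps)
        return float(np.sum(P * np.log(P / Q)))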

  • How does t-SNE handle the dynamic setting of the standard deviation σ_i for each data point?

    -t-SNE dynamically sets the standard deviation σ_i for each data point based on the number of neighboring data points within a certain number of standard deviations of the Gaussian distribution. This allows t-SNE to adapt to different densities in the data, with lower σ_i for denser regions and higher σ_i for sparser regions.
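
The video describes this choice loosely; in the original paper the user supplies a single "perplexity" value and each σ_i is found by binary search so that the conditional distribution p_{·|i} has that perplexity. A rough sketch of such a search (function name, tolerance, and iteration cap are illustrative):

    import numpy as np

    def find_sigma(d2_i, target_perplexity, tol=1e-5, max_iter=50):
        """Binary-search sigma_i so that perplexity(p_{.|i}) ~= target_perplexity.
        d2_i: squared distances from point i to every *other* point (self excluded)."""
        lo, hi = 0.0, np.inf
        sigma = 1.0
        for _ in range(max_iter):
            p = np.exp(-d2_i / (2.0 * sigma ** 2))
            p /= p.sum()
            entropy = -np.sum(p * np.log2(p + 1e-12))
            perplexity = 2.0 ** entropy
            if abs(perplexity - target_perplexity) < tol:
                break
            if perplexity > target_perplexity:       # distribution too spread out -> shrink sigma
                hi = sigma
                sigma = (lo + sigma) / 2.0
            else:                                     # too concentrated -> grow sigma
                lo = sigma
                sigma = sigma * 2.0 if np.isinf(hi) else (sigma + hi) / 2.0
        return sigma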

  • What is the motivation behind using the Gaussian distribution to define probabilities in the high-dimensional space?

    -The motivation behind using the Gaussian distribution is that it naturally incorporates the distance between data points into the probability calculation, with closer points having higher probabilities of being chosen as neighbors and more distant points having lower probabilities.

  • How does t-SNE ensure that the local structures between data points are preserved in the lower-dimensional space?

    -t-SNE ensures the preservation of local structures by minimizing the KL Divergence between the original high-dimensional similarities and the new low-dimensional similarities, thus maintaining the relative closeness of data points that were close in the original space.

  • What computational method is used to find the low-dimensional representations (y_i) in t-SNE?

    -Gradient descent is used to find the low-dimensional representations (y_i) in t-SNE. The algorithm iteratively adjusts the positions of the data points in the lower-dimensional space to minimize the KL Divergence between the original and new similarity measures.
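
The gradient of the KL loss with respect to each y_i has a closed form in the original paper, dC/dy_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)/(1 + ||y_i − y_j||²); a bare-bones descent step might look like the sketch below (the full algorithm also uses momentum and "early exaggeration", omitted here):

    import numpy as np

    def tsne_gradient(P, Y):
        """Gradient of KL(P || Q) with respect to the low-dimensional points Y."""
        d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        inv = 1.0 / (1.0 + d2)                  # Student-t / Cauchy kernel
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()
        W = (P - Q) * inv                       # (p_ij - q_ij) / (1 + ||y_i - y_j||^2)
        # grad_i = 4 * sum_j W_ij * (y_i - y_j), written as one matrix product:
        return 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y

    def descent_step(P, Y, learning_rate=200.0):
        """One plain gradient-descent update of the low-dimensional points."""
        return Y - learning_rate * tsne_gradient(P, Y)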

Outlines

00:00

Introduction to t-SNE for Dimensionality Reduction

The video introduces a method called t-Distributed Stochastic Neighbor Embedding (t-SNE), used in machine learning and data science to reduce the dimensionality of high-dimensional datasets. Unlike Principal Component Analysis (PCA), t-SNE operates on a pair-by-pair basis, aiming to preserve the local relationships between data points in a lower-dimensional space, typically two or three dimensions, for visualization purposes. The method begins by defining the probability of one data point choosing another as its neighbor based on their distance in the high-dimensional space.

05:00

๐Ÿ” Defining Neighbor Probabilities in High-Dimensional Space

This paragraph delves into how t-SNE quantifies the likelihood of data points being neighbors based on their distances in high-dimensional space. It explains the formulation of probabilities using a Gaussian distribution, where the probability of choosing a neighbor decreases with increasing distance. The function accounts for all other data points in the dataset, ensuring that the selection of a neighbor is relative to the local context of each data point's position.

10:01

Adjusting Probabilities for a Dynamic Standard Deviation

The script discusses the dynamic selection of the standard deviation (σ) for the Gaussian distribution used in calculating neighbor probabilities. The choice of σ is tailored to fit a specific number of neighboring data points within a certain range of standard deviations. This dynamic adjustment accommodates varying densities of data points in different regions of the high-dimensional space.

15:01

Preserving Local Structures in Lower-Dimensional Space

The paragraph explains the goal of t-SNE to learn a lower-dimensional representation of data points while preserving the local similarity computed in the high-dimensional space. It introduces the concept of using a Gaussian distribution to measure similarities between points in the lower-dimensional space and the challenge of ensuring that these similarities match the original high-dimensional similarities.

20:02

The Curse of Dimensionality and t-SNE's Solution

This section addresses the challenge of the curse of dimensionality, which arises when trying to map high-dimensional data into a lower-dimensional space. The video illustrates how the area occupied by data points at a moderate distance from a target point increases exponentially with the number of dimensions. To overcome this, t-SNE modifies the similarity definition in the lower-dimensional space by using a heavier-tailed distribution, such as the Cauchy distribution, to maintain higher probabilities for larger distances.
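
The ratios quoted in the video follow a simple pattern: since the volume of a ball scales like radius^d, the "mid" shell between radius 1 and radius 2 is (2^d − 1) times the size of the "near" ball of radius 1, giving 3 in two dimensions, 7 in three, and 1023 in ten. A one-line check:

    for d in (2, 3, 10):
        print(d, 2 ** d - 1)   # mid-shell / near-ball volume ratio: 3, 7, 1023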

25:04

Utilizing the Student's t-Distribution for Similarity

The video explains the rationale behind using the Student's t-distribution with one degree of freedom (equivalent to the Cauchy distribution) in the lower-dimensional space. This distribution has heavier tails, allowing t-SNE to accommodate the increased probability of selecting a neighbor over a broader range of distances, thus addressing the limitations imposed by the curse of dimensionality.

Gradient Descent Optimization for t-SNE

The final paragraph outlines the optimization process of t-SNE, which involves using gradient descent to minimize the KL Divergence between the original high-dimensional similarities (p_ij) and the new low-dimensional similarities (q_ij). The goal is to find the lower-dimensional representations (y_i, y_j) that best preserve the local structures of the high-dimensional data.
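
Putting the earlier sketches together, the optimization loop is conceptually just repeated application of that gradient step from random starting positions; the skeleton below reuses the hypothetical descent_step helper sketched earlier and again omits momentum and early exaggeration.

    import numpy as np

    def optimize_embedding(P, n_points, out_dim=2, n_iter=1000, learning_rate=200.0, seed=0):
        """Minimal loop: start the y's at small random values, then follow the KL gradient."""
        rng = np.random.default_rng(seed)
        Y = rng.normal(scale=1e-4, size=(n_points, out_dim))
        for _ in range(n_iter):
            Y = descent_step(P, Y, learning_rate)   # from the earlier gradient sketch
        return Y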

๐ŸŒ Comparing t-SNE with PCA and Conclusion

The video concludes by contrasting t-SNE with PCA, highlighting that t-SNE is more focused on preserving local structures and does not assume linearity in the data, making it suitable for non-linear datasets. In comparison, PCA is a linear method that works well with linearly correlated data. The presenter invites viewers to leave questions or comments and encourages them to like, subscribe, and watch more videos on similar topics.
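
One way to see this contrast concretely is to run both methods on data with clear non-linear structure; the sketch below uses scikit-learn's swiss-roll generator purely as an illustrative choice, not something taken from the video.

    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X, color = make_swiss_roll(n_samples=1000, random_state=0)   # 3-D non-linear manifold

    Y_pca = PCA(n_components=2).fit_transform(X)                 # global, linear projection
    Y_tsne = TSNE(n_components=2, perplexity=30,
                  random_state=0).fit_transform(X)               # pairwise, non-linear

    # Plotting Y_pca and Y_tsne (colored by `color`) typically shows t-SNE preserving
    # local neighborhoods that the linear PCA projection folds on top of each other.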

Keywords

Dimensionality Reduction

Dimensionality reduction refers to the process of reducing the number of variables (features) used to represent each observation in a dataset. In the context of the video, dimensionality reduction is achieved through the t-distributed stochastic neighbor embedding (t-SNE) method to transform high-dimensional data into a lower-dimensional space, typically for visualization purposes.

t-SNE

t-SNE, short for T-Distributed Stochastic Neighbor Embedding, is a machine learning algorithm used for visualization by reducing the dimensionality of high-dimensional data sets. The video script explains t-SNE as a method that operates on a unit by unit level, preserving the local structure of the data in a lower-dimensional space, unlike PCA which works on a more global level.

Principal Component Analysis (PCA)

PCA is a statistical technique that transforms a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The script contrasts PCA with t-SNE, highlighting that PCA focuses on global variance while t-SNE is concerned with local structures and pairwise relationships.

Probability Distribution

A probability distribution is a mathematical function that provides the probabilities of each possible outcome in an experiment. In the video, the script describes how t-SNE uses a probability distribution to define the likelihood of choosing a neighbor in the high-dimensional space, which is essential for preserving local structures in the lower-dimensional space.

Gaussian Distribution

The Gaussian distribution, also known as the normal distribution, is a probability distribution that is used in the initial steps of t-SNE to model the probability of choosing a neighbor based on distance in the high-dimensional space. The script mentions that the probability of selecting a neighbor is proportional to the normal distribution of the distances.

KL Divergence

KL Divergence, or Kullback-Leibler divergence, is a measure of how one probability distribution diverges from a second, expected probability distribution. In the context of t-SNE, the script explains that KL Divergence is used to minimize the difference between the original high-dimensional similarities (p_ij) and the new low-dimensional similarities (q_ij).

Stochastic

Stochastic refers to a process involving randomness or probability. The script uses the term 'stochastic' in the name t-SNE to indicate that the method is based on random processes, particularly in the way it defines the probability of choosing a neighbor.

Embedding

In the context of the video, embedding refers to the process of mapping high-dimensional data into a lower-dimensional space. t-SNE is described as an embedding method that aims to preserve the local structure of the data while reducing its dimensionality.

Curse of Dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces. The script discusses this concept in relation to t-SNE, explaining that as dimensionality increases, the volume available at moderate distances grows so quickly that mid-level distances cannot all be represented faithfully in a lower-dimensional space.

Student's T-Distribution

Student's T-Distribution, also known as the T-distribution, is a type of probability distribution used in t-SNE for the lower-dimensional space. The script explains that this distribution has heavier tails than the normal distribution, allowing for a broader range of distances to be considered similar, which is necessary to overcome the limitations posed by the curse of dimensionality.
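
To make "heavier tails" concrete, one can compare how quickly the two standard densities decay with distance; a small check using scipy (the specific distances are arbitrary):

    from scipy.stats import cauchy, norm

    for r in (1, 2, 4):
        # Density at distance r relative to density at 0: how fast "similarity" decays.
        print(r, norm.pdf(r) / norm.pdf(0), cauchy.pdf(r) / cauchy.pdf(0))
    # The Gaussian ratio collapses toward zero far faster than the Cauchy ratio,
    # which is why the heavier-tailed kernel keeps moderate distances "similar".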

Highlights

Introduction to T-Distributed Stochastic Neighbor Embedding (t-SNE), a method for reducing the dimensionality of high-dimensional data sets.

Comparison between t-SNE and Principal Component Analysis (PCA), highlighting t-SNE's unique approach to dimensionality reduction.

Explanation of t-SNE's operation on a unit-by-unit level, preserving the relationships between data points on a pair-by-pair basis.

The concept of probability in selecting a neighbor data point based on distance in the high-dimensional space.

The formulation of probability as a function of distance, emphasizing the likelihood of choosing closer data points as neighbors.

Inclusion of all other data points in the probability function to account for the global context within the data set.

The properties required for the probability function, including negative correlation with distance and positive correlation with relative closeness.

Introduction of the specific function recommended by t-SNE authors, resembling the division of two Gaussian probability density functions.

Validation of the function's properties to ensure it meets the criteria for neighbor selection based on distance.

The dynamic selection of standard deviation (σ) for each data point to fit a certain number of neighboring data points within a Gaussian distribution.

The definition of similarity score (p_ij) as a balance between the probabilities of data points grouping together.

The challenge of preserving local structures in the lower-dimensional space while reducing dimensionality.

The mechanics of translating high-dimensional data points to a lower-dimensional space while preserving similarity.

The issue of the curse of dimensionality and its impact on the algorithm's ability to pack mid-level data points into a lower-dimensional space.

The solution to the curse of dimensionality by using a heavier-tailed Student's T-distribution with one degree of freedom in the lower-dimensional space.

The rationale behind using a Student's T-distribution to maintain similarity over a broader range of distances in the lower-dimensional space.

The use of KL Divergence to minimize the difference between the original high-dimensional similarities (p_ij) and the new low-dimensional similarities (q_ij).

The final step of finding the low-dimensional representations (y's) using gradient descent to minimize the KL Divergence.

The contrast between t-SNE's pair-by-pair approach and PCA's global variance approach, with t-SNE being more suitable for non-linear data sets.

The computational complexity of t-SNE compared to PCA, due to its focus on individual data point relationships.

Transcripts

play00:00

[Music]

play00:01

hey what's up everybody today we're

play00:04

going to be talking about a very cool

play00:05

method in machine learning and data

play00:07

science to reduce the dimensionality of

play00:10

a data set who has very very high

play00:11

dimensional data in a very different way

play00:14

than for example principal component

play00:16

analysis and at the end of the video

play00:17

we'll talk about pros and cons of those

play00:19

two methods

play00:20

but this method is officially called the

play00:23

T distributed stochastic neighbor

play00:24

embedding that's a very long title I

play00:27

couldn't even fit all of the words

play00:28

without these little abbreviations here

play00:30

so people typically call it t-SNE so we

play00:34

have t, S, N, E, t-SNE and as we were

play00:39

saying before the idea is that we come

play00:41

to this method with a bunch of data

play00:43

points in a pretty high dimensional

play00:44

space

play00:45

so we have n data points X1 X2 all the

play00:48

way to xn and each of these are vectors

play00:50

in a very high dimensional space you can

play00:53

think of dozens or hundreds or even

play00:54

thousands of dimensions for each of

play00:56

these vectors

play00:57

and our goal is to eventually get a set

play01:00

of n vectors y1 Y2 all the way to YN who

play01:04

are in a much much lower dimensional

play01:05

space typically a two or three

play01:07

dimensional space which lets us do

play01:10

things like visualize this data which we

play01:12

couldn't do before

play01:14

so we're going to start by asking a

play01:15

series of questions the key Insight the

play01:18

key idea behind t-SNE and what makes it

play01:20

fundamentally different from PCA is that

play01:23

t-SNE is operating on a unit by unit level

play01:25

so it's looking at X1 and X2 and trying

play01:28

to preserve something about the

play01:30

relationship between those two data

play01:32

points in your original data

play01:34

it's trying to preserve that

play01:35

relationship in the new data which is

play01:37

going to be in lower Dimensions so it's

play01:39

working on a pair-by-pair basis whereas

play01:42

PCA is working on a more Global level

play01:43

we'll say a lot more on that later

play01:46

the first question we're going to ask

play01:48

here is given some data point in this

play01:50

set of n data points x i

play01:52

what's the probability that it would

play01:54

pick some other data point x j as its

play01:57

neighbor

play01:58

this kind of seems like a vague question

play02:00

but we can quantify we can Define that

play02:02

quantity as P J given I so this notation

play02:05

says given that you are the data point I

play02:08

what's the probability that you would

play02:10

pick the data point J as your neighbor

play02:12

now there's lots of ways you could

play02:14

Define this but one very natural way is

play02:17

going to be making it a function of the

play02:20

distance between x i and XJ again in

play02:22

this High dimensional space

play02:24

so we're going to say the probability

play02:26

that if I'm I I would pick J as my

play02:28

neighbor is going to be some function I

play02:30

don't know what the function looks like

play02:31

yet but it's going to be some function

play02:33

of the distance between x i and x j the

play02:36

smaller this distance is the more likely

play02:38

I would want to pick that one as my

play02:40

neighbor because it's closer to me

play02:42

the further this data point x j is the

play02:44

less likely I want to pick it as my

play02:46

neighbor

play02:48

now this formulation almost makes sense

play02:51

but it ignores the fact that there's all

play02:53

these other data points going on in our

play02:54

data there's not just x i and XJ imagine

play02:57

your x i and you're being asked who are

play02:59

you going to pick as your neighbor well

play03:00

I need to know information about all the

play03:02

other data points are there others that

play03:03

are way closer to me is XJ actually the

play03:06

closest to me and everyone else is

play03:07

further away

play03:09

so while this attempt number one does

play03:11

start getting us to the right place

play03:12

we're going to need to expand what our

play03:14

function is based on so we're going to

play03:16

say this probability is now a function

play03:18

of I've just made a shorter notation

play03:20

here d i j squared is just this distance

play03:23

between I and J squared and it's going

play03:26

to include information about the same

play03:28

distances from I to all of the other

play03:31

data points in our data so all distances

play03:33

ik where K is not equal to I or not

play03:36

equal to J and whatever function this is

play03:38

is going to need to have a couple

play03:39

properties and I've noticed now I've got

play03:41

these reversed so we're doing this in

play03:43

real time let me fix this so this is

play03:46

going to be negative and this is going

play03:48

to be positive so the first property is

play03:50

that the bigger the distance gets

play03:51

between I and J the lower we're going to

play03:54

have the probability I would pick that

play03:55

as my neighbor it's going to be

play03:56

negatively correlated and the other

play03:58

property is for all the other data

play03:59

points that I'm not considering that are

play04:01

not equal to irj if the distance to them

play04:03

gets bigger and bigger and bigger then

play04:05

everything else held constant J becomes

play04:07

more and more attractive because it's

play04:09

relatively closer to me and so my

play04:11

probability should go up in that

play04:13

situation so what might such a function

play04:15

f look like

play04:17

and the authors of t-SNE recommend

play04:20

using this function here which is going

play04:22

to seemingly come out of nowhere but

play04:23

I'll try to motivate it as best I can

play04:25

so they say this function pji the

play04:27

probability I would pick J is my

play04:29

neighbor if I am the data point I is

play04:31

going to be this very ugly looking form

play04:33

but if you stare at it for a little bit

play04:35

and try to remember what it looks like

play04:37

from your introductory statistics

play04:39

courses this exactly looks like the

play04:41

division of two normal probability

play04:44

density functions two gaussian

play04:46

probability density functions and that's

play04:48

exactly where the motivating comes from

play04:50

so the paper authors say that we're

play04:53

going to assign these probabilities PJ

play04:55

given I proportional to the density of a

play04:59

normal

play05:00

of a normal distribution of the distance

play05:06

so the main things we should check first

play05:08

are do the properties you want out of

play05:11

this function based on what we just came

play05:12

up with here actually hold well this

play05:15

distance here so this guy here is the

play05:17

distance between x i and x j squared if

play05:20

this increases then because there's a

play05:22

negative sign in front of it the whole

play05:23

exponent here is going to decrease and

play05:26

if we take e to the power of an Ever

play05:28

decreasing number then the entire

play05:30

numerator is going to decrease so we

play05:31

exactly have the first criteria and we

play05:34

can see it we also have the second

play05:35

criteria met if any of the other data

play05:37

points who are not equal to IR J in the

play05:40

denominator if they increase then the

play05:43

entire denominator is going to decrease

play05:45

and therefore the entire probability is

play05:47

going to increase exactly as we wanted

play05:49

so while this form itself May seemingly

play05:52

come out of nowhere and to be quite

play05:53

honest with you we probably could have

play05:55

chosen different forms this was just the

play05:57

form that was chosen perhaps for

play05:59

mathematical convenience in the actual

play06:01

algorithm the properties we want do in

play06:03

fact hold and that's all we really care

play06:05

about for right now but the motivating

play06:07

factor behind this was that we're going

play06:09

to assume that as the distance moves

play06:11

further away the probability that you

play06:13

would choose data point J if you were

play06:15

data point I is going to be proportional

play06:17

to the normal distribution of the

play06:19

distances between those two and also the

play06:21

distances between you and all the other

play06:23

data points that exist out there

play06:25

so the next question we're going to ask

play06:27

is well this was telling us about data

play06:30

points I and data point J let's zoom out

play06:32

a little bit and say if we were to pick

play06:34

a point at random from all of our n data

play06:36

points what's the probability that I and

play06:39

J end up grouped together

play06:42

well this could happen in two different

play06:44

situations the first is that the random

play06:46

point we chose is data point I

play06:49

that happens with the 1 over n

play06:51

probability because all the data points

play06:52

have equal probability of being picked

play06:54

in this situation and given we chose

play06:57

data point I what's the probability that

play06:58

it would choose data point J as its

play07:00

neighbor well that's exactly what we

play07:01

spent so much time coming up with here

play07:03

and that's exactly the quantity PJ given

play07:06

I

play07:07

now there's one other way that we could

play07:09

have inj grouped and that would be if

play07:10

the data point we pick in this little

play07:12

experiment here was data point J

play07:15

and if we picked data point J which

play07:17

happens with the one over n probability

play07:18

then the probability that inj would be

play07:21

grouped is now the same as the

play07:22

probability that data point J would pick

play07:25

data point I which is going to be just

play07:27

the opposite here so given I'm data

play07:29

point J what the probability I picked

play07:31

data point I

play07:32

and now these two things together

play07:35

if we do this multiplication in this

play07:37

addition we're going to Define as p i j

play07:39

which is going to be some form of

play07:41

similarity score

play07:43

between the data points I and J because

play07:46

that represents the probability that

play07:48

they would end up together in a group

play07:50

and this captures both it's kind of a

play07:52

balance in fact it's a literal averaging

play07:55

between this pji and this p i j so we

play07:58

can put p j i plus p i j and we divide

play08:04

that by the number of data points there

play08:05

are and this gives us a score which we

play08:07

can interpret as a probability of I and

play08:09

J being similar to each other and this

play08:12

is all going to be based on their

play08:13

physical closeness to each other in this

play08:15

High dimensional space

play08:17

now there's a note that I kind of

play08:18

glanced over here some of you might be

play08:20

wondering how do I pick this Sigma I

play08:22

because when you have a gaussian

play08:23

distribution it's defined by a mean and

play08:25

a standard deviation and I didn't really

play08:27

talk about how that standard deviation

play08:29

which based on the subscript I seems to

play08:31

be a function of I gets set

play08:34

there's a ton more details in the

play08:35

original paper which I'll link in the

play08:36

description below but here's the general

play08:38

intuition

play08:39

we're going to pick that standard

play08:41

deviation dynamically for each of our

play08:42

data points from 1 to n

play08:44

so that it's going to fit a certain

play08:47

number in general it's going to fit a

play08:49

certain number of neighboring data

play08:50

points in some number of standard

play08:52

deviations of that gaussian distribution

play08:54

and so that seems a little bit abstract

play08:56

but let me make it a little bit more

play08:58

concrete here now let's say you're

play08:59

looking at some data point I and you're

play09:01

looking at all the other data points and

play09:02

what distance they have away from your

play09:04

data point I so this vertical line here

play09:06

is going to be a distance zero and I

play09:08

didn't even need the negative side of

play09:09

this plot should have really just drawn

play09:11

the positive side but let's say you have

play09:13

four other data points that are right

play09:15

here

play09:16

so you would use this red distribution

play09:18

here if your rule was something like

play09:21

within the first two standard deviations

play09:23

of my normal distribution which would

play09:25

cut off around here I want four data

play09:27

points and that's exactly what we've

play09:29

been able to achieve within this first

play09:30

two standard deviations you have four

play09:32

data points now here's a different case

play09:34

you have for a different data point x

play09:36

sub I that you might be looking at let's

play09:38

say that its neighbors are here here

play09:41

here and here

play09:44

now you can't use that same red normal

play09:46

distribution because if you were to then

play09:48

the first two standard deviations you

play09:50

only have two data points there's the

play09:51

sparsity issue going on and so we allow

play09:55

this Sigma sub I to be dynamic so that

play09:57

we can use this black normal

play09:59

distribution here because within its

play10:01

first two standard deviations we do have

play10:03

now those four neighboring data points

play10:06

so there are a lot more details I am

play10:08

oversimplifying it a little bit here but

play10:10

the general idea is that if around

play10:13

whatever Point you're looking at

play10:15

there's a lot of very close neighbors so

play10:17

it's in a very dense part of the space

play10:19

then you can set Sigma sub I to be low

play10:21

you don't need a really big standard

play10:23

deviation for your normal distribution

play10:25

if on the other hand your data point x i

play10:27

is in a very sparse part of the space

play10:28

where it's here but its closest

play10:31

neighbors are like really far away

play10:33

which is the situation we were looking

play10:35

at with this black points here then you

play10:37

want to set that standard deviation

play10:38

appropriately to be higher so let's take

play10:40

stock of where we're at so far we have p

play10:42

i j which is going to give us some

play10:44

measure of similarity between the data

play10:46

point I and the data point J

play10:48

and now we can start talking about how

play10:49

do we get to that lower dimensional

play10:51

space

play10:52

the goal is to learn a low dimensional

play10:55

representation again in t-SNE that

play10:57

usually means a two or three dimensional

play10:59

representation because people typically

play11:01

want to do this so that they can

play11:03

actually visualize their data on some

play11:05

kind of plot

play11:06

so we're going to translate our X1 X2

play11:09

all the way to xn which are high

play11:10

dimensional into y1 Y2 all the way to YN

play11:13

which are low dimensional for example

play11:15

they could be in two-dimensional space

play11:16

and we want to do it in such a way that

play11:19

we're going to preserve the similarity

play11:21

that we just computed which are again

play11:23

these p i JS between data points x i and

play11:28

XJ so if the similarity between data

play11:30

points X1 and X2 was like 0.1 then we

play11:33

want to do our best to preserve that

play11:35

similarity in the data points y1 and Y2

play11:37

even though they're in a lower

play11:39

dimensional space and this is probably

play11:41

the most important part of the video the

play11:42

intuition the goal we're trying to do

play11:44

the how is really just more mechanics

play11:46

but the power behind t-SNE is that it's

play11:50

looking at pairs of data points and

play11:52

seeing what similarity we gave it based

play11:54

on this formulation here and saying that

play11:56

you know what I know we want to put the

play11:58

data in a lower dimensional space but I

play12:00

don't want to sacrifice on these

play12:02

similarities I want to preserve these

play12:04

local structures between my data points

play12:07

that were there before I want those

play12:08

local structures to still be there right

play12:10

now

play12:12

and now we can get into the actual

play12:13

mechanics how are we going to actually

play12:15

do this well why don't we just try the

play12:17

most straightforward thing first based

play12:18

on what we've already used in this

play12:20

method so far and that's going to be

play12:22

just like with the X's how we measured

play12:25

this similarity using this gaussian

play12:27

framework why don't we just do the exact

play12:30

same thing with the Y's so just like for

play12:32

the X's we measure the similarities

play12:34

using the gaussian just do the same

play12:36

thing with the Y's measure them using

play12:37

the gaussian and now you'll have some

play12:39

values q i j which are the similarities

play12:41

between the Y's and now all you need to

play12:43

do is make sure the qijs and the pijs

play12:46

are similar to each other

play12:48

now that is a great first attempt and

play12:52

avoids any unnecessary complications but

play12:54

we should talk about a very subtle thing

play12:56

that goes wrong when we do that and this

play12:58

is all under the umbrella of the curse

play13:00

of dimensionality which we knew was

play13:02

going to come along when we're dealing

play13:03

with very high dimensional data and here

play13:05

is how it rears its ugly head

play13:07

to demonstrate this I'm just going to

play13:09

show in two and three dimensions so

play13:11

we're going to treat our two-dimensional

play13:12

space as our Target space for t-SNE and

play13:15

we're going to say the three-dimensional

play13:16

space is the original space the data

play13:18

points are in of course your data points

play13:21

are probably going to be in a way higher

play13:23

dimensional space but this is just for

play13:24

illustration

play13:25

what I want to do here in two Dimensions

play13:27

because I can draw in two dimensions for

play13:29

you all not well apparently but what I'm

play13:32

going to do in two Dimensions is draw an

play13:34

area of two-dimensional space who has a

play13:37

distance of one from the origin and

play13:39

that's this little green circle here

play13:42

and we're going to call this the nearby

play13:43

area all these points are going to be

play13:45

nearby to whatever my target data point

play13:47

in the origin is

play13:49

this Blue Area we're going to call the

play13:51

mid distance area so it's between a

play13:53

distance of one and two

play13:55

and what I want to do here is a very

play13:56

simple equation I want to divide the

play13:59

area of the mid region which is this

play14:01

outer blue ring by the area of the near

play14:04

region which is this inner green circle

play14:06

now we know how to do it just pi r

play14:08

squared is the area of a circle and so

play14:10

the area of the mid region is going to

play14:12

be the area of this big circle minus the

play14:14

area of the small circle which is the

play14:16

numerator and the area of the nearby is

play14:19

just going to be the area of the small

play14:21

circle or Pi 1 squared and we get that

play14:22

that ratio is equal to 3.

play14:26

so what this means is that the mid size

play14:28

the area of the mid region where those

play14:31

data points were a middle distance away

play14:32

from me is three times the area of the

play14:35

data points we're allowed to be a near

play14:37

distance from me now stay with me here

play14:39

let's do this for one dimension up in

play14:41

this case we're not dealing with areas

play14:42

we're dealing with volumes and I haven't

play14:44

even dared to draw it but you can

play14:46

visualize a sphere of points who all

play14:48

have a distance of one or less from the

play14:50

origin or the unit sphere

play14:52

is going to be what we're calling the

play14:54

volume of the nearby region in three

play14:56

dimensions very analogous to the area of

play14:58

the nearby region in two dimensions and

play15:01

the volume of the mid in three

play15:02

dimensions is going to be the area

play15:04

between a sphere who has distance two

play15:07

just like we had this circle with

play15:09

distance two everything between the

play15:12

outside of that and the outside of the

play15:14

inner sphere

play15:16

so we can do a very simpler formula we

play15:17

appeal to the volume of a sphere here

play15:19

and we get that the ratio is equal to

play15:21

seven

play15:23

and now here's the punch line

play15:26

in two-dimensional space we found that

play15:29

the area of the mid data points was

play15:32

three times the area of the near data

play15:34

points but in three-dimensional space

play15:36

the volume in which the mid-level data

play15:39

points are allowed to occupy is a whole

play15:41

seven times bigger than the volume of

play15:43

the nearby data points are allowed to

play15:45

occupy

play15:46

and where this is all leading up to how

play15:49

we can verbalize the problem here is

play15:50

exactly like this

play15:52

when we're going from a high dimensional

play15:54

space to a low dimensional space in this

play15:56

case just kind of trivializing it going

play15:58

from a three-dimensional space to a

play16:00

two-dimensional space we're trying to

play16:02

take all these mid-level data points

play16:04

where a mid distance away from whatever

play16:05

Target data point we care about

play16:07

and that has you can say a size of seven

play16:10

and we're trying to pack that into a

play16:13

mid-size area in the lower dimensional

play16:16

space who has a you can say size of

play16:18

three

play16:19

so the problem is that we're trying to

play16:20

pack all the midpoints in high

play16:22

dimensional space the midpoints in a

play16:24

high dimensional space into a much much

play16:26

smaller space in a lower Dimension and

play16:29

the problem doesn't seem so pronounced

play16:31

with just two and three dimensions but

play16:32

imagine that your data point has just 10

play16:34

Dimensions which is not crazy by any

play16:36

data science standards typically when we

play16:39

use t-SNE we're trying to embed like

play16:40

1000 or 100 dimensional embeddings

play16:43

even if your data just has 10 Dimensions

play16:46

this problem becomes such that you're

play16:48

trying to take a mid-size space and I'm

play16:51

kind of using this word loosey-goosey

play16:53

but what I'm really trying to say is the

play16:56

area in that higher dimensional space

play16:58

which is occupied by the data points who

play17:00

are a moderate distance away from you

play17:02

and with a 10-dimensional space that has

play17:04

a 1023 ratio with the nearby area and as

play17:09

we saw in a two-dimensional space that

play17:11

has a ratio of just three

play17:13

so we run into just this kind of like

play17:15

I'm running out of space I can't take

play17:17

all the data points who have a certain

play17:18

characteristic in a high dimension and

play17:20

pack them into the same characteristic

play17:22

in the low Dimension there's just not

play17:25

enough room in that low dimensional

play17:26

space

play17:28

and the way we solve for that is

play17:30

actually changing up our definition of

play17:32

similarity in that lower dimensional

play17:34

space and we do that in a visual way so

play17:37

in the high dimensional space we are

play17:38

using this normal distribution where the

play17:41

x-axis here is distance away from your

play17:43

target data point

play17:45

and the y-axis is telling you what's the

play17:47

probability that I would pick that data

play17:49

point as my neighbor

play17:50

and as we said this drops off as a

play17:52

gaussian the further the data point away

play17:54

the lower that we have this probability

play17:57

you pick it as your neighbor and the

play17:58

trend is gaussian

play18:00

now as we said let's focus on a very

play18:02

specific area of this plot between

play18:04

distances of one and two which is what

play18:05

we were classifying as moderate

play18:07

distances in the high dimensional space

play18:09

now if it's a distance of one away then

play18:11

the probability I would pick that as my

play18:13

neighbor is around here and if it's a

play18:15

distance of 2 away the probability I

play18:16

would pick that as my neighbor is like

play18:18

right here but as we said in the low

play18:21

dimensional space we don't have enough

play18:22

room and so the way we solve that is

play18:25

just by changing expanding our

play18:27

definition of what it means to be in a

play18:28

mid-dimensional space the the area

play18:30

between one and two distance is just too

play18:33

restrictive I need a lot more range to

play18:36

work with in order to fit all these data

play18:38

points that you want me to fit and so

play18:40

what we might say for example is I want

play18:42

the whole range between distances 1 and

play18:44

4.

play18:45

and what that implies is that whatever

play18:47

probability we had for picking J as our

play18:50

neighbor if its distance was two we now

play18:52

need to accept that same probability if

play18:54

the distance is four because we're

play18:56

expanding the definition of what it

play18:58

means to have a mid-dimension since we

play19:00

didn't have enough space in that lower

play19:02

dimensional space

play19:04

so what that means is that if we

play19:06

continue this line which was at 2 we're

play19:08

now asking for that line to be at four

play19:12

and what this implies even though I

play19:14

really should have drawn this more

play19:15

exaggerated what this implies is that

play19:17

whatever distribution we're now using to

play19:20

translate distances to scores or

play19:22

probabilities is going to need to have

play19:24

much heavier Tails than the normal

play19:25

distribution because if we try to draw

play19:28

this thing now it can't fall off at the

play19:30

same rate as the normal distribution it

play19:33

needs to fall off at a much heavier tail

play19:35

it needs to maintain a higher

play19:37

probability for longer and longer and

play19:40

longer

play19:41

and therefore we can't use a normal

play19:42

distribution at all because its tails

play19:44

fall off at the same rate no matter what

play19:46

standard deviation you're talking about

play19:48

so we switch in our lower dimensional

play19:51

space to a heavier tailed students T

play19:54

distribution with a degree of Freedom

play19:57

equal to one which is actually the

play19:59

same thing as a Cauchy distribution

play20:01

now you don't need to at all get into

play20:03

the nitty-gritty of what these

play20:05

distributions are and their PDFs or

play20:06

anything really I just want you to focus

play20:08

on the picture here

play20:10

we're just picking a distribution who

play20:12

has a heavier tail than the normal

play20:14

distribution and the reason we're doing

play20:16

that is to deal with this curse of

play20:17

dimensionality issue

play20:19

and the main thing you need to realize

play20:21

between these two plots is that whatever

play20:23

the probability was at 2 in the higher

play20:26

dimensional space we're asking the

play20:28

probability to remain there for longer

play20:30

and longer and longer to accommodate

play20:32

higher and higher and higher distances

play20:33

away to have the same property

play20:36

to have the same similarity as we did

play20:39

for these smaller distances in the high

play20:42

dimensional space

play20:43

so this honestly was the part of

play20:45

learning about this algorithm that took

play20:46

the most time for me to wrap my head

play20:48

around I was like what is going on why

play20:50

are we not just using the normal

play20:51

distribution the seems to have come out

play20:53

of nowhere but I'm hoping that perhaps I

play20:56

over explained it here and it was pretty

play20:57

obvious too but I would rather over

play20:59

explain than under explain because you

play21:01

can always just seek the video to where

play21:03

you want but hopefully this helped

play21:05

motivate why we are using a student's T

play21:07

distribution and the other reason I

play21:09

really wanted to get that across is

play21:11

because it's in the name of this whole

play21:12

thing it's called the T distributed

play21:14

stochastic neighbor embedding and while

play21:16

we're on this page let's make sure we

play21:18

understand where all these terms come

play21:20

from we know exactly about the

play21:21

t-distribution part now uh stochastic

play21:24

because this is all based on random

play21:26

processes we're dealing with

play21:27

probabilities here neighbor because

play21:30

we're dealing on a neighbor by neighbor

play21:31

basis That's the basis on which this

play21:33

algorithm works and embedding because

play21:36

we're trying to get our high dimensional

play21:37

embeddings which we are at here to a

play21:40

much lower dimensional space which is

play21:43

where we we are going right now

play21:46

so we're almost done here

play21:49

now that we've accepted that we're going

play21:51

to use a one degree of Freedom students

play21:54

t distribution more simply known as a

play21:56

Cauchy distribution we can go ahead and

play21:58

just do the same form as we had a couple

play22:01

of pages ago so this form while it looks

play22:04

complicated is really just mirroring the

play22:07

pij form we had before

play22:09

so now we're saying qij which is going

play22:11

to be the similarity or the probability

play22:13

between the if and jth of your new your

play22:17

lower dimensional embeddings y i y j is

play22:21

going to be defined in this way and the

play22:22

reason the form looks like this is

play22:24

because that's the probability density

play22:25

function of the Cauchy distribution or

play22:28

the student's T distribution with one

play22:30

degree of freedom and the denominator

play22:32

we're of course normalizing that by

play22:34

dividing by all the other data points

play22:36

that exist in there

play22:38

the final step is that well now we have

play22:42

our pijs which tell us how similar the

play22:44

higher dimensional embeddings were for I

play22:47

and J we have qij which tells us how

play22:49

similar our lower dimensional embeddings

play22:51

were for data points I and J we want

play22:54

them to be close to each other because

play22:56

that's the whole point of t-SNE is

play22:58

that we want to make sure that if pij

play23:00

was a certain number qij should be as

play23:01

close to that as possible

play23:03

and we can show that pij and qij are

play23:06

both probability distributions and what

play23:07

does it mean for us to enforce the two

play23:10

probability distributions are close to

play23:11

each other we can use our good friend

play23:13

the KL Divergence

play23:15

and we have a whole video on KL

play23:17

Divergence I'll post that below but in a

play23:19

nutshell it's just telling us what's the

play23:20

level of difference between two

play23:22

probability distributions we want the KL

play23:24

Divergence between the P which is the

play23:27

original High dimensional similarities

play23:29

and Q which is the new low dimensional

play23:31

similarities to be very small ideally

play23:34

you want to be zero but we can't have

play23:36

everything we want because we're going

play23:37

we're losing information by going to

play23:39

load Dimensions so we want it to be as

play23:41

low as possible

play23:43

and finally the way we actually find

play23:45

these y's is via gradient descent so you

play23:47

can think of this as our loss function

play23:49

and we could just take the derivative of

play23:51

this loss function with respect to all

play23:52

these y I's and yjs and move the Y I's

play23:56

and Y J's until the KL Divergence is as

play23:58

low as possible

play24:00

and so that's it that is the idea behind

play24:03

t-SNE the key idea even though there

play24:06

was seemingly a lot of math here the key

play24:07

idea is really that t-SNE is looking on

play24:10

an individual data point by data point

play24:12

basis and here we can have a quick

play24:13

discussion on how it's different than

play24:14

PCA t-SNE is more complicated than PCA it's

play24:18

in it's asking for more it's saying not

play24:20

only do I care about General Trends in

play24:22

your data set like PCA does PCA remember

play24:24

is talking about what is the direction

play24:26

of the maximum variance of your data set

play24:28

I'm going to choose that as my first

play24:29

principal component then what's the next

play24:31

dimension of the highest variance of

play24:33

your data set that's kind of looking on

play24:34

a global level about what's the variance

play24:36

of your entire data set t-SNE is much

play24:39

more zoomed in it's looking on a

play24:40

individual I and J data point level and

play24:43

trying to enforce that local structure

play24:45

there so it typically does a lot better

play24:47

job of clustering of making sure that if

play24:50

two data points were close in the high

play24:51

dimension they're close in this lower

play24:53

dimension

play24:54

but it also comes with that added

play24:55

computational complexity because we do

play24:58

need to care about things on a

play24:59

pair-by-pair basis it's going to take a

play25:01

lot longer for t-SNE to run on the

play25:03

same data set than PCA

play25:05

because PCA at the end of the day even

play25:07

though it is a bit complicated is just

play25:09

doing a bunch of these linear

play25:10

Transformations on your data and that

play25:12

brings us the final point for this video

play25:14

is that PCA is a linear method as much

play25:17

as it's going on in PCA it's just doing

play25:18

linear Transformations it's assuming

play25:20

linear correlations and so it works

play25:22

really well for linear data

play25:24

but t-SNE has no such assumptions it's

play25:27

again just looking at pairs of data

play25:28

points makes no assumption about

play25:30

linearity and therefore it can work a

play25:32

lot better when your data set is not

play25:34

linear so hopefully you enjoyed this

play25:36

video hopefully you learned a lot better

play25:37

than you did before about how t-SNE works

play25:40

if you have any questions or comments

play25:41

please leave them in the section below

play25:43

thank you so much for watching like And

play25:45

subscribe for more videos just like this

play25:46

I'll catch all you wonderful people next

play25:48

time


Related Tags
T-SNE, Dimensionality Reduction, Machine Learning, Data Science, PCA Comparison, Stochastic Neighbor, Embedding Technique, Local Structure, Data Visualization, Gaussian Distribution, KL Divergence