t-SNE Simply Explained
Summary
TLDR: This video explores t-Distributed Stochastic Neighbor Embedding (t-SNE), a powerful method in machine learning for reducing high-dimensional data to two or three dimensions for visualization. Unlike PCA, t-SNE focuses on preserving local structure by comparing data points pair by pair. The video delves into the algorithm's mechanics, including the use of a Gaussian distribution in the high-dimensional space and a heavier-tailed distribution in the low-dimensional space to combat the curse of dimensionality. It concludes with a comparison to PCA, highlighting t-SNE's non-linearity and computational complexity.
Takeaways
- 📚 The video introduces a method called t-Distributed Stochastic Neighbor Embedding (t-SNE), used for reducing the dimensionality of high-dimensional datasets.
- 🔍 t-SNE is distinct from Principal Component Analysis (PCA) in that it operates on a pair-by-pair basis, focusing on preserving local relationships between data points.
- 📈 The method begins by defining the probability of one data point choosing another as its neighbor based on their distance in the high-dimensional space.
- 📉 t-SNE uses a function for calculating probabilities that resembles the division of two Gaussian probability density functions, aiming to maintain local structures in the lower-dimensional representation.
- 🌐 The algorithm dynamically selects a standard deviation (σ) for each data point to fit a certain number of neighboring data points within a specified number of standard deviations.
- 🔗 t-SNE creates a similarity score (p_ij) that represents the likelihood of data points i and j being grouped together, based on their closeness in the high-dimensional space.
- 📉 The goal of t-SNE is to find a lower-dimensional representation (y1, y2, ..., yn) that preserves the similarity (p_ij) between data points as computed in the high-dimensional space.
- 🔑 The challenge of dimensionality reduction is addressed by using a Student's t-distribution with a degree of freedom of one (a Cauchy distribution) in the lower-dimensional space, which has heavier tails than a Gaussian distribution.
- 📝 The algorithm employs the KL Divergence to measure the difference between the original high-dimensional similarities (p_ij) and the new low-dimensional similarities (q_ij), aiming to minimize this difference.
- 🤖 t-SNE uses gradient descent to adjust the positions of the lower-dimensional points (y_i, y_j) to minimize the KL Divergence, thus finding an optimal representation.
- 🚀 t-SNE is more computationally complex than PCA due to its focus on individual data point relationships and is particularly effective for datasets that are not linearly separable.
Q & A
What is the full name of the method discussed in the video?
-The full name of the method discussed in the video is T-Distributed Stochastic Neighbor Embedding, often abbreviated as t-SNE.
What is the primary goal of t-SNE?
-The primary goal of t-SNE is to reduce the dimensionality of a high-dimensional data set to a much lower-dimensional space, typically two or three dimensions, to enable visualization of the data.
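As a rough illustration of that goal (not code from the video), here is how such a 2-D reduction is typically run with scikit-learn's TSNE; the dataset and parameter values below are arbitrary placeholders:

```python
# Minimal usage sketch (assumed setup, not from the video): embed a
# high-dimensional dataset into 2-D with scikit-learn's t-SNE for plotting.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))      # 500 points in a 50-dimensional space

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Y = tsne.fit_transform(X)           # shape (500, 2), ready to scatter-plot
print(Y.shape)
```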
How does t-SNE differ from Principal Component Analysis (PCA) in terms of its approach to dimensionality reduction?
-t-SNE operates on a pair-by-pair basis, focusing on preserving the local relationships between individual data points in the new lower-dimensional space, whereas PCA works on a global level, focusing on directions of maximum variance in the data set.
What is the significance of the probability P(J|I) in t-SNE?
-P(J|I) represents the probability that, given data point I, it would choose data point J as its neighbor in the original high-dimensional space. It is a key component in defining the local relationships between data points.
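As a sketch of that quantity (my own illustration, not the video's code), the conditional probability can be computed from squared distances with a Gaussian kernel and a per-point bandwidth σ_i:

```python
# Hedged NumPy sketch of p_{j|i}: a Gaussian kernel on squared distances from
# point i, normalized over all other points k != i. sigma_i is assumed given.
import numpy as np

def p_j_given_i(X, i, sigma_i):
    """Probability that point i would pick each other point as its neighbor."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)     # squared distances from x_i
    w = np.exp(-d2 / (2.0 * sigma_i ** 2))   # Gaussian similarity
    w[i] = 0.0                               # a point never picks itself
    return w / w.sum()                       # normalize over k != i
```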
How does t-SNE address the 'curse of dimensionality' when mapping high-dimensional data to a lower-dimensional space?
-t-SNE addresses the curse of dimensionality by using a heavier-tailed distribution, specifically a Student's t-distribution with one degree of freedom (also known as a Cauchy distribution), to maintain higher probabilities for longer distances in the lower-dimensional space.
What role does the KL Divergence play in the t-SNE algorithm?
-The KL Divergence is used as a loss function in t-SNE to measure the difference between the original high-dimensional similarities (p_ij) and the new low-dimensional similarities (q_ij). The goal is to minimize the KL Divergence to ensure that the low-dimensional representation preserves the local structures of the high-dimensional data.
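In code, this loss is the standard KL divergence applied to the pairwise similarity matrices; a minimal sketch (not the video's code), assuming P and Q are already normalized to sum to one:

```python
# Hedged sketch of the t-SNE loss: KL(P || Q) over all pairs, with a small
# epsilon to avoid log(0). P and Q are assumed to be valid distributions.
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    P = np.clip(P, eps, None)
    Q = np.clip(Q, eps, None)
    return np.sum(P * np.log(P / Q))
```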
How does t-SNE handle the dynamic setting of the standard deviation σ_i for each data point?
-t-SNE dynamically sets the standard deviation σ_i for each data point based on the number of neighboring data points within a certain number of standard deviations of the Gaussian distribution. This allows t-SNE to adapt to different densities in the data, with lower σ_i for denser regions and higher σ_i for sparser regions.
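In the original paper this adaptation is done by binary-searching each σ_i until the entropy of the neighbor distribution matches a user-chosen perplexity; a simplified sketch of that idea (assuming one point at a time and precomputed squared distances):

```python
# Simplified sketch (assuming the paper's perplexity criterion) of choosing
# sigma_i by binary search: larger sigma -> flatter distribution -> higher
# entropy, so we shrink sigma when the entropy overshoots the target.
import numpy as np

def sigma_for_point(d2_i, perplexity=30.0, tol=1e-5, max_iter=50):
    """d2_i: squared distances from point i to every other point (i excluded)."""
    target = np.log2(perplexity)
    lo, hi, sigma = 1e-10, 1e10, 1.0
    for _ in range(max_iter):
        p = np.exp(-d2_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target) < tol:
            break
        if entropy > target:                  # too flat: shrink sigma
            hi = sigma
            sigma = (lo + sigma) / 2.0
        else:                                 # too peaked: grow sigma
            lo = sigma
            sigma = (sigma + hi) / 2.0 if hi < 1e10 else sigma * 2.0
    return sigma
```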
What is the motivation behind using the Gaussian distribution to define probabilities in the high-dimensional space?
-The motivation behind using the Gaussian distribution is that it naturally incorporates the distance between data points into the probability calculation, with closer points having higher probabilities of being chosen as neighbors and more distant points having lower probabilities.
How does t-SNE ensure that the local structures between data points are preserved in the lower-dimensional space?
-t-SNE ensures the preservation of local structures by minimizing the KL Divergence between the original high-dimensional similarities and the new low-dimensional similarities, thus maintaining the relative closeness of data points that were close in the original space.
What computational method is used to find the low-dimensional representations (y_i) in t-SNE?
-Gradient descent is used to find the low-dimensional representations (y_i) in t-SNE. The algorithm iteratively adjusts the positions of the data points in the lower-dimensional space to minimize the KL Divergence between the original and new similarity measures.
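A bare-bones sketch of one such update (my illustration, not the video's code), using the closed-form t-SNE gradient; the momentum and early-exaggeration tricks from the paper are omitted:

```python
# Hedged sketch of a single gradient-descent step: recompute Q from the
# current low-dimensional points Y, then move Y down the KL gradient
# dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^(-1).
import numpy as np

def tsne_gradient_step(Y, P, learning_rate=100.0):
    diff = Y[:, None, :] - Y[None, :, :]          # y_i - y_j, shape (n, n, d)
    dist2 = np.sum(diff ** 2, axis=2)
    inv = 1.0 / (1.0 + dist2)                     # Student-t / Cauchy kernel
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                           # low-dimensional similarities
    grad = 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)
    return Y - learning_rate * grad
```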
Outlines
📚 Introduction to t-SNE for Dimensionality Reduction
The video introduces a method called t-Distributed Stochastic Neighbor Embedding (t-SNE), used in machine learning and data science to reduce the dimensionality of high-dimensional datasets. Unlike Principal Component Analysis (PCA), t-SNE operates on a pair-by-pair basis, aiming to preserve the local relationships between data points in a lower-dimensional space, typically two or three dimensions, for visualization purposes. The method begins by defining the probability of one data point choosing another as its neighbor based on their distance in the high-dimensional space.
🔍 Defining Neighbor Probabilities in High-Dimensional Space
This paragraph delves into how t-SNE quantifies the likelihood of data points being neighbors based on their distances in high-dimensional space. It explains the formulation of probabilities using a Gaussian distribution, where the probability of choosing a neighbor decreases with increasing distance. The function accounts for all other data points in the dataset, ensuring that the selection of a neighbor is relative to the local context of each data point's position.
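Written out in the notation of the original paper (which the video paraphrases), the conditional and symmetrized probabilities are:

```latex
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
```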
📉 Adjusting Probabilities for Dynamic Standard Deviation
The script discusses the dynamic selection of the standard deviation (σ) for the Gaussian distribution used in calculating neighbor probabilities. The choice of σ is tailored to fit a specific number of neighboring data points within a certain range of standard deviations. This dynamic adjustment accommodates varying densities of data points in different regions of the high-dimensional space.
🔗 Preserving Local Structures in Lower-Dimensional Space
The paragraph explains the goal of t-SNE to learn a lower-dimensional representation of data points while preserving the local similarity computed in the high-dimensional space. It introduces the concept of using a Gaussian distribution to measure similarities between points in the lower-dimensional space and the challenge of ensuring that these similarities match the original high-dimensional similarities.
📊 The Curse of Dimensionality and t-SNE's Solution
This section addresses the challenge of the curse of dimensionality, which arises when trying to map high-dimensional data into a lower-dimensional space. The video illustrates how the area occupied by data points at a moderate distance from a target point increases exponentially with the number of dimensions. To overcome this, t-SNE modifies the similarity definition in the lower-dimensional space by using a heavier-tailed distribution, such as the Cauchy distribution, to maintain higher probabilities for larger distances.
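Concretely, with radius 1 marking "near" and radius 2 marking "mid", the ratios quoted in the video work out as:

```latex
\text{2-D: } \frac{\pi\,2^2 - \pi\,1^2}{\pi\,1^2} = 3,
\qquad
\text{3-D: } \frac{\tfrac{4}{3}\pi\,2^3 - \tfrac{4}{3}\pi\,1^3}{\tfrac{4}{3}\pi\,1^3} = 7,
\qquad
\text{general } d\text{-D: } 2^d - 1,\ \text{e.g. } 2^{10}-1 = 1023.
```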
📈 Utilizing the Student's t-Distribution for Similarity
The video explains the rationale behind using the Student's t-distribution with one degree of freedom (equivalent to the Cauchy distribution) in the lower-dimensional space. This distribution has heavier tails, allowing t-SNE to accommodate the increased probability of selecting a neighbor over a broader range of distances, thus addressing the limitations imposed by the curse of dimensionality.
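The resulting low-dimensional similarity, as defined in the original paper, uses the Student-t (Cauchy) kernel in place of the Gaussian:

```latex
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```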
🔧 Gradient Descent Optimization for t-SNE
The final paragraph outlines the optimization process of t-SNE, which involves using gradient descent to minimize the KL Divergence between the original high-dimensional similarities (p_ij) and the new low-dimensional similarities (q_ij). The goal is to find the lower-dimensional representations (y_i, y_j) that best preserve the local structures of the high-dimensional data.
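For reference, the loss and its gradient (as given in the original paper; the video only sketches this step) are:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})\,(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
```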
🌐 Comparing t-SNE with PCA and Conclusion
The video concludes by contrasting t-SNE with PCA, highlighting that t-SNE is more focused on preserving local structures and does not assume linearity in the data, making it suitable for non-linear datasets. In comparison, PCA is a linear method that works well with linearly correlated data. The presenter invites viewers to leave questions or comments and encourages them to like, subscribe, and watch more videos on similar topics.
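To make the contrast concrete, here is a small side-by-side sketch (assumed setup, not from the video) running both methods on the same dataset:

```python
# Illustrative comparison: PCA is a fast linear projection computed globally;
# t-SNE is slower but tries to keep neighboring points as neighbors.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)              # 64-dimensional digit images

Y_pca = PCA(n_components=2).fit_transform(X)     # global, linear, near-instant
Y_tsne = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(X)   # local, non-linear, slower
print(Y_pca.shape, Y_tsne.shape)
```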
Keywords
💡Dimensionality Reduction
💡t-SNE
💡Principal Component Analysis (PCA)
💡Probability Distribution
💡Gaussian Distribution
💡KL Divergence
💡Stochastic
💡Embedding
💡Curse of Dimensionality
💡Student's T-Distribution
Highlights
Introduction to T-Distributed Stochastic Neighbor Embedding (t-SNE), a method for reducing the dimensionality of high-dimensional data sets.
Comparison between t-SNE and Principal Component Analysis (PCA), highlighting t-SNE's unique approach to dimensionality reduction.
Explanation of t-SNE's operation at the level of individual data points, preserving relationships between data points on a pair-by-pair basis.
The concept of probability in selecting a neighbor data point based on distance in the high-dimensional space.
The formulation of probability as a function of distance, emphasizing the likelihood of choosing closer data points as neighbors.
Inclusion of all other data points in the probability function so that neighbor selection is relative to each point's local context within the data set.
The properties required for the probability function, including negative correlation with distance and positive correlation with relative closeness.
Introduction of the specific function recommended by t-SNE authors, resembling the division of two Gaussian probability density functions.
Validation of the function's properties to ensure it meets the criteria for neighbor selection based on distance.
The dynamic selection of standard deviation (σ) for each data point to fit a certain number of neighboring data points within a Gaussian distribution.
The definition of similarity score (p_ij) as a balance between the probabilities of data points grouping together.
The challenge of preserving local structures in the lower-dimensional space while reducing dimensionality.
The mechanics of translating high-dimensional data points to a lower-dimensional space while preserving similarity.
The issue of the curse of dimensionality and its impact on the algorithm's ability to pack mid-distance data points into a lower-dimensional space.
The solution to the curse of dimensionality by using a heavier-tailed Student's T-distribution with one degree of freedom in the lower-dimensional space.
The rationale behind using a Student's T-distribution to maintain similarity over a broader range of distances in the lower-dimensional space.
The use of KL Divergence to minimize the difference between the original high-dimensional similarities (p_ij) and the new low-dimensional similarities (q_ij).
The final step of finding the low-dimensional representations (y's) using gradient descent to minimize the KL Divergence.
The contrast between t-SNE's pair-by-pair approach and PCA's global variance approach, with t-SNE being more suitable for non-linear data sets.
The computational complexity of t-SNE compared to PCA, due to its focus on individual data point relationships.
Transcripts
[Music]
hey what's up everybody today we're
going to be talking about a very cool
method in machine learning and data
science to reduce the dimensionality of
a data set who has very very high
dimensional data in a very different way
than for example principal component
analysis and at the end of the video
we'll talk about pros and cons of those
two methods
but this method is officially called the
T distributed stochastic neighbor
embedding that's a very long title I
couldn't even fit all of the words
without these little abbreviations here
so people typically call it t-SNE so we
have t s n e t-SNE and as we were
saying before the idea is that we come
to this method with a bunch of data
points in a pretty high dimensional
space
so we have n data points X1 X2 all the
way to xn and each of these are vectors
in a very high dimensional space you can
think of dozens or hundreds or even
thousands of dimensions for each of
these vectors
and our goal is to eventually get a set
of n vectors y1 Y2 all the way to YN who
are in a much much lower dimensional
space typically a two or three
dimensional space which lets us do
things like visualize this data which we
couldn't do before
so we're going to start by asking a
series of questions the key Insight the
key idea behind t-SNE and what makes it
fundamentally different from PCA is that
t-SNE is operating on a unit by unit level
so it's looking at X1 and X2 and trying
to preserve something about the
relationship between those two data
points in your original data
it's trying to preserve that
relationship in the new data which is
going to be in lower Dimensions so it's
working on a pair-by-pair basis whereas
PCA is working on a more Global level
we'll say a lot more on that later
the first question we're going to ask
here is given some data point in this
set of n data points x i
what's the probability that it would
pick some other data point x j as its
neighbor
this kind of seems like a vague question
but we can quantify we can Define that
quantity as P J given I so this notation
says given that you are the data point I
what's the probability that you would
pick the data point J as your neighbor
now there's lots of ways you could
Define this but one very natural way is
going to be making it a function of the
distance between x i and XJ again in
this High dimensional space
so we're going to say the probability
that if I'm I I would pick J as my
neighbor is going to be some function I
don't know what the function looks like
yet but it's going to be some function
of the distance between x i and x j the
smaller this distance is the more likely
I would want to pick that one as my
neighbor because it's closer to me
the further this data point x j is the
less likely I want to pick it as my
neighbor
now this formulation almost makes sense
but it ignores the fact that there's all
these other data points going on in our
data there's not just x i and x j imagine
you're x i and you're being asked who are
you going to pick as your neighbor well
I need to know information about all the
other data points are there others that
are way closer to me is XJ actually the
closest to me and everyone else is
further away
so while this attempt number one does
start getting us to the right place
we're going to need to expand what our
function is based on so we're going to
say this probability is now a function
of I've just made a shorter notation
here d i j squared is just this distance
between I and J squared and it's going
to include information about the same
distances from I to all of the other
data points in our data so all distances
ik where K is not equal to I or not
equal to J and whatever function this is
is going to need to have a couple
properties and I've noticed now I've got
these reversed so we're doing this in
real time let me fix this so this is
going to be negative and this is going
to be positive so the first property is
that the bigger the distance gets
between I and J the lower we're going to
have the probability I would pick that
as my neighbor it's going to be
negatively correlated and the other
property is for all the other data
points that I'm not considering that are
not equal to i or j if the distance to them
gets bigger and bigger and bigger then
everything else held constant J becomes
more and more attractive because it's
relatively closer to me and so my
probability should go up in that
situation so what might such a function
f look like
and the authors of t-SNE recommend
using this function here which is going
to seemingly come out of nowhere but
I'll try to motivate it as best I can
so they say this function p j given i the
probability I would pick J as my
neighbor if I am the data point I is
going to be this very ugly looking form
but if you stare at it for a little bit
and try to remember what it looks like
from your introductory statistics
courses this exactly looks like the
division of two normal probability
density functions two gaussian
probability density functions and that's
exactly where the motivation comes from
so the paper authors say that we're
going to assign these probabilities PJ
given I proportional to the density of a
normal
of a normal distribution of the distance
so the main things we should check first
are do the properties you want out of
this function based on what we just came
up with here actually hold well this
distance here so this guy here is the
distance between x i and x j squared if
this increases then because there's a
negative sign in front of it the whole
exponent here is going to decrease and
if we take e to the power of an Ever
decreasing number then the entire
numerator is going to decrease so we
exactly have the first criteria and we
can see it we also have the second
criteria met if any of the other data
points who are not equal to i or j in the
denominator if they increase then the
entire denominator is going to decrease
and therefore the entire probability is
going to increase exactly as we wanted
so while this form itself May seemingly
come out of nowhere and to be quite
honest with you we probably could have
chosen different forms this was just the
form that was chosen perhaps for
mathematical convenience in the actual
algorithm the properties we want do in
fact hold and that's all we really care
about for right now but the motivating
factor behind this was that we're going
to assume that as the distance moves
further away the probability that you
would choose data point J if you were
data point I is going to be proportional
to the normal distribution of the
distances between those two and also the
distances between you and all the other
data points that exist out there
so the next question we're going to ask
is well this was telling us about data
points I and data point J let's zoom out
a little bit and say if we were to pick
a point at random from all of our n data
points what's the probability that I and
J end up grouped together
well this could happen in two different
situations the first is that the random
point we chose is data point I
that happens with the 1 over n
probability because all the data points
have equal probability of being picked
in this situation and given we chose
data point I what's the probability that
it would choose data point J as its
neighbor well that's exactly what we
spent so much time coming up with here
and that's exactly the quantity PJ given
I
now there's one other way that we could
have i and j grouped and that would be if
the data point we pick in this little
experiment here was data point J
and if we picked data point J which
happens with the one over n probability
then the probability that i and j would be
grouped is now the same as the
probability that data point J would pick
data point I which is going to be just
the opposite here so given I'm data
point J what's the probability I'd pick
data point I
and now these two things together
if we do this multiplication in this
addition we're going to Define as p i j
which is going to be some form of
similarity score
between the data points I and J because
that represents the probability that
they would end up together in a group
and this captures both it's kind of a
balance in fact it's a literal averaging
between this pji and this p i j so we
can put p j i plus p i j and we divide
that by the number of data points there
are and this gives us a score which we
can interpret as a probability of I and
J being similar to each other and this
is all going to be based on their
physical closeness to each other in this
High dimensional space
now there's a note that I kind of
glanced over here some of you might be
wondering how do I pick this Sigma I
because when you have a gaussian
distribution it's defined by a mean and
a standard deviation and I didn't really
talk about how that standard deviation
which based on the subscript I seems to
be a function of I gets set
there's a ton more details in the
original paper which I'll link in the
description below but here's the general
intuition
we're going to pick that standard
deviation dynamically for each of our
data points from 1 to n
so that it's going to fit a certain
number in general it's going to fit a
certain number of neighboring data
points in some number of standard
deviations of that gaussian distribution
and so that seems a little bit abstract
but let me make it a little bit more
concrete here now let's say you're
looking at some data point I and you're
looking at all the other data points and
what distance they have away from your
data point I so this vertical line here
is going to be a distance zero and I
didn't even need the negative side of
this plot should have really just drawn
the positive side but let's say you have
four other data points that are right
here
so you would use this red distribution
here if your rule was something like
within the first two standard deviations
of my normal distribution which would
cut off around here I want four data
points and that's exactly what we've
been able to achieve within this first
two standard deviations you have four
data points now here's a different case
you have for a different data point x
sub I that you might be looking at let's
say that its neighbors are here here
here and here
now you can't use that same red normal
distribution because if you were to then
the first two standard deviations you
only have two data points there's the
sparsity issue going on and so we allow
this Sigma sub I to be dynamic so that
we can use this black normal
distribution here because within its
first two standard deviations we do have
now those four neighboring data points
so there are a lot more details I am
oversimplifying it a little bit here but
the general idea is that if around
whatever Point you're looking at
there's a lot of very close neighbors so
it's in a very dense part of the space
then you can set Sigma sub I to be low
you don't need a really big standard
deviation for your normal distribution
if on the other hand your data point x i
is in a very sparse part of the space
where it's here but its closest
neighbors are like really far away
which is the situation we were looking
at with this black points here then you
want to set that standard deviation
appropriately to be higher so let's take
stock of where we're at so far we have p
i j which is going to give us some
measure of similarity between the data
point I and the data point J
and now we can start talking about how
do we get to that lower dimensional
space
the goal is to learn a low dimensional
representation again in t-SNE that
usually means a two or three dimensional
representation because people typically
want to do this so that they can
actually visualize their data on some
kind of plot
so we're going to translate our X1 X2
all the way to xn which are high
dimensional into y1 Y2 all the way to YN
which are low dimensional for example
they could be in two-dimensional space
and we want to do it in such a way that
we're going to preserve the similarity
that we just computed which are again
these p i JS between data points x i and
XJ so if the similarity between data
points X1 and X2 was like 0.1 then we
want to do our best to preserve that
similarity in the data points y1 and Y2
even though they're in a lower
dimensional space and this is probably
the most important part of the video the
intuition the goal we're trying to do
the how is really just more mechanics
but the power behind t-SNE is that it's
looking at pairs of data points and
seeing what similarity we gave it based
on this formulation here and saying that
you know what I know we want to put the
data in a lower dimensional space but I
don't want to sacrifice on these
similarities I want to preserve these
local structures between my data points
that were there before I want those
local structures to still be there right
now
and now we can get into the actual
mechanics how are we going to actually
do this well why don't we just try the
most straightforward thing first based
on what we've already used in this
method so far and that's going to be
just like with the X's how we measured
this similarity using this gaussian
framework why don't we just do the exact
same thing with the Y's so just like for
the X's we measure the similarities
using the gaussian just do the same
thing with the Y's measure them using
the gaussian and now you'll have some
values q i j which are the similarities
between the Y's and now all you need to
do is make sure the qijs and the pijs
are similar to each other
now that is a great first attempt and
avoids any unnecessary complications but
we should talk about a very subtle thing
that goes wrong when we do that and this
is all under the umbrella of the curse
of dimensionality which we knew was
going to come along when we're dealing
with very high dimensional data and here
is how it rears its ugly head
to demonstrate this I'm just going to
show in two and three dimensions so
we're going to treat our two-dimensional
space as our Target space for t-SNE and
we're going to say the three-dimensional
space is the original space the data
points are in of course your data points
are probably going to be in a way higher
dimensional space but this is just for
illustration
what I want to do here in two Dimensions
because I can draw in two dimensions for
you all not well apparently but what I'm
going to do in two Dimensions is draw an
area of two-dimensional space who has a
distance of one from the origin and
that's this little green circle here
and we're going to call this the nearby
area all these points are going to be
nearby to whatever my target data point
in the origin is
this Blue Area we're going to call the
mid distance area so it's between a
distance of one and two
and what I want to do here is a very
simple equation I want to divide the
area of the mid region which is this
outer blue ring by the area of the near
region which is this inner green circle
now we know how to do it just pi r
squared is the area of a circle and so
the area of the mid region is going to
be the area of this big circle minus the
area of the small circle which is the
numerator and the area of the nearby is
just going to be the area of the small
circle or Pi 1 squared and we get that
that ratio is equal to 3.
so what this means is that the mid size
the area of the mid region where those
data points were a middle distance away
from me is three times the area of the
data points we're allowed to be a near
distance from me now stay with me here
let's do this for one dimension up in
this case we're not dealing with areas
we're dealing with volumes and I haven't
even dared to draw it but you can
visualize a sphere of points who all
have a distance of one or less from the
origin or the unit sphere
is going to be what we're calling the
volume of the nearby region in three
dimensions very analogous to the area of
the nearby region in two dimensions and
the volume of the mid in three
dimensions is going to be the area
between a sphere who has distance two
just like we had this circle with
distance two everything between the
outside of that and the outside of the
inner sphere
so we can do a very similar formula we
appeal to the volume of a sphere here
and we get that the ratio is equal to
seven
and now here's the punch line
in two-dimensional space we found that
the area of the mid data points was
three times the area of the near data
points but in three-dimensional space
the volume in which the mid-level data
points are allowed to occupy is a whole
seven times bigger than the volume of
the nearby data points are allowed to
occupy
and where this is all leading up to how
we can verbalize the problem here is
exactly like this
when we're going from a high dimensional
space to a low dimensional space in this
case just kind of trivializing it going
from a three-dimensional space to a
two-dimensional space we're trying to
take all these mid-level data points
where a mid distance away from whatever
Target data point we care about
and that has you can say a size of seven
and we're trying to pack that into a
mid-size area in the lower dimensional
space who has a you can say size of
three
so the problem is that we're trying to
pack all the midpoints in high
dimensional space the midpoints in a
high dimensional space into a much much
smaller space in a lower Dimension and
the problem doesn't seem so pronounced
with just two and three dimensions but
imagine that your data point has just 10
Dimensions which is not crazy by any
data science standards typically when we
use t-SNE we're trying to embed like
1000 or 100 dimensional embeddings
even if your data just has 10 Dimensions
this problem becomes such that you're
trying to take a mid-size space and I'm
kind of using this word loosey-goosey
but what I'm really trying to say is the
area in that higher dimensional space
which is occupied by the data points who
are a moderate distance away from you
and with a 10-dimensional space that has
a 1023 ratio with the nearby area and as
we saw in a two-dimensional space that
has a ratio of just three
so we run into just this kind of like
I'm running out of space I can't take
all the data points who have a certain
characteristic in a high dimension and
pack them into the same characteristic
in the low Dimension there's just not
enough room in that low dimensional
space
and the way we solve for that is
actually changing up our definition of
similarity in that lower dimensional
space and we do that in a visual way so
in the high dimensional space we are
using this normal distribution where the
x-axis here is distance away from your
target data point
and the y-axis is telling you what's the
probability that I would pick that data
point as my neighbor
and as we said this drops off as a
gaussian the further the data point away
the lower that we have this probability
you pick it as your neighbor and the
trend is gaussian
now as we said let's focus on a very
specific area of this plot between
distances of one and two which is what
we were classifying as moderate
distances in the high dimensional space
now if it's a distance of one away then
the probability I would pick that as my
neighbor is around here and if it's a
distance of 2 away the probability I
would pick that as my neighbor is like
right here but as we said in the low
dimensional space we don't have enough
room and so the way we solve that is
just by changing expanding our
definition of what it means to be in a
mid-distance region the area
between one and two distance is just too
restrictive I need a lot more range to
work with in order to fit all these data
points that you want me to fit and so
what we might say for example is I want
the whole range between distances 1 and
4.
and what that implies is that whatever
probability we had for picking J as our
neighbor if its distance was two we now
need to accept that same probability if
the distance is four because we're
expanding the definition of what it
means to be a mid-distance point since we
didn't have enough space in that lower
dimensional space
so what that means is that if we
continue this line which was at 2 we're
now asking for that line to be at four
and what this implies even though I
really should have drawn this more
exaggerated what this implies is that
whatever distribution we're now using to
translate distances to scores or
probabilities is going to need to have
much heavier Tails than the normal
distribution because if we try to draw
this thing now it can't fall off at the
same rate as the normal distribution it
needs to fall off at a much heavier tail
it needs to maintain a higher
probability for longer and longer and
longer
and therefore we can't use a normal
distribution at all because its tails
fall off at the same rate no matter what
standard deviation you're talking about
so we switch in our lower dimensional
space to a heavier tailed Student's t
distribution with a degree of freedom
equal to one which is actually the
same thing as a Cauchy distribution
now you don't need to at all get into
the nitty-gritty of what these
distributions are and their PDFs or
anything really I just want you to focus
on the picture here
we're just picking a distribution who
has a heavier tail than the normal
distribution and the reason we're doing
that is to deal with this curse of
dimensionality issue
and the main thing you need to realize
between these two plots is that whatever
the probability was at 2 in the higher
dimensional space we're asking the
probability to remain there for longer
and longer and longer to accommodate
higher and higher and higher distances
away to have the same property
to have the same similarity as we did
for these smaller distances in the high
dimensional space
so this honestly was the part of
learning about this algorithm that took
the most time for me to wrap my head
around I was like what is going on why
are we not just using the normal
distribution this seems to have come out
of nowhere but I'm hoping that perhaps I
over explained it here and it was pretty
obvious too but I would rather over
explain than under explain because you
can always just seek the video to where
you want but hopefully this helped
motivate why we are using a student's T
distribution and the other reason I
really wanted to get that across is
because it's in the name of this whole
thing it's called the T distributed
stochastic neighbor embedding and while
we're on this page let's make sure we
understand where all these terms come
from we know exactly about the
t-distribution part now uh stochastic
because this is all based on random
processes we're dealing with
probabilities here neighbor because
we're dealing on a neighbor by neighbor
basis That's the basis on which this
algorithm works and embedding because
we're trying to get our high dimensional
embeddings which we are at here to a
much lower dimensional space which is
where we are going right now
so we're almost done here
now that we've accepted that we're going
to use a one degree of freedom Student's
t distribution more simply known as a
Cauchy distribution we can go ahead and
just do the same form as we had a couple
of pages ago so this form while it looks
complicated is really just mirroring the
pij form we had before
so now we're saying qij which is going
to be the similarity or the probability
between the i-th and j-th of your new
lower dimensional embeddings y i y j is
going to be defined in this way and the
reason the form looks like this is
because that's the probability density
function of the Cauchy distribution or
the student's T distribution with one
degree of freedom and the denominator
we're of course normalizing that by
dividing by all the other data points
that exist in there
the final step is that well now we have
our pijs which tell us how similar the
higher dimensional embeddings were for I
and J we have qij which tells us how
similar our lower dimensional embeddings
were for data points I and J we want
them to be close to each other because
that's the whole point of t-SNE is
that we want to make sure that if pij
was a certain number qij should be as
close to that as possible
and we can show that pij and qij are
both probability distributions and what
does it mean for us to enforce the two
probability distributions are close to
each other we can use our good friend
the KL Divergence
and we have a whole video on KL
Divergence I'll post that below but in a
nutshell it's just telling us what's the
level of difference between two
probability distributions we want the KL
Divergence between the P which is the
original High dimensional similarities
and Q which is the new low dimensional
similarities to be very small ideally
you want to be zero but we can't have
everything we want because we're going
we're losing information by going to
low Dimensions so we want it to be as
low as possible
and finally the way we actually find
these y's is via gradient descent so you
can think of this as our loss function
and we could just take the derivative of
this loss function with respect to all
these y I's and yjs and move the Y I's
and Y J's until the KL Divergence is as
low as possible
and so that's it that is the idea behind
t-SNE the key idea even though there
was seemingly a lot of math here the key
idea is really that t-SNE is looking on
an individual data point by data point
basis and here we can have a quick
discussion on how it's different than
PCA t-SNE is more complicated than PCA it's
asking for more it's saying not
only do I care about General Trends in
your data set like PCA does PCA remember
is talking about what is the direction
of the maximum variance of your data set
I'm going to choose that as my first
principal component then what's the next
dimension of the highest variance of
your data set that's kind of looking on
a global level about what's the variance
of your entire data set t-SNE is much
more zoomed in it's looking on a
individual I and J data point level and
trying to enforce that local structure
there so it typically does a lot better
job of clustering of making sure that if
two data points were close in the high
dimension they're close in this lower
dimension
but it also comes with that added
computational complexity because we do
need to care about things on a
pair-by-pair basis it's going to take a
lot longer for t-SNE to run on the
same data set than PCA
because PCA at the end of the day even
though it is a bit complicated is just
doing a bunch of these linear
Transformations on your data and that
brings us the final point for this video
is that PCA is a linear method as much
as it's going on in PCA it's just doing
linear Transformations it's assuming
linear correlations and so it works
really well for linear data
but t-SNE has no such assumptions it's
again just looking at pairs of data
points makes no assumption about
linearity and therefore it can work a
lot better when your data set is not
linear so hopefully you enjoyed this
video hopefully you learned a lot better
than you did before about how t-SNE works
if you have any questions or comments
please leave them in the section below
thank you so much for watching like And
subscribe for more videos just like this
I'll catch all you wonderful people next
time