Week 3 Lecture 19 Linear Discriminant Analysis 3
Summary
TLDR: The lecture discusses between-class variance in the context of linear discriminant analysis (LDA), focusing on maximizing the distance between class means while keeping the within-class variance small. It explains how to find the optimal direction 'w' for class separation without assuming Gaussian distributions, using the Fisher criterion, which maximizes the between-class variance relative to the within-class variance. The summary also covers the generalization from two classes to multiple classes and the need for a constraint on 'w' to avoid unbounded solutions.
Takeaways
- 📚 The 'between-class variance' discussed is the variance among the projected means of the different classes, which is central to understanding how well the classes are separated after projection.
- 🔍 When dealing with two classes, the goal is to maximize the distance between their projected means, which is a fundamental objective in binary classification.
- 📉 The 'between class variance' is maximized relative to the 'within class variance', which is a key principle in Linear Discriminant Analysis (LDA).
- 📈 The 'within class variance' is calculated by considering the variance of data points with respect to the class mean, which is essential for understanding the spread of data within each class.
- 📝 In the context of LDA, the 'Fisher criterion' is used to find the optimal direction (w) that maximizes the ratio of between-class variance to within-class variance.
- 🔢 The lecture emphasizes the importance of a constraint on 'w' to avoid unbounded solutions, typically by requiring the norm of 'w' to be one.
- 📐 When only the between-class separation is maximized under this constraint, the direction of 'w' is proportional to the difference between the two class means (m2 - m1), a key intermediate step in LDA.
- 🧩 The lecture discusses the generalization from the two-class case to multiple classes, indicating that the principles of LDA extend to more complex scenarios.
- 📊 The Fisher criterion is rewritten in terms of the between-class and within-class scatter (covariance) matrices, giving the mathematical formulation for finding the optimal 'w'.
- 🤖 The lecture explains that this derivation of LDA does not rely on the assumption of Gaussian distributions, so the method remains well-defined even when the underlying data is not Gaussian.
- 🔑 The final takeaway is that the Fisher criterion and the Gaussian assumption lead to the same direction for 'w' up to scaling, highlighting the versatility of LDA.
Q & A
What is class variance in the context of the transcript?
-Class variance in this context refers to the variance among the means of different classes, specifically the variance of the projected means of the classes in a dataset.
What is the significance of maximizing the distance between the projected means of two classes?
-Maximizing the distance between the projected means of two classes is a way to enhance the separability of the classes, which is a key objective in classification tasks.
What does the term 'within class variance' refer to?
-Within class variance refers to the variance of the data points within each class with respect to the class mean, which is a measure of the spread of the data points within the class.
Why is it necessary to have constraints when maximizing the between class variance?
-Constraints are necessary to prevent unbounded solutions. Without constraints, one could arbitrarily scale the weight vector 'w' to achieve larger values, which would not be meaningful in the context of the problem.
What assumption is commonly made to ensure that the solutions are not numerically unbounded?
-A common assumption is to constrain the norm of the weight vector 'w' to be one, which is expressed as the constraint that the sum of the squares of the weights equals one.
What is the 'Fisher criterion' mentioned in the transcript?
-The Fisher criterion is a statistical method used to maximize the ratio of between-class variance to within-class variance, named after the statistician Ronald Fisher, who introduced it in the context of linear discriminant analysis (LDA).
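In the notation used later in the lecture, where SB and SW are the between-class and within-class scatter matrices, the criterion being maximized can be written compactly as:

```latex
J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_B\, \mathbf{w}}{\mathbf{w}^{T} S_W\, \mathbf{w}} .
```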
How does the direction of the weight vector 'w' relate to the means of the classes?
-When only the between-class variance is maximized under a norm constraint, the weight vector 'w' is found to be in the direction of the difference between the means of the two classes (m2 - m1). Once the within-class variance is also taken into account, 'w' becomes proportional to Sw^{-1}(m2 - m1), which is the direction that best separates the classes.
What is the relationship between the within-class covariance matrix and the Fisher criterion?
-The within-class covariance matrix is used in the denominator of the Fisher criterion to represent the within-class variance, which is what the between-class variance is being maximized relative to.
Why is it said that LDA does not only work when the distributions are Gaussian?
-The derivation of the LDA in the transcript does not rely on the Gaussian assumption for the class-conditional distributions, indicating that LDA can be well-defined and effective even when the underlying distributions are not Gaussian.
What is the significance of the threshold w0 in the context of classifying data points?
-The threshold w0 is used to classify data points based on the projection defined by the weight vector 'w'. If the projection of a data point is greater than w0, it is classified as one class, and if it is less than or equal to w0, it is classified as another class.
How does the transcript relate the concept of centroids to the discussion of class variance?
-The centroids of the data, which are the means of the classes, play a crucial role in calculating the projected means and the variances, both within and between classes, which are central to the discussion of class variance.
Outlines
📊 Understanding Class Variance in Machine Learning
The speaker introduces the concept of class variance in the context of machine learning, focusing on the variance among class means. The explanation involves the computation of variance between the projected means of 'k' classes, emphasizing the importance of maximizing the distance between these means. The discussion simplifies to a two-class scenario to illustrate the concept of maximizing the variance between the centers of the classes relative to the within-class variance. The speaker also addresses the issue of unbounded solutions by introducing constraints on the norm of 'w', which is set to one to avoid scaling issues.
🔍 Maximizing Between-Class Variance with Constraints
This paragraph delves deeper into maximizing the distance between class means, which is the first criterion for the model. The speaker discusses the potential problem of unbounded solutions when scaling 'w' without constraints, leading to arbitrarily large values. To counter this, a constraint is introduced to keep the norm of 'w' equal to one, ensuring that the solution remains numerically bounded. The direction of 'w' is identified as being proportional to the difference between the means of the two classes, highlighting the importance of this direction in class separation.
📈 Projecting Data Points and Within-Class Variance
The speaker explains the process of projecting data points onto a line defined by 'w' and classifying them based on a threshold 'w0'. The focus then shifts to within-class variance, which is calculated by considering the projected distance of data points from the projected mean of their respective classes. The paragraph introduces the concept of the Fisher criterion, which is used to maximize the ratio of between-class variance to within-class variance, without making any assumptions about the underlying data distribution.
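As an illustration of the quantities in this paragraph, here is a minimal NumPy sketch (not from the lecture; X1, X2 and w are illustrative names) that computes the projected means, the within-class scatter of the projections, and the resulting Fisher ratio for a given direction w:

```python
import numpy as np

def fisher_ratio(X1, X2, w):
    """Fisher criterion J(w) for two classes of points projected onto direction w."""
    y1, y2 = X1 @ w, X2 @ w              # 1-D projections of each class
    m1, m2 = y1.mean(), y2.mean()        # projected class means
    s1 = ((y1 - m1) ** 2).sum()          # within-class scatter of class 1 (no 1/N factor)
    s2 = ((y2 - m2) ** 2).sum()          # within-class scatter of class 2
    return (m2 - m1) ** 2 / (s1 + s2)    # between-class over within-class
```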
🧩 Deriving the Optimal Direction for 'w'
In this section, the speaker discusses the mathematical derivation for finding the optimal direction of 'w' by maximizing the ratio of between-class variance to within-class variance. The process involves differentiating with respect to 'w' and setting the result to zero. Since SBw always points in the direction of the difference between the class means, the solution reduces to 'w' being proportional to Sw^{-1}(m2 - m1). The quadratic forms wTSBw and wTSww contribute only scalar factors, so they affect the magnitude of 'w' but not its direction.
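A minimal NumPy sketch of this closed-form direction, assuming two classes given as row-wise arrays X1 and X2 (illustrative names) and an invertible within-class scatter matrix:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction maximizing J(w): proportional to Sw^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)    # class mean vectors
    S1 = (X1 - m1).T @ (X1 - m1)                 # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)                 # within-class scatter of class 2
    Sw = S1 + S2                                 # total within-class scatter matrix
    w = np.linalg.solve(Sw, m2 - m1)             # Sw^{-1}(m2 - m1) without an explicit inverse
    return w / np.linalg.norm(w)                 # the scale of w is arbitrary
```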
📚 Comparing Motivations for Linear Discriminant Analysis (LDA)
The final paragraph compares two different motivations for deriving the linear discriminant analysis (LDA). The first motivation is based on maximizing the ratio of between-class variance to within-class variance, while the second is based on the assumption of Gaussian class-conditional densities. The speaker emphasizes that both approaches lead to the same direction for 'w', modulo scaling factors, and that LDA can be applied even when the underlying data distribution is not Gaussian, as it relies on sample means and variances rather than distribution assumptions.
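A small synthetic check of this equivalence, under the assumption that the Gaussian-motivated direction is computed from the pooled sample covariance (data and names here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))      # toy class 1
X2 = rng.normal([2.0, 1.0], 1.0, size=(120, 2))      # toy class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
pooled_cov = Sw / (len(X1) + len(X2) - 2)                 # pooled covariance (Gaussian motivation)

w_fisher = np.linalg.solve(Sw, m2 - m1)                   # Fisher-criterion direction
w_gauss = np.linalg.solve(pooled_cov, m2 - m1)            # Gaussian-LDA direction

cos = w_fisher @ w_gauss / (np.linalg.norm(w_fisher) * np.linalg.norm(w_gauss))
print(cos)   # ~1.0: the two directions agree up to a positive scale factor
```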
Keywords
💡Class Variance
💡Projected Means
💡Within-Class Variance
💡Between-Class Variance
💡Fisher Criterion
💡LDA (Linear Discriminant Analysis)
💡Gaussian Assumption
💡Covariance Matrix
💡Optimization
💡Threshold
💡Centroids
Highlights
Exploration of class variance as the variance among class means, emphasizing the importance of projected class means.
Introduction of the concept of maximizing distance between projected class means for a two-class scenario.
Generalization of the concept to 'k' classes, highlighting the maximization of variance among 'k' centers.
Discussion on the within-class variance, defined as the variance with respect to the class mean.
Simplification of the problem by starting with a two-class case before generalizing to multiple classes.
Explanation of the decision surface defined by wTx and the classification threshold w0.
Definition of the means of classes C1 and C2 and the notation used for their projections.
The problem of unbounded solutions due to unrestricted scaling of 'w' and the proposed constraint.
The numerical approach to ensure 'w' is not unbounded by setting the norm of 'w' to one.
Derivation of 'w' being in the direction of m2 – m1 when only the between-class separation is maximized; for spherical classes the threshold lies at the midpoint between the class means.
Illustration of class separation using Gaussian distributions and the significance of the 1σ contour.
The concept of centroids in data and their relation to the decision boundary.
Introduction of the Fisher criterion and its role in maximizing between-class variance relative to within-class variance.
Differentiation of the criterion with respect to 'w' to find the optimal direction.
The relationship between the Fisher criterion and the class-conditional densities, highlighting LDA's applicability beyond Gaussian distributions.
Final expression for 'w' being proportional to Sw^{-1}(m2 – m1), taking the within-class covariance into account.
Understanding J(w) as the ratio of between-class variance to within-class variance, aiming for maximization.
Transcripts
Okay, so when I say between-class variance, I mean the variance of the class means: I take the classes, look at their means, look at the projected means of those classes, and compute the variance among the projected means. Suppose I have k classes; I can compute the variance among those k projected means. If I have two classes, what does this amount to? Maximizing the distance between the two projected means. If it is k classes, it will be maximizing the variance among the k centres, relative to the within-class variance. And what would be the within-class variance? For each class, the variance with respect to that class's mean, which is what we already computed, but now per class. That is the within-class variance, which is what I am looking at here.
Let us treat the first condition alone. For simplicity's sake I will start off with the two-class case, and then we can think of the generalization to multiple classes. I am going to have a surface defined by wTx: so y = wTx, and if it is greater than some w0 I am going to classify the point as class one; if it is less than (or less than or equal to) w0, I will classify it as class two.
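Written out, the decision rule being described is:

```latex
y(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x},
\qquad
\hat{C}(\mathbf{x}) =
\begin{cases}
C_1 & \text{if } y(\mathbf{x}) > w_0,\\
C_2 & \text{otherwise.}
\end{cases}
```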
Sorry, my font went too small. I am going to say that m̄1 and m̄2 are the means of C1 and C2, and we know how to compute those. I am going to assume that when I write mk without the bar, it is the projected one, that is, the projection of the mean in the direction wT. The reason I am using this funny notation is that in the textbook, if the symbol is bold it is the mean vector m1, and if it is unbolded it is the projection; but I cannot write bold every time on the board, so I am using the bar instead, and when you read the book you can translate back. For this part of the lecture the reference changes: up to this point it was from Hastie, Tibshirani and Friedman (ESL); for this part alone you read PRML (Pattern Recognition and Machine Learning) by Bishop. The textbook reference is given there.
So what is my goal? When I say I want to maximize the between-class variance, it is essentially to maximize this quantity: wTm̄2 is the projection of m̄2 on w, wTm̄1 is the projection of m̄1 on w, and I am trying to maximize the difference between them. That is essentially my first criterion: find the direction w that maximizes this. Now, some alarm bells should be ringing for you. What is the problem? If I do not have any bounds on w, I can arbitrarily scale w and get larger and larger values, so I will have to impose some constraint.
The constraint is that the summation over wi squared equals one; essentially, the norm of w is one. That is an assumption we will make frequently, to make sure that we do not get unbounded solutions; otherwise the problem is numerically unbounded. Yes, good question: you could impose an inequality constraint saying that the sum of wi squared is less than or equal to 1, but what do you think will happen? You are maximizing, so you can just scale w; essentially you will scale it until the norm hits 1 anyway. Even with the less-than-or-equal-to constraint, because you are maximizing over w you will hit the boundary, so you might as well leave it as an equality.
So you can solve this constrained problem, and the take-home message is the form of w: you add the constraint with a multiplier, take the derivative, and the w-dependent term becomes linear in w, so up to some constants you get that w will be in the direction of m̄2 − m̄1. What does this mean? Take the two class means; and you can go back and show that if the classes are spherical the constant works out to one half, so the threshold will sit at the midpoint of the line joining the two means.
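The maximization being described, with a Lagrange multiplier for the norm constraint, can be reconstructed as follows (using the bar notation for the mean vectors):

```latex
\max_{\mathbf{w}}\; \mathbf{w}^{T}(\bar{m}_2 - \bar{m}_1)
\quad\text{s.t.}\quad \sum_i w_i^{2} = 1
\;\;\Longrightarrow\;\;
(\bar{m}_2 - \bar{m}_1) - 2\lambda\,\mathbf{w} = 0
\;\;\Longrightarrow\;\;
\mathbf{w} \propto \bar{m}_2 - \bar{m}_1 .
```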
So let us do it again. I have two classes; I take the means, and the line joining them gives the direction of the projection. I project everything onto this direction: this side becomes class one, that side becomes class two. Does that make sense? Note that these two lines are actually parallel to each other. I know you did not really want me to repeat the drawing, but I think it helps. So I have class one and I have class two. To make sure everyone understands what I mean when I say class one and class two like this: this is the Gaussian corresponding to class one, and I am drawing its 1σ contour; likewise this is the 1σ contour of the second Gaussian. The data that comes to me could look something like this: the training data will be a mix of + and − in this region. There could be minuses here and pluses there as well, because the Gaussian still extends beyond the contour I have drawn; the contour is only the most probable region for the data points to lie in, it does not mean that outside this contour the probability is zero.
So this is essentially what it means: I am going to get data like this, and I am modeling it, representing each Gaussian by these contours. Now say, roughly, that these points are the centroids of the data I get. What this tells us is: join the centroids by a straight line, take that direction, and project all the data points onto it, so all the data points end up lying along this line. Now fix a threshold, which is what I wrote here as w0: pick a threshold such that above it the point is class 1 and below it the point is class 2. In fact, if the classes had been spherical you can show that the threshold would lie at the midpoint; here we cannot say that in general (you could under special circumstances), but the threshold will be somewhere here. All the data points projected above it I will call plus, and all the data points projected below it I will call minus. That makes sense, right?
But this is not quite what we are looking for; we are missing something important. What is that? The within-class variance. What we have handled so far is the inter-class (between-class) variance; the within-class variance is what we are missing. So now we will start looking at that.
So that is a projected mean, and these are the projected data points belonging to class one, keeping with the terminology we are using there. I pick all the training data points which had class k and look at the squared projected distance from the projected mean; summing this over the classes gives me the total within-class variance. I am going to maximize everything at the end, so I am ignoring things that do not affect the maximization, such as constant factors. Essentially, this is a projected data point and that is the projected mean, and I am taking the variance of the projections, exactly what we did before, except that I have not divided by the number of data points.
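In symbols (following Bishop's notation, where y_n = wTx_n is the projection and m_k is the projected mean of class C_k), the within-class scatter of class k and the ratio being discussed are:

```latex
s_k^{2} = \sum_{n \in C_k} (y_n - m_k)^{2},
\qquad
J(\mathbf{w}) = \frac{(m_2 - m_1)^{2}}{s_1^{2} + s_2^{2}} .
```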
This criterion is called the "Fisher criterion", after Fisher, a very famous statistician who came up with LDA several decades ago. Now I am going to do something slightly confusing: I am going to rewrite it. This SB is the between-class covariance matrix. If you think about it, what I wanted in the numerator was m2 − m1, and m2 is the projected mean, that is wTm̄2. So the numerator is (wTm̄2 − wTm̄1)²; I can take the wT out, and what is left is the outer product of (m̄2 − m̄1) with itself sandwiched between wT and w, which is exactly wTSBw. Now what about Sw?
Likewise for the denominator: s1² + s2² is the total within-class variance of the projections. This part is S1, where I take the w out, and this part is S2, where I take the w out again, and together that gives me wTSww. So now what we want to do is maximize this: we want to maximize the between-class variance relative to the within-class variance, which is what we said. Between-class variance over within-class variance: I take the ratio, and I maximize this ratio.
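Written out with the scatter matrices, the rewriting on the board amounts to (a reconstruction from the definitions above):

```latex
S_B = (\bar{m}_2 - \bar{m}_1)(\bar{m}_2 - \bar{m}_1)^{T},
\qquad
S_W = \sum_{n \in C_1} (x_n - \bar{m}_1)(x_n - \bar{m}_1)^{T}
    + \sum_{n \in C_2} (x_n - \bar{m}_2)(x_n - \bar{m}_2)^{T},
\qquad
J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_B \mathbf{w}}{\mathbf{w}^{T} S_W \mathbf{w}} .
```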
So differentiate with respect to w and set it equal to zero. This is a u/v form, so does anyone want to tell me what the differentiation will be? I will write it, but you should recall those childhood memories; you should not forget whatever you studied to get in here. The denominator of the quotient rule does not matter, because I am equating the whole expression to zero: when you take the derivative you get some term in the denominator, but since the right-hand side is zero I just have to equate the two halves of the numerator, and that gives me the condition shown below. So just refresh your derivatives. The only thing that I am pretty sure is putting everybody off is that we are doing all of this in matrix notation. Just practice; it makes life a lot easier. Do it a couple of times: the best way is to write it out in matrix form in gory detail, take the derivative term by term, and then look at how it simplifies. Then you will see the pattern and you will know exactly what we are writing. These are very simple things; they are quadratics, so you should know how to differentiate quadratics. That is the only thing throwing you off: a term like wTSw is a quadratic in w.
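Equating the two halves of the numerator, as described, gives the condition (a standard reconstruction of this step):

```latex
(\mathbf{w}^{T} S_B \mathbf{w})\, S_W \mathbf{w}
= (\mathbf{w}^{T} S_W \mathbf{w})\, S_B \mathbf{w} .
```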
The derivative of a quadratic is linear in w; that is all, nothing more to it. Now, if you think about it, SBw will always be in the direction of m̄2 − m̄1. You already saw this earlier, when the objective involved only the between-class variance: there we ended up finding that the solution is in the direction of m̄2 − m̄1. With a little bit of work you can show that SBw is always in the direction of m̄2 − m̄1, so I can drop that term and replace it with a vector proportional to m̄2 − m̄1. That makes our life a lot easier: I only have one w left. So what about the remaining factors?
They all simplify to scalar quantities, so finally what I get is a "proportional to" rather than an "equal to", as written below. If I did not have the Sw term, what I got was w proportional to m̄2 − m̄1; but now that I am taking the within-class variance into account, I also have to pay attention to the within-class covariance matrix. That is basically all there is to it.
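Since SBw is proportional to m̄2 − m̄1 and the quadratic forms wTSBw and wTSww are just scalars, the condition above reduces to:

```latex
\mathbf{w} \;\propto\; S_W^{-1}\,(\bar{m}_2 - \bar{m}_1) .
```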
But how does this relate to what we did before? Do you see any relation between this and that? Think about it; it is basically the same thing we were doing there. The Σ inverse there is the Sw inverse here, just in different notation: Sw is the within-class covariance of the data, so Σ⁻¹ there corresponds to Sw⁻¹ here, and where I have m̄2 − m̄1 here I had µk − µl there. So modulo all the other terms that do not depend on x, we are essentially finding the same direction, whether you do it this way, starting with the ratio of between-class variance to within-class variance as your objective function, or you start off by saying that your class-conditional density is Gaussian and then try to find the separating hyperplane.
In both cases you end up with the same direction, modulo some scaling factors, so you can use either motivation for deriving it. But what is the nice thing about this motivation? We did not make any assumption about the class-conditional distribution: the Gaussian assumption is missing here, and we worked only with sample means, sample variances and so on. It tells you that LDA does not work only when the distributions are Gaussian; it is fine even when the underlying distribution is not Gaussian, and there is a well-defined semantics to doing LDA in that case. Is everyone with me on that so far? Great. Any questions? Then let us move on to the next thing.
What does J(w) represent? I told you already: I want to look at the between-class variance relative to the within-class variance. The numerator is the between-class variance and the denominator is the within-class variance, and I am trying to maximize this ratio.