Week 3 Lecture 19 Linear Discriminant Analysis 3

Machine Learning - Balaraman Ravindran
4 Aug 2021 · 25:00

Summary

TLDR: The script discusses the concept of class variance in the context of linear discriminant analysis (LDA), focusing on maximizing the distance between class means while minimizing within-class variance. It explains the process of finding the optimal direction 'w' for class separation without assuming Gaussian distributions, highlighting the Fisher criterion for maximizing between-class variance relative to within-class variance. The summary also touches on the generalization from two classes to multiple classes and the importance of constraints to avoid unbounded solutions.

Takeaways

  • 📚 The 'between-class variance' discussed here is the variance among the projected means of the different classes, which is central to understanding how well the classes are separated in a dataset.
  • 🔍 When dealing with two classes, the goal is to maximize the distance between their projected means, which is a fundamental aspect of binary classification.
  • 📉 The 'between class variance' is maximized relative to the 'within class variance', which is a key principle in Linear Discriminant Analysis (LDA).
  • 📈 The 'within class variance' is calculated by considering the variance of data points with respect to the class mean, which is essential for understanding the spread of data within each class.
  • 📝 In the context of LDA, the 'Fisher criterion' is used to find the optimal direction (w) that maximizes the ratio of between-class variance to within-class variance (the criterion is written out explicitly after this list).
  • 🔢 The script emphasizes the importance of constraints on 'w' to avoid unbounded solutions, typically by assuming the norm of 'w' is one.
  • 📐 When only the between-class criterion is maximized (with the norm of 'w' fixed at one), the optimal direction is proportional to the difference between the class means (m2 - m1); the full Fisher solution additionally involves the inverse of the within-class covariance.
  • 🧩 The script discusses the generalization from a two-class case to multiple classes, indicating that the principles of LDA can be extended to more complex scenarios.
  • 📊 The 'Fisher criterion' is rewritten in terms of covariance matrices, showing the mathematical formulation for finding the optimal 'w'.
  • 🤖 The script explains that LDA does not rely on the assumption of Gaussian distributions, making it a robust method even when the underlying data distribution is not Gaussian.
  • 🔑 The final takeaway is that the 'Fisher criterion' and the Gaussian assumption lead to the same direction for 'w' up to scaling factors, highlighting the versatility of LDA.
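
For reference, the quantities named in these takeaways can be written out for the two-class case (standard notation, matching the PRML treatment the lecture points to; the multi-class generalization changes only S_B):

```latex
% Fisher criterion for two classes, with projected means m_k = w^T \bar{m}_k
J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}
     = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},
\qquad
S_B = (\bar{m}_2 - \bar{m}_1)(\bar{m}_2 - \bar{m}_1)^{\top},
\qquad
S_W = \sum_{k=1}^{2} \sum_{n \in C_k} (x_n - \bar{m}_k)(x_n - \bar{m}_k)^{\top},
\qquad
\text{and maximizing } J \text{ gives } w \propto S_W^{-1}(\bar{m}_2 - \bar{m}_1).
```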

Q & A

  • What is class variance in the context of the transcript?

    -Class variance in this context refers to the variance among the means of different classes, specifically the variance of the projected means of the classes in a dataset.

  • What is the significance of maximizing the distance between the projected means of two classes?

    -Maximizing the distance between the projected means of two classes is a way to enhance the separability of the classes, which is a key objective in classification tasks.

  • What does the term 'within class variance' refer to?

    -Within class variance refers to the variance of the data points within each class with respect to the class mean, which is a measure of the spread of the data points within the class.

  • Why is it necessary to have constraints when maximizing the between class variance?

    -Constraints are necessary to prevent unbounded solutions. Without constraints, one could arbitrarily scale the weight vector 'w' to achieve larger values, which would not be meaningful in the context of the problem.

  • What assumption is commonly made to ensure that the solutions are not numerically unbounded?

    -A common assumption is to constrain the norm of the weight vector 'w' to be one, which is expressed as the constraint that the sum of the squares of the weights equals one.

  • What is the 'Fisher criterion' mentioned in the transcript?

    -The Fisher criterion is a statistical method used to maximize the ratio of between-class variance to within-class variance, named after the statistician Ronald Fisher, who introduced it in the context of linear discriminant analysis (LDA).

  • How does the direction of the weight vector 'w' relate to the means of the classes?

    -When only the between-class criterion is maximized (subject to the norm constraint), the weight vector 'w' points in the direction of the difference between the class means (m2 - m1); once within-class variance is also taken into account, 'w' becomes proportional to the within-class covariance inverse applied to (m2 - m1).

  • What is the relationship between the within-class covariance matrix and the Fisher criterion?

    -The within-class covariance matrix is used in the denominator of the Fisher criterion to represent the within-class variance, which is what the between-class variance is being maximized relative to.

  • Why is it said that LDA does not only work when the distributions are Gaussian?

    -The derivation of the LDA in the transcript does not rely on the Gaussian assumption for the class-conditional distributions, indicating that LDA can be well-defined and effective even when the underlying distributions are not Gaussian.

  • What is the significance of the threshold w0 in the context of classifying data points?

    -The threshold w0 is used to classify data points based on the projection defined by the weight vector 'w'. If the projection of a data point is greater than w0, it is classified as one class, and if it is less than or equal to w0, it is classified as the other class (a short code sketch of this rule follows the Q&A).

  • How does the transcript relate the concept of centroids to the discussion of class variance?

    -The centroids of the data, which are the means of the classes, play a crucial role in calculating the projected means and the variances, both within and between classes, which are central to the discussion of class variance.
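
As referenced in the threshold answer above, here is a minimal sketch of the projection-and-threshold rule, assuming binary labels 0/1 in NumPy arrays and a midpoint threshold (the lecture notes the midpoint is only exact for spherical classes; all names here are illustrative, not from the lecture):

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher/LDA direction w, proportional to S_W^{-1} (m2 - m1), for labels y in {0, 1}."""
    m1, m2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    # Within-class scatter: summed (not averaged) outer products, as in the lecture.
    S_w = ((X[y == 0] - m1).T @ (X[y == 0] - m1)
           + (X[y == 1] - m2).T @ (X[y == 1] - m2))
    return np.linalg.solve(S_w, m2 - m1)

def classify(X, y, X_new):
    """Project onto w and threshold at the midpoint of the projected class means."""
    w = fisher_direction(X, y)
    m1, m2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    w0 = 0.5 * (w @ m1 + w @ m2)         # threshold w0 (midpoint is a simplifying choice)
    return (X_new @ w > w0).astype(int)  # label 1 if the projection exceeds w0, else 0
```

Scaling the scatter by an overall constant would not change the resulting direction, which matches the lecture's remark that dividing by the number of data points can be skipped.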

Outlines

00:00

📊 Understanding Class Variance in Machine Learning

The speaker introduces the concept of class variance in the context of machine learning, focusing on the variance among class means. The explanation involves the computation of variance between the projected means of 'k' classes, emphasizing the importance of maximizing the distance between these means. The discussion simplifies to a two-class scenario to illustrate the concept of maximizing the variance between the centers of the classes relative to the within-class variance. The speaker also addresses the issue of unbounded solutions by introducing constraints on the norm of 'w', which is set to one to avoid scaling issues.
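
A small NumPy sketch of the quantity described above, under the assumption that the projected mean of class k is the projection of that class's mean onto w (names are illustrative, not from the lecture):

```python
import numpy as np

def between_class_spread(X, y, w):
    """Variance among the projected class means, one per class label appearing in y."""
    projected_means = np.array([w @ X[y == k].mean(axis=0) for k in np.unique(y)])
    # For two classes this is proportional to (m2 - m1)^2, the squared distance
    # between the projected means that the lecture sets out to maximize.
    return projected_means.var()
```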

05:01

🔍 Maximizing Between-Class Variance with Constraints

This paragraph delves deeper into maximizing the distance between class means, which is the first criterion for the model. The speaker discusses the potential problem of unbounded solutions when scaling 'w' without constraints, leading to arbitrarily large values. To counter this, a constraint is introduced to keep the norm of 'w' equal to one, ensuring that the solution remains numerically bounded. The direction of 'w' is identified as being proportional to the difference between the means of the two classes, highlighting the importance of this direction in class separation.
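
The constrained maximization summarized here can be written as a short Lagrange-multiplier step (the lecture states only the result; this is the standard derivation):

```latex
\max_{w}\; w^{\top}(\bar{m}_2 - \bar{m}_1)
\quad \text{subject to} \quad \|w\|^2 = 1,
\qquad
L(w, \lambda) = w^{\top}(\bar{m}_2 - \bar{m}_1) + \lambda\,(1 - w^{\top} w),
\qquad
\frac{\partial L}{\partial w} = (\bar{m}_2 - \bar{m}_1) - 2\lambda w = 0
\;\Rightarrow\;
w \propto \bar{m}_2 - \bar{m}_1 .
```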

10:04

📈 Projecting Data Points and Within-Class Variance

The speaker explains the process of projecting data points onto a line defined by 'w' and classifying them based on a threshold 'w0'. The focus then shifts to within-class variance, which is calculated by considering the projected distance of data points from the projected mean of their respective classes. The paragraph introduces the concept of the Fisher criterion, which is used to maximize the ratio of between-class variance to within-class variance, without making any assumptions about the underlying data distribution.
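
A short sketch of the two quantities being traded off in this segment, computed directly from the projections onto a candidate direction w (assumes binary labels 0/1; names are illustrative):

```python
import numpy as np

def fisher_ratio(X, y, w):
    """J(w): squared distance between projected class means over total within-class spread."""
    z = X @ w                                     # project every data point onto w
    m1, m2 = z[y == 0].mean(), z[y == 1].mean()   # projected class means
    s_within = ((z[y == 0] - m1) ** 2).sum() + ((z[y == 1] - m2) ** 2).sum()
    return (m2 - m1) ** 2 / s_within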

15:08

🧩 Deriving the Optimal Direction for 'w'

In this section, the speaker derives the optimal direction of 'w' by maximizing the ratio of between-class variance to within-class variance. Differentiating with respect to 'w' and setting the result to zero shows that SB·w always points along the difference of the class means, so the scalar factors can be dropped and 'w' turns out to be proportional to the within-class covariance inverse applied to the difference of the class means.
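
Spelled out, the differentiation summarized in this segment is the following standard step (consistent with the textbook treatment the lecture follows):

```latex
\frac{\partial}{\partial w}\,
\frac{w^{\top} S_B w}{w^{\top} S_W w} = 0
\;\Rightarrow\;
(w^{\top} S_W w)\, S_B w = (w^{\top} S_B w)\, S_W w ,
\qquad
\text{and since } S_B w \propto (\bar{m}_2 - \bar{m}_1):
\quad
w \propto S_W^{-1}(\bar{m}_2 - \bar{m}_1).
```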

20:10

📚 Comparing Motivations for Linear Discriminant Analysis (LDA)

The final paragraph compares two different motivations for deriving the linear discriminant analysis (LDA). The first motivation is based on maximizing the ratio of between-class variance to within-class variance, while the second is based on the assumption of Gaussian class-conditional densities. The speaker emphasizes that both approaches lead to the same direction for 'w', modulo scaling factors, and that LDA can be applied even when the underlying data distribution is not Gaussian, as it relies on sample means and variances rather than distribution assumptions.
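
A quick numeric check of this claim that both motivations give the same direction up to scale. The shared-covariance Gaussian data below is simulated purely for the demonstration and is not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[2.0, 0.6], [0.6, 1.0]])              # shared class covariance (assumed)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
X = np.vstack([rng.multivariate_normal(mu1, cov, 200),
               rng.multivariate_normal(mu2, cov, 200)])
y = np.repeat([0, 1], 200)

# Fisher direction: S_W^{-1} (m2 - m1), built from sample means and scatter only.
m1, m2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
S_w = ((X[y == 0] - m1).T @ (X[y == 0] - m1)
       + (X[y == 1] - m2).T @ (X[y == 1] - m2))
w_fisher = np.linalg.solve(S_w, m2 - m1)

# Gaussian-motivated direction: Sigma^{-1} (mu2 - mu1), using the true parameters.
w_gauss = np.linalg.solve(cov, mu2 - mu1)

# The two directions should be nearly parallel (cosine similarity close to 1).
cos = w_fisher @ w_gauss / (np.linalg.norm(w_fisher) * np.linalg.norm(w_gauss))
print(cos)
```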

Keywords

💡Class Variance

Class variance refers to the variance among the means of different classes in a dataset. In the context of the video, it is about calculating the variance of the projected means of classes to understand the dispersion of these means. The script discusses maximizing the distance between the projected means of two or more classes, which is a key step in discriminant analysis.

💡Projected Means

Projected means are the means of the classes when projected onto a certain direction in the feature space. The script uses the concept of projected means to discuss how the variance among these projected values can be maximized to differentiate between classes, which is central to the theme of class separation in machine learning.

💡Within-Class Variance

Within-class variance is the variance of data points within each class relative to their class mean. The script explains that this variance is an important consideration when trying to distinguish between classes, as it represents the spread of data points within each class and is part of the optimization criterion in linear discriminant analysis (LDA).

💡Between-Class Variance

Between-class variance is the variance calculated from the difference in means between different classes. The video script emphasizes maximizing this variance to enhance the separability of classes. It is a critical component of the Fisher criterion, which is used to find an optimal linear combination of features that maximizes class separability.
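
For the multi-class generalization the lecture alludes to, the between-class scatter is conventionally defined as follows (standard textbook form, stated here for reference):

```latex
S_B = \sum_{k=1}^{K} N_k\, (\bar{m}_k - \bar{m})(\bar{m}_k - \bar{m})^{\top},
\qquad
\bar{m} = \frac{1}{N}\sum_{n=1}^{N} x_n ,
```

where N_k is the number of points in class k and the second expression is the overall mean; for two classes this reduces, up to a constant factor, to the outer product of (m2 - m1) with itself.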

💡Fisher Criterion

The Fisher criterion, named after statistician Ronald Fisher, is a method used in linear discriminant analysis to find a linear combination of features that maximizes the ratio of between-class variance to within-class variance. The script discusses this criterion in the context of maximizing class separability without assuming Gaussian distributions.

💡LDA (Linear Discriminant Analysis)

LDA is a statistical technique used to find a linear combination of features that can be used to classify data into different categories. The script explains that LDA can be derived without assuming Gaussian distributions, which broadens its applicability beyond traditional assumptions.

💡Gaussian Assumption

The Gaussian assumption refers to the assumption that data follows a Gaussian (normal) distribution. While traditional LDA often relies on this assumption, the script points out that the method discussed does not require it, making it a more versatile approach for class separation.

💡Covariance Matrix

A covariance matrix is a matrix that contains the covariance (a measure of how much two random variables change together) between the variables of a dataset. In the script, the within-class covariance matrix is used to calculate the within-class variance and is part of the optimization process in LDA.
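
A minimal sketch of the within-class scatter matrix S_W used in that optimization, assuming integer class labels in y (names are illustrative, not from the lecture):

```python
import numpy as np

def within_class_scatter(X, y):
    """Within-class scatter S_W: summed scatter of each class about its own mean."""
    d = X.shape[1]
    S_w = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        S_w += (Xk - mk).T @ (Xk - mk)   # class-k scatter, not divided by N_k, as in the lecture
    return S_w
```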

💡Optimization

Optimization in the context of the video refers to the process of finding the best parameters (like the direction vector 'w') that maximize the objective function, which is the ratio of between-class variance to within-class variance. The script discusses the mathematical process of differentiating this function with respect to 'w' to find its maximum.

💡Threshold

A threshold is a value that separates the data into different classes based on the linear combination of features. In the script, the threshold 'w0' is used to classify data points into class one or class two based on the value of 'wTx', which is a critical part of the classification process.

💡Centroids

Centroids are the central points of a class in a dataset, often used to represent the class in clustering or classification tasks. The script uses the concept of centroids to illustrate how data points are distributed and how a separating hyperplane can be determined.

Highlights

Exploration of class variance as the variance among class means, emphasizing the importance of projected class means.

Introduction of the concept of maximizing distance between projected class means for a two-class scenario.

Generalization of the concept to 'k' classes, highlighting the maximization of variance among 'k' centers.

Discussion on the within-class variance, defined as the variance with respect to the class mean.

Simplification of the problem by starting with a two-class case before generalizing to multiple classes.

Explanation of the decision surface defined by wTx and the classification threshold w0.

Assumption of the means of classes C1 and C2 and the notation used for projected means.

The problem of unbounded solutions due to unrestricted scaling of 'w' and the proposed constraint.

The numerical approach to ensure 'w' is not unbounded by setting the norm of 'w' to one.

Derivation of 'w' being in the direction of m2 – m1 when only the between-class criterion is used; for spherical classes the threshold lies at the midpoint between the class means.

Illustration of class separation using Gaussian distributions and the significance of the 1σ contour.

The concept of centroids in data and their relation to the decision boundary.

Introduction of the Fisher criterion and its role in maximizing between-class variance relative to within-class variance.

Differentiation of the criterion with respect to 'w' to find the optimal direction.

The relationship between the Fisher criterion and the class-conditional densities, highlighting LDA's applicability beyond Gaussian distributions.

Final expression for 'w': proportional to the within-class covariance inverse applied to the difference of the class means.

Understanding J(w) as the ratio of between-class variance to within-class variance, aiming for maximization.
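
For readers who want to cross-check the hand-derived direction against a library implementation, a sketch using scikit-learn's LinearDiscriminantAnalysis is below. For two classes its fitted coef_ should be parallel, up to scale and sign, to the Fisher direction computed from the sample means and within-class scatter; this comparison is an illustrative addition, not part of the lecture:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def compare_with_sklearn(X, y):
    """Cosine similarity between the hand-computed Fisher direction and sklearn's coef_."""
    m1, m2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    S_w = ((X[y == 0] - m1).T @ (X[y == 0] - m1)
           + (X[y == 1] - m2).T @ (X[y == 1] - m2))
    w_fisher = np.linalg.solve(S_w, m2 - m1)

    w_sklearn = LinearDiscriminantAnalysis().fit(X, y).coef_.ravel()
    cos = w_fisher @ w_sklearn / (np.linalg.norm(w_fisher) * np.linalg.norm(w_sklearn))
    return cos   # expect a value close to +/-1, i.e. the directions agree up to scale
```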

Transcripts

play00:00

 Okay so when I say between  

play01:07

class variance I say it is the variance of the  class means so I will take the classes okay look  

play01:16

at the means of those classes and look at the  projected means of those classes and compute the  

play01:22

variance among the projected means okay suppose I  have “k” classes I can compute the variance among  

play01:27

those if I have two classes what will this amount  to maximizing the distance between the projected  

play01:34

right fits two classes it will be maximizing the  distance between them if it is k classes it will  

play01:40

be maximizing the variance among the k centers  right relative to the within class variance and  

play01:46

what would be the within class variance? For each class the variance with respect  

play01:50

to the class mean so that is what we already  computed that right but for each class right  

play02:00

so within class variance that essentially what  I'm looking at here right so let us just treat  

play02:08

the first condition alone all right so I will  just simplicity sake start off with a two class  

play02:39

case and then we can think of the generalization  to multiple classes, so I am going to have a  

play02:44

surface defined by wTx right so y = wT x if it's  greater than some w0 I am going to classify it as  

play02:52

class one just less than some w0 or less than or  equal to, I will classify it as class two.  

play03:14

 Sorry my font went too small. I am going to say  m1 bar and m2 bar are the means of C1 and C2 right and well  

play03:40

we know how to compute just like you do a there  and I am going to assume that when I write the mk  

play03:54

without the bar okay this the projected one okay.  So I should see the projection of the mean okay  

play04:05

in the direction wT okay so that is essentially  what this is so the reason I am using this funny  

play04:09

notation is in the textbook if this is bold it  is m1 if it is unbolded it is a projection but I  

play04:17

cannot write bold every time on the board. So I am just using the bar right then when you  

play04:22

read the book you can translate back and for  this you read this part alone so till that  

play04:35

part it is from Hastie Tibshirani Friedman the  ESL okay this part alone you do PRML (Pattern  

play04:42

Recognition and Machine Learning) by Bishop  the textbook reference is there on the so  

play04:48

what is my goal when I say I want to maximize  between class variance it is essentially to  

play05:01

maximize that quantity: wTm2 is the projection of m2  on w, wTm1 is a projection  

play05:09

of m1 on w I'm trying to maximize this quantity so  that is essentially my first criterion right.  

play05:21

The direction w that maximizes this right so  there should be some alarm bells ringing for  

play05:27

you what is the problem? If I do not have any  bounds on w, I can just arbitrarily scale my  

play05:36

w and get larger and larger values right, so  I will have to have some constraints assuming  

play05:55

summation over wi squared equal to one, so essentially the norm of w is one okay that is  

play06:00

an assumption will make frequently to make  sure that we do not get unbounded solutions  

play06:06

right. So this is numerically unbounded. Yeah  good question so you could impose a inequality  

play06:55

constraint saying that summation w square is less  than 1 but what we will think what do you think  

play06:59

will happen you are maximizing the value right  I am sorry you can just scale it so essentially  

play07:13

what will happen is you will scale it such that wi  hits 1 anyway so even if you are having a even if  

play07:19

you have the lesser than or equal to constraint  because you are maximizing over w you will hit  

play07:25

it you will essentially scale w till you hit  1. So you might as well leave it as equal to 1  

play07:30

right.  

play07:34

So you can solve this right but the take-home  message is that your w is going to be right,  

play07:46

so w will be proportional okay you add that  here right you take the derivative ‘w’ will  

play07:50

go and that will become w, so there will be  some constants here right but essentially you  

play07:55

are going to get w will be in the direction  of m2 – m1 right, so what does this mean?  

play08:08

Take the means right and again you can go back  and show that if it is spherical then the constant  

play08:14

will be half right so it will be the midpoint of  the line dividing that two means okay right.  

play08:26

So let us do it again so I have two classes  right I take the means right, so this will be  

play08:39

the direction of the projection let us say I  project everything on to this right this way this  

play08:48

will become class one that will become class  two okay does it make sense right yeah so in  

play09:01

this line and this line are actually parallel  to each other I know you really did not want me  

play09:14

to repeat the drawing but I think that you helped  okay, so I have class one I have class two right,  

play09:23

so I mean if you look at the data point  that comes to me so the people understand  

play09:28

when I say class one class two like this  do you know the direction what I mean.  

play09:31

So this is the Gaussian corresponding to class  one I am drawing the 1σ contour of that right  

play09:39

this is this is a likewise the 1σ contour of the  second Gaussian so the data point that comes to  

play09:44

me could be something like this right this could  be the training data that I am getting it will  

play09:56

be mixed up of + and – in this region right there  could be minuses here also okay I already  

play10:03

drew one, there could be minuses here there could  be pluses here because the Gaussian still does  

play10:08

extend beyond the contour I have drawn okay, the  contour is only the most probable region for the  

play10:13

data points to lie does not mean that outside  this contour the probability is 0 okay.  

play10:18

So this is essentially what it means so I am going  to get data like this and I am going to model it  

play10:22

I am modeling the Gaussian by these contours  ok now let us say that.  

play10:42

Roughly that these points are the centroids  of the data that I get roughly these are the  

play10:50

centroids of the data I get so what this tells  us is that can you join this by a straight line  

play11:00

okay and essentially you take direction  that is all right like this and project  

play11:16

all the data points to that right so you will  get all the data points lying here now fix up  

play11:30

threshold that what that is what I wrote here  as w0 pick a threshold such that above that it  

play11:37

is class 1 below then it is class 2 right. In this case in fact if this had been spherical  

play11:44

you can show that the threshold would lie  in the midpoint now we cannot because well  

play11:51

you can I would guess I mean depending under  special circumstances but now the point will  

play11:54

be somewhere here and all the data points is  projected above this I will say it is plus all  

play12:00

the data points are projected below this I will  say it is minus that makes sense right.  

play12:08

But then this is not what we are looking for right  we are missing something important what is that  

play12:15

the inter-class well I am sorry the within class  variance right so this is the inter class variance  

play12:24

within class variance is what we are missing.  So what we will do now start looking at that  

play12:58

right. So that is a projected mean these are the  

play13:03

projected data points belonging to class one okay  keeping in with the terminology we are using there so  

play13:19

I'm picking on all the data points training data  points which had class k right and looking at  

play13:25

the projected distance from the projected mean  this gives me the total within class variance  

play14:21

yeah where I'm going to maximize everything at  the end. So I am just ignoring the things that  

play14:30

do not affect the maximization of that okay, which  squared term this way so that is essentially this  

play14:40

is a projected data and that is a projected mean  and just taking the variance of that right is  

play14:46

exactly what we did that except that I have not  divided by the number of data points okay right.  

play14:53

So this criterion is called the Fisher criterion,  it is called the  

play15:00

“Fisher criterion” after Fisher who was a very  famous statistician who came up with LDA okay,  

play15:07

several decades ago so here I am going to do  something confusing so I am going to rewrite it  

play15:49

right so this is the between class covariance  matrix right. So if you think about it so what  

play15:57

I wanted was m2 -m1 what is m2 the projected  right so the projected one, so m2 will actually  

play16:05

be right so essentially I have - so I can take  out the wT and just have the square of the and  

play16:21

I am adding the w2 back in okay by doing wTSBw  okay. Now what about Sw?  

play17:08

So likewise so I have this as my right so S12  + S22 is essentially this, this is S1 right I  

play17:23

take out the w from there and this is S2 I  take out the w from there so that gives me the  

play17:27

wT Sww okay. So now what we want to do we want  to maximize this right we want to maximize the  

play17:40

between class variance relative to the within  class variance that is what we said right between  

play17:51

class variance is maximized relative to the  within class variance so that is between class  

play17:55

variance is within class variance I have to take  the ratio now I am maximizing this ratio.  

play17:59

So differentiate with respect to w differentiate  with respect to w and set it equal to zero all  

play18:15

right so this is what you buy u/v right so people  want to tell me what the differentiation will be  

play18:28

okay I will write it but you should recall all  of this childhood memories okay you should not  

play18:38

forget whatever you studied to get in here like  so the denominator in the thing will become zero  

play19:10

because I equated to zero already so when you  take the derivative of this you're going to  

play19:14

get some term in the denominator right. So that will go to zero so I will just have to  

play19:20

equate the two half’s in the numerator and I will  get this right so just refresh your derivatives  

play19:26

the only thing that I am pretty sure putting  everybody off is the fact that we are doing all  

play19:32

of this in the matrix notation right just practice  it makes life a lot easier do it a couple of times  

play19:42

right the best way to do it is try and write it  out in matrix form in gory detail okay do the  

play19:49

term by term the derivative of it and then look  at how it simplifies after you do the derivative  

play19:53

right then you will see the pattern and then you  will know exactly what we are writing it it's a  

play19:57

very simple things like there are quadratics so  you should know how to differentiate quadratics  

play20:02

that is the only thing that is throwing you off  right wTw is actually a quadratic in w right.  

play20:09

So that is the only thing so it becomes a  linear in w so that is all nothing more to it  

play20:23

actually if you think about it SBW okay, will  always be in the direction of right you already  

play20:34

saw that here when we had only the constraint on  SB right so here that the constraint was only on  

play20:43

SB, that is only on the between class variance; when  we had the constraint only on the between class  

play20:48

variance we ended up finding out that the solution  is going to be the direction of m2 - m1 okay.  

play20:54

And a little bit of work you can show that always  that SBw will be in the direction of m2 –m1 right  

play21:01

so I can actually drop that and replace that  with a vector proportional to m2 - m1 right,  

play21:08

so now it makes our life a lot easier right  I only have one w left so what about these  

play21:16

guys. They are all simplified to some kind of scalar  

play21:28

quantities right so finally what I will get  this w is not equal to but proportional to so  

play21:40

that is essentially what I will get so if I did  not have the Sw constraint what I got was “w”  

play21:46

as proportional to m2-m1 right. But now  if I am taking into account the within class  

play21:51

variance also then I will have to pay attention  to the within class covariance matrix.  

play21:56

So I will have to pay attention to the within  class covariance so that is basically all  

play22:01

there is to it okay but how does this relate to  this I see any relation between this and that  

play22:28

think about it that is basically what  we are doing there right. So inverse is  

play22:34

Sw inverse just using different notation  here right, so Sw inverse is just taking  

play22:40

the variance within the data right the  within class variance so if you remember  

play22:54

is the within class variance matrix right. So that gives me the inverse here and this is how I got  

play22:54

inverse here and then I have m2 - m1 and I have  µk - µl here, so essentially, modulo all of  

play23:01

these other non X related terms right so we are  essentially finding the same direction right so  

play23:10

whether you do it this way starting with that  is your objective function right between class  

play23:15

variance and within class variance or you start  off by saying that your class condition density  

play23:20

is Gaussian and then you are trying to find  out the separating hyper plane right.  

play23:26

So in both cases you end up with the same  direction modulo some scaling factors right,  

play23:33

so you can use either motivation for deriving it  but what is the nice thing about this motivation  

play23:47

we did not make any assumption about the class  conditional distribution the Gaussian assumption  

play23:51

is missing here right the Gaussian assumption is  missing and we worked only with sample means and  

play23:56

sample variance and so on so forth right. So it just tells you that LDA does not work only  

play24:03

when the distributions are Gaussian right it  is fine even when the underlying distribution  

play24:09

is not Gaussian, there is actually a well-defined  semantics to doing LDA right. People are with  

play24:17

me on that so far okay great, so any questions,  let us then move on to the next thing what  

play24:36

does J(w) represent I told you right. So I  want to look at the between class variance  

play24:40

relative to the within class variance right so  the numerator is the between class variance and  

play24:48

the denominator is the within class variance so  I'm trying to maximize the relative score.


Related tags
Machine Learning, Class Variance, Maximize Distance, Fisher Criterion, LDA, Gaussian Assumption, Data Modeling, Covariance Matrix, Projection Mean, Statistical Analysis