Week 3 Lecture 15 Linear Classification
Summary
TL;DR: The script discusses transitioning from linear regression to linear classification methods, explaining the concept of linear classification where the boundary between classes is linear. It introduces discriminant functions, which assign class labels based on the highest output value, and highlights the limitations of using linear regression for classification, such as the issue of 'masking.' The script also touches on the use of basis transformations to overcome these limitations and suggests that at least K-1 basis transformations are needed for K classes to avoid masking.
Takeaways
- 🔍 The discussion transitions from linear regression to linear classification methods, emphasizing the concept of a linear boundary separating different classes.
- 📊 Linear classification involves using a linear boundary to separate classes, which can be achieved through linear or nonlinear discriminant functions, as long as they result in a linear decision surface.
- 🤖 Discriminant functions are introduced as a way to model each class, with the class having the highest function value being assigned to a data point.
- 📈 The script explains that for a two-class problem, the separating hyperplane is found where the discriminant functions for each class are equal.
- 🧠 Discriminant functions don't have to be linear themselves; they can be nonlinear as long as some monotone transformation of them is linear, which is what guarantees a linear decision surface.
- 📚 Three approaches for linear classification are mentioned: using linear regression as a discriminant function, logistic regression, and linear discriminant analysis, which considers class labels.
- 👥 The second class of methods discussed models the hyperplane directly, such as the perceptron algorithm, which is an example of a more direct approach to finding an optimal hyperplane.
- 🔢 The script outlines a mathematical setup for classification with K classes, using indicator variables and linear regression to predict class probabilities.
- ⚠️ A potential issue with using linear regression for classification is highlighted, known as 'masking,' where certain classes may never be predicted due to the dominance of other classes in the data.
- 🔧 The concept of basis transformations is introduced as a method to overcome the limitations of linear models in classification, allowing for more complex decision boundaries.
Q & A
What is the main difference between linear regression and linear classification?
-In linear regression, the response is a continuous value that is a linear function of the inputs. In contrast, linear classification uses a boundary that is linear to separate different classes, with the classification decision based on which side of the boundary the input falls on.
What is meant by a 'linear boundary' in the context of classification?
-A 'linear boundary' refers to a separating surface, typically a hyperplane, that divides different classes in a feature space. This boundary is defined by a linear equation, meaning that it does not curve and can be represented as a straight line in two dimensions or a flat plane in three dimensions.
What is a discriminant function in the context of classification?
-A discriminant function is a function associated with each class that helps in classifying a data point. If the discriminant function for 'class I' outputs a higher value than for all other classes for a given data point, the data point is classified as belonging to 'class I'.
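For illustration, here is a minimal sketch in Python of the argmax rule; the weights and biases below are made-up values for a two-class case, not anything fitted from data:

```python
import numpy as np

# A minimal sketch of classification with discriminant functions.
# The parameters below are made-up values for illustration, not fitted ones.
w1, b1 = np.array([1.0, -0.5]), 0.2    # parameters of delta_1 (assumed)
w2, b2 = np.array([-0.3, 0.8]), -0.1   # parameters of delta_2 (assumed)

def classify(x):
    """Assign x to the class whose discriminant value is highest."""
    deltas = [w1 @ x + b1, w2 @ x + b2]
    return int(np.argmax(deltas)) + 1   # classes indexed 1..K

# The decision boundary lies where delta_1(x) == delta_2(x), i.e.
# (w1 - w2) @ x + (b1 - b2) == 0 -- a hyperplane.
print(classify(np.array([0.5, 0.5])))
```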
How does the concept of 'masking' in linear regression for classification affect the classification outcome?
-Masking occurs when the fitted outputs for some classes dominate another class's output everywhere, so the dominated class is never chosen for any input point. This typically happens when a class lies between other classes in the input space: its fitted linear output comes out nearly flat and is never the maximum, leading to systematically wrong classification decisions.
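To make this concrete, here is a small self-contained sketch with made-up 1-D data, three classes laid out left / middle / right on the line; fitting linear regression to the one-of-K indicators leaves the middle class masked:

```python
import numpy as np

# Made-up 1-D data: three classes laid out left / middle / right.
x = np.array([-4., -3., -2., -1., 1., 2., 3., 4.])
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])

X = np.column_stack([np.ones_like(x), x])    # bias column + input
Y = np.eye(3)[labels]                        # one-of-K indicator matrix
beta = np.linalg.lstsq(X, Y, rcond=None)[0]  # one least-squares fit per class

pred = (X @ beta).argmax(axis=1)
print(pred)   # [0 0 0 0 2 2 2 2] -- class 1 never wins anywhere: it is masked
```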
Why might linear regression not be the best choice for classification in some cases?
-Linear regression might not be suitable for classification when the classes are not linearly separable or when there is a high degree of overlap between classes. Additionally, the outputs from linear regression cannot be directly interpreted as probabilities, which is often desired in classification tasks.
What is the relationship between the number of classes and the required basis transformations for classification?
-The rule of thumb is that if you have K classes in your input data, you need at least K - 1 basis transformations to avoid issues like masking and to ensure that each class has a chance to dominate the classification in some region of the input space.
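Continuing the toy masking sketch from above, here is the fix: with K = 3 classes, adding quadratic terms (degree K - 1 = 2) un-masks the middle class:

```python
import numpy as np

# Same made-up data as in the masking sketch above.
x = np.array([-4., -3., -2., -1., 1., 2., 3., 4.])
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# Quadratic basis expansion [1, x, x^2]: with K = 3 classes, the rule of
# thumb calls for polynomial terms up to degree K - 1 = 2.
Phi = np.column_stack([np.ones_like(x), x, x**2])
Y = np.eye(3)[labels]
beta = np.linalg.lstsq(Phi, Y, rcond=None)[0]

pred = (Phi @ beta).argmax(axis=1)
print(pred)   # [0 0 0 1 1 2 2 2] -- the middle class is recovered
```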
How can logistic regression be considered as an alternative to linear regression for classification?
-Logistic regression models the probability of the classes as a function of the inputs and is constrained to output values between 0 and 1, making it more suitable for classification tasks where the outputs are probabilities.
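A minimal sketch of the contrast (the weight and bias are made-up numbers): the same linear score is unbounded under linear regression, while the logistic sigmoid squashes it into (0, 1):

```python
import numpy as np

# Made-up parameters of a 1-D linear score.
w, b = 1.5, -0.5

def linear_score(x):
    return w * x + b   # unbounded: can fall below 0 or above 1

def logistic_prob(x):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))   # always in (0, 1)

for x in [-3.0, 0.0, 3.0]:
    print(x, linear_score(x), logistic_prob(x))
```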
What is the purpose of using indicator variables in linear regression for classification?
-Indicator variables are used to represent the class labels in a binary format (0 or 1) for each class. This allows the linear regression model to be fit to the data for each class separately, with the goal of predicting the expected value or probability of each class given the input features.
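A minimal sketch of the mechanics on made-up data; the closed form β̂ = (XᵀX)⁻¹XᵀY from the lecture is applied to all K indicator columns at once:

```python
import numpy as np

# Mechanics of the indicator-response setup for K = 3 classes, on made-up
# inputs (labels are random here, purely to exercise the machinery).
rng = np.random.default_rng(0)
n, p, K = 12, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # 1s column = bias
labels = rng.integers(0, K, size=n)
Y = np.eye(K)[labels]                      # n x K matrix of 0/1 indicators

# Closed-form least squares, one column of coefficients per class:
# beta_hat = (X^T X)^{-1} X^T Y
beta = np.linalg.solve(X.T @ X, X.T @ Y)   # shape (p + 1, K)

f = X @ beta                # one output per class for every input
pred = f.argmax(axis=1)     # assign each point the class with largest output
print(pred)
```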
How does the perceptron algorithm differ from linear regression in the context of classification?
-The perceptron algorithm is a type of linear classifier that directly models the hyperplane for classification, rather than using per-class discriminant functions. It updates its weights on misclassified points, iteratively arriving at a separating hyperplane (not necessarily an optimal one).
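A minimal sketch of the classic perceptron update on toy linearly separable data (labels in {-1, +1}); each misclassification nudges the hyperplane toward the offending point:

```python
import numpy as np

# Toy linearly separable data, made up for illustration.
X = np.array([[1., 2.], [2., 1.], [2., 3.], [-1., -2.], [-2., -1.], [-3., -2.]])
y = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(2)   # normal vector of the hyperplane
b = 0.0           # bias
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi             # nudge the hyperplane toward xi
            b += yi
            errors += 1
    if errors == 0:                  # a full pass with no mistakes: done
        break

print(w, b)   # defines the separating hyperplane w @ x + b = 0
```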
What is the significance of the separating hyperplane in linear discriminant analysis?
-In linear discriminant analysis, the separating hyperplane is used to find a direction that maximizes the separation between classes. This method is similar to principal component analysis but takes into account the class labels to derive the directions that best separate the classes.
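As a quick usage sketch (assuming scikit-learn is available; the data is made up), LDA fits class-aware directions and yields a linear decision boundary:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy made-up data: two well-separated classes in 2-D.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])

# LDA uses the class labels to choose its discriminant directions.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict([[0.1, 0.0], [1.0, 1.0]]))   # expected: [0 1]
```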
How can one interpret the output of a linear regression model in the context of classification?
-The output of a linear regression model for each class can be interpreted as the expected value of the class label given the input features. However, these outputs should not be directly interpreted as probabilities due to the lack of constraints in linear regression models.
Outlines
📚 Introduction to Linear Classification
This paragraph introduces the concept of moving from linear regression to linear classification methods. The speaker explains that while linear regression models the response as a linear function of inputs, linear classification involves a boundary of separation between classes that is linear. The idea is to use a discriminant function for each class, where the class with the highest function output is chosen for classification. The speaker also touches on the concept of a separating hyperplane and hints at the possibility of nonlinear transformations that result in a linear separating surface.
🔍 Approaches to Linear Classification
The speaker outlines different approaches to linear classification, starting with linear regression on an indicator variable for each class, used as a discriminant function. The paragraph also mentions logistic regression and linear discriminant analysis, which are methods that use the class labels to derive directions for classification. The second class of methods discussed models the hyperplane directly, with the perceptron and optimal-hyperplane approaches being highlighted. The setup for classification is then explained, assuming a space with K classes and an indicator variable for each class.
📈 Interpreting Linear Regression for Classification
This paragraph delves into the use of linear regression to predict class labels, emphasizing that the output of the regression can be interpreted as the expected value or probability of the class given the input. The speaker discusses the process of adding a bias term to the input and how the output vector is used to determine the class label through the argmax function. The limitations of linear regression in capturing the true probabilities are also highlighted, as the model is not constrained to output values between 0 and 1.
🚧 Pitfalls of Linear Regression in Classification
The speaker identifies potential issues with using linear regression for classification, such as the inability to directly interpret the outputs as probabilities due to the lack of constraints in the model. An example is given where fitting a straight line to data points from two classes results in a line that does not accurately represent the class probabilities. The concept of 'masking' is introduced, where certain classes may never dominate the output, leading to incorrect classifications.
🔧 Addressing Masking with Basis Transformations
The final paragraph addresses the problem of masking by suggesting the use of higher-order basis transformations. The speaker explains that regressing on the square of the input (x squared) can help recover the actual boundaries and avoid masking. The importance of having at least K - 1 basis transformations for K classes is emphasized, with the example of using cubic transformations to avoid masking in a four-class problem.
Keywords
💡Linear Regression
💡Classification
💡Discriminant Function
💡Hyperplane
💡Indicator Variable
💡Logistic Regression
💡Linear Discriminant Analysis (LDA)
💡Perceptron
💡Masking
💡Basis Transformation
💡Probabilities
Highlights
Introduction to linear methods for classification, emphasizing the difference from linear regression by focusing on a linear boundary for class separation.
Explanation of how a linear boundary in classification can be represented as a hyperplane, providing a geometric perspective on the classification problem.
Discussion on discriminant functions as a method for classification, where each class has a function and the class with the highest function value is chosen.
The concept that the discriminant functions, δ1 and δ2, determine the separating hyperplane where they are equal, and their relationship to class assignment.
The possibility of non-linear discriminant functions through monotone transformations that result in a linear separating surface.
Overview of different approaches for linear classification, including linear regression as a discriminant function, logistic regression, and linear discriminant analysis.
The perceptron algorithm as a classic method for directly modeling the hyperplane in classification, contrasting with discriminant function approaches.
The importance of considering the optimal hyperplane in classification and the methods for solving for it directly.
Setting up the classification problem with K classes and the corresponding indicator variables for each class.
The use of linear regression to predict the expected value of the indicator variables, which ideally should represent class probabilities.
The limitation of linear regression in interpreting the output as probabilities due to lack of constraints.
The issue of 'masking' in classification when using linear regression, where some classes may never dominate the output.
The strategy to overcome masking by using higher-order basis transformations in regression to allow for more complex decision boundaries.
The rule of thumb for the number of basis transformations needed based on the number of classes in the dataset.
The potential for masking even with quadratic transformations in datasets with four classes, necessitating cubic transformations.
The practical example of how regression on x squared can help recover actual class boundaries in a three-class problem with a single input dimension.
Transcripts
So we move on from linear methods for regression to linear methods for classification. So far we have been looking at linear methods for regression, but I did tell you that you can do "nonlinear regression" as well, by applying appropriate basis transformations. Now, what do I mean by linear methods for classification? Linear regression you can understand: the response is a linear function of the inputs. By linear classification I mean that when I separate two classes, the boundary of separation between them will be linear. So this boundary that I draw between two classes will be linear. You can think back to the example we looked at in the first class, where we had drawn quadratics and other curved surfaces; instead of that, we will assume that the separating surface is a hyperplane.
There are two classes of approaches that we will look at for linear classification, and the first is essentially based on modeling a discriminant function. One rough way of thinking about it is to say that I am going to have a function for each class, and if the function for class i outputs a higher value than the functions for all the other classes, I classify the data point as belonging to class i. So for each class I will have a function, and whichever is highest decides the output. That is essentially the idea behind discriminant functions, so I am going to have to figure out a way to learn these δi's.
this δi’s okay so suppose let us just keep it simple think of a two class problem okay.
Here a question okay they think of a two class problem and I have δ1 and δ2 right so where will
be by separating hyper plane? Wherever δ1 = δ2 right, so when δ1 is greater than δ2 it
is class 1 when δ2 is greater than δ1 it is class 2 like wherever they are equal it will be this a
boundary right, so this will essentially be okay, so if I need this to be a linear surface right,
so what conditions should δ1 δ2 satisfy should they be linear not necessarily but
okay this is sufficient condition if you are linear yeah the surface will be linear.
So what else can they be they can be non linear as long as I have some kind of a
monotone transformation of them which will become linear okay, so we will see examples of this will
actually look at discriminant functions okay or will yeah, so we look at the assumptions which
will appear to be, we are doing something nonlinear heavily nonlinear but at the end
of the day you will find that the surface will be linear okay the separating surface will be linear,
so we look at that as we go along right. So the few approaches that we look in this class
are essentially linear regression you could do a linear regression and try to treat that as your
discriminate function it for each class you could do we talked about this in the very first class
right or the 2nd class yeah, so where you could do a linear regression on an indicator variable, so
that will give you a discriminant function or you could do logistic regression or it could do linear
discriminant analysis which is like principal component regression but taking into account the
class labels you will think of deriving directions and which will be doing the classification.
We will look at all three of those. The second class of methods, which we will come to later, models the hyperplane directly. It is related to the first class in some sense: if I give you the discriminant functions, I can always recover the hyperplane. But here, instead of fitting a class-wise discriminant function, we will directly try to model the hyperplane itself. We will look at one classic approach for doing that, which is the perceptron, and we will also talk about some more recent, well-founded ways of doing it, which essentially ask what an optimal hyperplane is and try to solve for it directly. So these are the two families of approaches we will look at. Now let us set things up; you remember the basic setup for classification.
I am going to assume that I have some output space G which has K classes; for convenience I will index them as 1 to K. X is going to come from Rᵖ as before, and the output is going to come from this space G. That is our setup. If there are K classes, I am going to have K indicator variables. Remember when we talked about one-of-K encoding: exactly one of these K indicator variables will be 1 for any input, depending on which class that data point belongs to.
Assuming that I have augmented x with ones, my estimate is β̂ = (XᵀX)⁻¹XᵀY, and that is just linear regression: I can do linear regression directly on my response matrix. β is capitalized here because it is also a matrix, with one column for each of the classes; for each class I have a set of coefficients. So I can produce a vector of outputs f, given an input x, by taking the product with β. That gives me a vector of outputs, one per class, and the class label assigned to the data point is the argmax over f: the class that gives the maximum output. There is no complex math here at all; the only bit of math involved we already saw in the very first linear regression fitting. Note that I want the input to be a (p + 1)-dimensional vector: x is the input data point, and I add a 1 to the front of it for the bias.
So what do these fk(x) mean? Is there any semantics you can associate with fk? Think about it this way. Pick a particular class; to make it concrete, say class 3. Whenever the input belongs to class 3, y3 will be 1 in the training data, and whenever it does not, y3 will be 0. Now suppose I see the same x again and again: sometimes it is class 3 and sometimes it is not. What output should I expect for that x? The average of the outputs. And if you take the average of all those 0s and 1s, what am I getting? The probability that x is class 3. We also know that when I do linear regression, what I am trying to predict is the expected value. So ideally I am trying to predict the expected value of this indicator, which is exactly the probability of the class; since the model is linear, though, you will not quite be able to get there. That is a problem with using linear regression, and I am coming to it.
You cannot really interpret these outputs as probabilities, because linear regression is not constrained; we will see how to fix that in a minute. But this is what I am working up to: the interpretation you want is a probability. Ideally, the expected value of yk given x should equal the probability that the output is k given that the input is x. That is what you would like, and linear regression gives you some hope of getting there; people sometimes still use linear regression for this because it is easy to use. To do better we will have to think of other ways of getting there, and I will come to that in a minute. Before that, I want to point out one other pitfall of using linear regression for classification. Is this clear? Any questions? Remember, this is the indicator-variable setup, so each response is either 0 or 1. [A student asks whether linear regression will really work here.] Simply as a method, yes, I can always run linear regression, and you will see how it is going to work; whether the output means what we would like it to mean is another matter. Hold that question; let me do this example, and then you can come back and ask it.
Assume a single input dimension, and say there are data points here that belong to one class and data points there that belong to another. Say the first class is encoded by pink: the pink training targets are 1 on the pink points and 0 elsewhere, and the blue targets are 1 on the blue points and 0 elsewhere. Now if I try to fit a straight line to each of these, what do you think will happen? I will get a line sloping one way for one class and the other way for the other. That is essentially what your outputs will look like. Directly trying to interpret these as probabilities is obviously not a good idea, but you can see that wherever the blue output is greater, the point should probably belong to class blue, and wherever pink is greater than blue, it should belong to class pink. At least this much you can conclude from the output of the linear regression. That is essentially how you interpret the output: whenever one output is greater than all the others, you take that to be the class. Directly interpreting it as a probability, which is what you would like, is the problem.
Having said this, let us see (I hope this colour is visible) what happens when I have a third class sitting in the middle, between the other two: its targets are 1 in the middle and 0 on either side. Now if I try to fit a straight line for this, what is going to happen? Remember, the rest of the points are all sitting out here: a bunch of 0s on the left, a bunch of 0s on the right, and a bunch of 1s in the middle, so the fitted line comes out essentially flat. What is the problem with that? Blue and pink completely dominate it: there is no part of the input space where the brown output is ever the largest. So with these f1, f2, f3, it turns out the middle class will never be predicted for any input point. This problem is called masking, and it is something you have to be aware of when you use linear regression for making your predictions. Is there any way to overcome masking? Yes: instead of regressing only on x, which is what we did here, use higher-order basis transformations. If I also regress on x squared, I am going to get different outputs.
If I do that, I essentially get curves instead of lines, and the interesting curve is the brown one: there are crossover points such that everything on one side is blue, everything on the other side is pink, and everything in between is brown. Remember that the input space is just this line; whatever goes upward is the output, and the input itself is a single dimension, just a line segment. So in this part of the input space the prediction is blue, in this part it is brown, and in this part it is pink, which is almost ideal; the small discrepancy here is just a drawing error. You can choose appropriate data points such that, with the quadratic transformation, regressing on x squared, you actually recover the true boundaries. The rule of thumb is that if you have K classes in your input data, you need at least K - 1 basis transformations. In fact, with a bit of work you can show that even with x-squared regression you will have masking if you have four classes; with four classes you have to regress on the cubic transformation as well before you can get away with it.