Week 3 Lecture 15 Linear Classification

Machine Learning - Balaraman Ravindran
4 Aug 2021 · 24:10

Summary

TLDR: The script discusses the transition from linear regression to linear classification methods, where the boundary separating classes is linear. It introduces discriminant functions, which assign a data point to the class whose function has the highest output, and highlights the limitations of using linear regression for classification, notably the issue of 'masking.' The script also covers basis transformations as a way to overcome these limitations, noting that at least K - 1 basis transformations are needed for K classes to avoid masking.

Takeaways

  • 🔍 The discussion transitions from linear regression to linear classification methods, emphasizing the concept of a linear boundary separating different classes.
  • 📊 Linear classification involves using a linear boundary to separate classes, which can be achieved through linear or nonlinear discriminant functions, as long as they result in a linear decision surface.
  • 🤖 Discriminant functions are introduced as a way to model each class, with the class having the highest function value being assigned to a data point.
  • 📈 The script explains that for a two-class problem, the separating hyperplane is found where the discriminant functions for each class are equal.
  • 🧠 Discriminant functions need not themselves be linear; nonlinear functions also work as long as some monotone transformation of them is linear, which still yields a linear decision surface.
  • 📚 Three approaches for linear classification are mentioned: using linear regression as a discriminant function, logistic regression, and linear discriminant analysis, which considers class labels.
  • 👥 The second class of methods discussed models the hyperplane directly, such as the perceptron algorithm, which is an example of a more direct approach to finding an optimal hyperplane.
  • 🔢 The script outlines a mathematical setup for classification with K classes, using indicator variables and linear regression to predict class probabilities.
  • ⚠️ A potential issue with using linear regression for classification is highlighted, known as 'masking,' where certain classes may never be predicted due to the dominance of other classes in the data.
  • 🔧 The concept of basis transformations is introduced as a method to overcome the limitations of linear models in classification, allowing for more complex decision boundaries.

Q & A

  • What is the main difference between linear regression and linear classification?

    -In linear regression, the response is a continuous value that is a linear function of the inputs. In contrast, linear classification uses a boundary that is linear to separate different classes, with the classification decision based on which side of the boundary the input falls on.

  • What is meant by a 'linear boundary' in the context of classification?

    -A 'linear boundary' refers to a separating surface, typically a hyperplane, that divides different classes in a feature space. This boundary is defined by a linear equation, meaning that it does not curve and can be represented as a straight line in two dimensions or a flat plane in three dimensions.

  • What is a discriminant function in the context of classification?

    -A discriminant function is a function associated with each class that helps in classifying a data point. If the discriminant function for 'class I' outputs a higher value than for all other classes for a given data point, the data point is classified as belonging to 'class I'.

  • How does the concept of 'masking' in linear regression for classification affect the classification outcome?

    -Masking occurs when a class's regression output is never the largest anywhere in the input space, so that class is never predicted. It typically arises when a class lies between other classes (for example, the middle class of three along one input dimension), and the straight-line fits for the flanking classes dominate everywhere.

  • Why might linear regression not be the best choice for classification in some cases?

    -Linear regression outputs are not constrained to lie between 0 and 1, so they cannot be interpreted directly as class probabilities. In addition, with three or more classes the fitted linear functions can exhibit 'masking,' where some classes are never predicted. These issues make linear regression less suitable than methods designed for classification, such as logistic regression.

  • What is the relationship between the number of classes and the required basis transformations for classification?

    -The rule of thumb is that if you have K classes in your input data, you need at least K - 1 basis transformations to avoid issues like masking and to ensure that each class has a chance to dominate the classification in some region of the input space.

  • How can logistic regression be considered as an alternative to linear regression for classification?

    -Logistic regression models the probability of the classes as a function of the inputs and is constrained to output values between 0 and 1, making it more suitable for classification tasks where the outputs are probabilities.

  • What is the purpose of using indicator variables in linear regression for classification?

    -Indicator variables are used to represent the class labels in a binary format (0 or 1) for each class. This allows the linear regression model to be fit to the data for each class separately, with the goal of predicting the expected value or probability of each class given the input features.

  • How does the perceptron algorithm differ from linear regression in the context of classification?

    -The perceptron algorithm is a type of linear classifier that directly models the hyperplane for classification, rather than using discriminant functions. It updates its weights based on misclassifications to iteratively find the optimal separating hyperplane.

  • What is the significance of the separating hyperplane in linear discriminant analysis?

    -In linear discriminant analysis, the class labels are used to derive a direction (or directions) that best separates the classes, similar in spirit to principal component analysis but supervised. The separating hyperplane then lies perpendicular to such a direction, placed between the classes.

  • How can one interpret the output of a linear regression model in the context of classification?

    -The output of a linear regression model for each class can be interpreted as the expected value of the class label given the input features. However, these outputs should not be directly interpreted as probabilities due to the lack of constraints in linear regression models.
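
To make the last answer concrete, here is the short argument from the lecture written out as an equation (Y_k is the 0/1 indicator for class k and G is the true class label):

    \mathbb{E}[Y_k \mid X = x] = 1 \cdot \Pr(G = k \mid X = x) + 0 \cdot \Pr(G \neq k \mid X = x) = \Pr(G = k \mid X = x)

Linear regression approximates this expectation with an unconstrained linear function of x, which is why its outputs can fall outside [0, 1] and cannot be read directly as probabilities.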

Outlines

00:00

📚 Introduction to Linear Classification

This paragraph introduces the concept of moving from linear regression to linear classification methods. The speaker explains that while linear regression models the response as a linear function of inputs, linear classification involves a boundary of separation between classes that is linear. The idea is to use a discriminant function for each class, where the class with the highest function output is chosen for classification. The speaker also touches on the concept of a separating hyperplane and hints at the possibility of nonlinear transformations that result in a linear separating surface.
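
As a minimal sketch of the discriminant-function idea in this paragraph, assume each class k has a linear discriminant δk(x) = wkᵀx + bk and the predicted class is the argmax; the weights below are invented for illustration, not taken from the lecture:

    import numpy as np

    # One linear discriminant per class: delta_k(x) = w_k . x + b_k
    W = np.array([[ 1.0, -0.5],      # class 0 weights (illustrative values)
                  [-0.3,  0.8],      # class 1 weights
                  [ 0.2,  0.1]])     # class 2 weights
    b = np.array([0.0, 0.5, -0.2])   # one bias per class

    def classify(x):
        deltas = W @ x + b             # evaluate every discriminant at x
        return int(np.argmax(deltas))  # pick the class whose delta is largest

    print(classify(np.array([2.0, 1.0])))

For two classes, the boundary is the set of points where δ1(x) = δ2(x), which for linear discriminants is the hyperplane (w1 - w2)ᵀx + (b1 - b2) = 0.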

05:02

🔍 Approaches to Linear Classification

The speaker outlines different approaches to linear classification, starting with linear regression on an indicator variable for each class, with the fitted regression used as that class's discriminant function. The paragraph also mentions logistic regression and linear discriminant analysis, which take the class labels into account when deriving directions for classification. The second class of methods directly models the hyperplane, with the perceptron and the optimal-hyperplane formulation highlighted as examples. The setup for classification is then established: a label space with K classes and one indicator variable per class.
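
As a sketch of the second family (modeling the hyperplane directly), here is the classic perceptron update rule for a two-class problem with labels in {-1, +1}; the toy data are invented and this is not the exact presentation used later in the course:

    import numpy as np

    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])  # toy points
    y = np.array([1, 1, -1, -1])                                        # labels in {-1, +1}

    w = np.zeros(X.shape[1])   # weights of the separating hyperplane
    b = 0.0                    # bias term
    for _ in range(100):                       # a fixed number of passes over the data
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:         # point is misclassified (or on the boundary)
                w += yi * xi                   # nudge the hyperplane toward the point
                b += yi
    print(w, b)                                # defines the decision rule sign(w.x + b)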

10:03

📈 Interpreting Linear Regression for Classification

This paragraph delves into the use of linear regression to predict class labels, emphasizing that the output of the regression can be interpreted as the expected value or probability of the class given the input. The speaker discusses the process of adding a bias term to the input and how the output vector is used to determine the class label through the argmax function. The limitations of linear regression in capturing the true probabilities are also highlighted, as the model is not constrained to output values between 0 and 1.
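
A small numpy sketch of this setup: build the one-of-K indicator matrix Y, prepend a column of ones to X for the bias, solve for the coefficient matrix B = (XᵀX)⁻¹XᵀY, and classify with argmax. The data are made up, and the pseudo-inverse stands in for the explicit inverse for numerical safety:

    import numpy as np

    X = np.array([[0.5], [1.0], [1.5], [4.0], [4.5], [5.0]])   # toy 1-D inputs
    labels = np.array([0, 0, 0, 1, 1, 1])                      # K = 2 classes

    K = 2
    Y = np.eye(K)[labels]                        # N x K indicator (one-of-K) matrix
    Xa = np.hstack([np.ones((len(X), 1)), X])    # prepend a 1 to each input for the bias

    Beta = np.linalg.pinv(Xa) @ Y                # least-squares solution (X^T X)^{-1} X^T Y

    def predict(x_new):
        f = np.hstack([1.0, x_new]) @ Beta       # vector of outputs f, one per class
        return int(np.argmax(f))                 # assign the class with the largest output

    print([predict(np.array([v])) for v in (0.7, 4.8)])   # -> [0, 1]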

15:10

🚧 Pitfalls of Linear Regression in Classification

The speaker identifies potential issues with using linear regression for classification, such as the inability to directly interpret the outputs as probabilities due to the lack of constraints in the model. An example is given where fitting a straight line to data points from two classes results in a line that does not accurately represent the class probabilities. The concept of 'masking' is introduced, where certain classes may never dominate the output, leading to incorrect classifications.
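
A sketch of the masking effect on invented 1-D data with three classes: fit linear regression on the indicator matrix as above, and notice that the middle class's output is never the largest anywhere, so it is never predicted:

    import numpy as np

    # Three classes laid out left / middle / right along a single input dimension
    X = np.concatenate([np.linspace(0, 2, 20), np.linspace(4, 6, 20), np.linspace(8, 10, 20)])
    labels = np.concatenate([np.zeros(20, int), np.ones(20, int), np.full(20, 2)])

    Y = np.eye(3)[labels]                              # indicator matrix
    Xa = np.column_stack([np.ones_like(X), X])         # bias column + x
    Beta = np.linalg.pinv(Xa) @ Y                      # linear regression on the indicators

    grid = np.column_stack([np.ones(200), np.linspace(0, 10, 200)])
    preds = np.argmax(grid @ Beta, axis=1)
    print(np.unique(preds))    # only classes 0 and 2 appear; class 1 is 'masked'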

20:12

🔧 Addressing Masking with Basis Transformations

The final paragraph addresses the problem of masking by suggesting the use of higher-order basis transformations. The speaker explains that regressing on the square of the input (x squared) can help recover the actual boundaries and avoid masking. The importance of having at least K - 1 basis transformations for K classes is emphasized, with the example of using cubic transformations to avoid masking in a four-class problem.
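
Continuing the sketch above, adding an x² column (a quadratic basis transformation, i.e. K - 1 = 2 basis functions for K = 3 classes) lets the middle class dominate in the middle of the range, removing the masking; the data are again invented:

    import numpy as np

    X = np.concatenate([np.linspace(0, 2, 20), np.linspace(4, 6, 20), np.linspace(8, 10, 20)])
    labels = np.concatenate([np.zeros(20, int), np.ones(20, int), np.full(20, 2)])
    Y = np.eye(3)[labels]

    # Basis transformation: regress on [1, x, x^2] instead of [1, x]
    Phi = np.column_stack([np.ones_like(X), X, X ** 2])
    Beta = np.linalg.pinv(Phi) @ Y

    gx = np.linspace(0, 10, 200)
    grid = np.column_stack([np.ones_like(gx), gx, gx ** 2])
    preds = np.argmax(grid @ Beta, axis=1)
    print(np.unique(preds))    # all three classes now appear somewhere: [0 1 2]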

Keywords

💡Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. In the context of the video, linear regression is initially discussed as a method for regression problems, but it is also extended to classification by considering the outputs as a linear function of inputs. The script mentions using linear regression to create discriminant functions for classification, where the response matrix is used to fit the model.

💡Classification

Classification is the task of predicting the category or class of an entity based on its features. The video discusses linear methods for classification, where the decision boundary between different classes is assumed to be linear. This is a key concept as it sets the stage for understanding how linear models can be used to separate data points into different classes.

💡Discriminant Function

A discriminant function is a mathematical function used to determine the class to which a new observation belongs. In the video, the concept of a discriminant function is introduced as a way to model a function for each class, and the class with the highest function output is chosen as the classification for a given data point. This is central to the theme of the video, as it explains how to assign class labels based on the output of these functions.

💡Hyperplane

A hyperplane is a flat affine subspace in a higher-dimensional space that is used as a decision boundary in classification problems. The script refers to the separating hyperplane as the linear boundary that divides the space into regions, each corresponding to a different class. This concept is crucial for understanding how linear classifiers work.

💡Indicator Variable

An indicator variable is a binary variable used in regression analysis to indicate the presence or absence of a condition or category. In the context of the video, indicator variables are used for one-of-K encoding, where each class is represented by a single binary variable in the regression model. This is exemplified in the script when discussing how to perform linear regression for classification by using indicator variables for each class.

💡Logistic Regression

Logistic regression is a statistical method for analyzing a dataset in which the response variable is categorical. Unlike linear regression, which predicts continuous outcomes, logistic regression predicts the probability of a certain class or event occurring. The video mentions logistic regression as an alternative approach to classification, which is more suitable for interpreting the output as a probability.
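
A minimal two-class logistic regression sketch, fit by plain gradient descent, just to show how the sigmoid keeps the outputs in (0, 1); the toy data and step size are invented, and this is not the formulation the course develops later:

    import numpy as np

    X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])   # toy 1-D inputs
    y = np.array([0, 0, 0, 1, 1, 1])                           # binary labels

    Xa = np.hstack([np.ones((len(X), 1)), X])    # bias column
    w = np.zeros(Xa.shape[1])

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(5000):                        # gradient descent on the log-loss
        p = sigmoid(Xa @ w)                      # predicted probabilities, always in (0, 1)
        w -= 0.1 * Xa.T @ (p - y) / len(y)       # gradient of the average negative log-likelihood

    print(sigmoid(Xa @ w).round(2))              # valid probabilities, unlike raw linear regression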

💡Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis is a method used to find a linear combination of features that characterizes or separates two or more classes of objects or events. The script refers to LDA as a technique similar to principal component regression but with the consideration of class labels to derive directions for classification.
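
A small sketch of the two-class Fisher/LDA idea, computing the direction w ∝ Sw⁻¹(μ1 − μ0) from labelled data and classifying by projecting onto it; the data are synthetic, and this is one standard way to present LDA rather than necessarily the derivation used in the lectures:

    import numpy as np

    rng = np.random.default_rng(0)
    X0 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class 0 samples
    X1 = rng.normal([3.0, 2.0], 1.0, size=(50, 2))   # class 1 samples

    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)   # within-class scatter

    w = np.linalg.solve(Sw, mu1 - mu0)        # Fisher discriminant direction
    threshold = w @ (mu0 + mu1) / 2           # cut halfway between the projected class means

    def classify(x):
        return int(w @ x > threshold)         # the decision boundary is a hyperplane normal to w

    print(classify(np.array([0.2, -0.1])), classify(np.array([2.8, 2.1])))   # -> 0 1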

💡Perceptron

The perceptron is an algorithm used to construct a linear classifier. It is a type of neural network and is one of the simplest classifiers in machine learning. The video mentions the perceptron as a classic approach for directly modeling the hyperplane that separates classes, which is a fundamental concept in understanding how to find an optimal separating surface.

💡Masking

Masking is a phenomenon in linear regression classification where some classes do not have a region in the input space where their output dominates. The script describes an example where, due to the linear nature of the regression, certain classes may not be represented adequately in the classification, which is an important issue to be aware of when using linear models for classification.

💡Basis Transformation

A basis transformation is a mathematical operation that changes the basis of a vector space. In the context of the video, basis transformations are used to create nonlinear relationships in the data by transforming the input variables (e.g., using x squared), which can help overcome issues like masking in classification and allow for better separation of classes.

💡Probabilities

In the video, probabilities refer to the likelihood of a data point belonging to a particular class. The script discusses the challenge of interpreting the outputs of linear regression as probabilities, as the model does not inherently provide probabilities. It suggests that the expected value predicted by linear regression could ideally represent the probability of class membership, but this is not the case without further adjustments.

Highlights

Introduction to linear methods for classification, emphasizing the difference from linear regression by focusing on a linear boundary for class separation.

Explanation of how a linear boundary in classification can be represented as a hyperplane, providing a geometric perspective on the classification problem.

Discussion on discriminant functions as a method for classification, where each class has a function and the class with the highest function value is chosen.

The concept that the discriminant functions, δ1 and δ2, determine the separating hyperplane where they are equal, and their relationship to class assignment.

The possibility of non-linear discriminant functions through monotone transformations that result in a linear separating surface.

Overview of different approaches for linear classification, including linear regression as a discriminant function, logistic regression, and linear discriminant analysis.

The perceptron algorithm as a classic method for directly modeling the hyperplane in classification, contrasting with discriminant function approaches.

The importance of considering the optimal hyperplane in classification and the methods for solving for it directly.

Setting up the classification problem with K classes and the corresponding indicator variables for each class.

The use of linear regression to predict the expected value of the indicator variables, which ideally should represent class probabilities.

The limitation of linear regression in interpreting the output as probabilities due to lack of constraints.

The issue of 'masking' in classification when using linear regression, where some classes may never dominate the output.

The strategy to overcome masking by using higher-order basis transformations in regression to allow for more complex decision boundaries.

The rule of thumb for the number of basis transformations needed based on the number of classes in the dataset.

The potential for masking even with quadratic transformations in datasets with four classes, necessitating cubic transformations.

The practical example of how regression on x squared can help recover actual class boundaries in a three-class problem with a single input dimension.

Transcripts

play00:00

 So we move on from  

play00:23

linear regression linear methods for regression  to linear methods for classification and right,  

play00:32

so far we have been looking at linear methods  for regression but I did tell you that you  

play00:36

could do “nonlinear regression” also by doing  appropriate basis transformations. So what do  

play00:43

I mean by linear methods for classification  linear regression you can understand right,  

play00:48

so the response is going to be a linear function  of the inputs right, so what do I mean by linear  

play00:53

classification? So when I am going to separate  the positive classes or when I am going to  

play01:11

separate two classes the boundary of separation  between the two classes will be linear.  

play01:16

So that is what I mean by linear classification.  So this boundary that I draw between two classes  

play01:23

will be linear, so you can think of when we did  look at an example in the first class where we  

play01:28

had drawn quadratics and phases and things  like that right, so but instead of that we  

play01:33

will assume that the classification surface or the separating hyperplane will be, here

play01:39

well, or the separating surface will be a hyperplane. So there are two classes of approaches

play01:57

that we will look at for classification for linear  classification and the first one is essentially  

play02:04

on modeling a discriminant function. So one rough way of thinking about it is to say  

play02:21

that I am going to have a function for each class  right and if the function for “class I” output for  

play02:32

“class I” is higher than for all the other classes I will classify the data point as belonging

play02:38

to “class I” right, so I am going to have some  function so for each class I will have a function  

play02:58

right and so depending on whichever is the highest  I will output it to that, so this is essentially  

play03:21

the idea behind discriminant functions all right  so I am going to have to figure out a way to learn  

play03:26

these δi’s okay so suppose, let us just keep it simple, think of a two class problem okay.

play03:35

Here is a question okay: think of a two class problem and I have δ1 and δ2 right, so where will

play03:46

my separating hyperplane be? Wherever δ1 = δ2 right, so when δ1 is greater than δ2 it

play03:57

is class 1, when δ2 is greater than δ1 it is class 2, and wherever they are equal it will be the

play04:05

boundary right, so this will essentially be okay, so if I need this to be a linear surface right,

play04:19

so what conditions should δ1, δ2 satisfy, should they be linear? Not necessarily, but

play04:31

okay this is a sufficient condition: if they are linear, yeah, the surface will be linear.

play04:34

So what else can they be they can be non  linear as long as I have some kind of a  

play04:42

monotone transformation of them which will become  linear okay, so we will see examples of this will  

play04:48

actually look at discriminant functions okay or  will yeah, so we look at the assumptions which  

play04:57

will appear to be, we are doing something  nonlinear heavily nonlinear but at the end  

play05:01

of the day you will find that the surface will be  linear okay the separating surface will be linear,  

play05:09

so we look at that as we go along right. So the few approaches that we look in this class  

play05:14

are essentially linear regression: you could do a linear regression and try to treat that as your

play05:21

discriminant function for each class; you could do, we talked about this in the very first class

play05:26

right or the 2nd class yeah, so where you could do  a linear regression on an indicator variable, so  

play05:34

that will give you a discriminant function, or you could do logistic regression or you could do linear

play05:41

discriminant analysis which is like principal  component regression but taking into account the  

play05:48

class labels; you can think of it as deriving directions which will be doing the classification.

play05:52

We will look at the three of those. The second class of methods, which we will come to, directly

play06:14

model the hyperplane, so it is related to this in some sense right, so if I give you

play06:18

the discriminant functions, I can always recover the hyperplane, but here instead of

play06:24

trying to do a class-wise discriminant function we will directly try to model a hyperplane okay,

play06:31

so this is second class of problems, so we look at  one classic approach for doing that which is the  

play06:37

perceptron right and we will also talk about some  more recent well founded ways of doing that. Which  

play06:46

is essentially looking at the question of what an optimal hyperplane is right and trying to solve

play06:51

for it directly, so these are the two approaches  we look at right, so this basically just setting  

play06:57

things up, so people remember the basic set  up for classification right.  

play07:03

So I am going to assume that I have  some space G which has K classes,  

play07:09

so I will, for convenience, index them as 1 to K. X is going to come from Rᵖ as

play07:31

before right and the output is going to come  from this space G okay, so that is our setup  

play07:37

and so if there are K classes I am going to have K indicator variables.

play08:08

Remember when we talked about one-of-K encoding: one of these K indicator variables will be

play08:13

one for any input right depending on  what class that data point belongs to  

play08:48

This is assuming that my x has been augmented with ones, so my β hat is

play09:11

equal to (XᵀX)⁻¹XᵀY; that is linear regression for you right, so I can just do linear regression

play09:19

on my response matrix, so β is capitalized  here because it is also a matrix right,  

play09:27

so this is capital β: you have one column for each of the classes right.

play09:35

So each class I have a set of β so I can produce  a vector of outputs f right given an input X by  

play10:02

essentially taking the product with the β  right that gives me a vector of outputs f  

play10:29

and finally the class label that is assigned to the data point is the argmax of the f's right,

play10:40

so I am going to get a vector of f's, one for each class right, and the one that I assign finally is

play10:44

the one that gives me the maximum output okay, so there is not any complex math here at all,

play10:50

so only bit of math here we already saw in the  very first linear regression fitting okay yeah,  

play10:59

because I wanted to add that I want to  make it a P plus one thing right.  

play11:02

So x is the input data point; I add a 1 to the front of it for the bias okay,

play11:16

so what does it mean what does this fk  of x mean, so what do these fk of x mean  

play11:36

No, that is fine, but is there any semantics you can associate with the fk? Yeah, so if you

play11:55

think about it, so whenever the input belongs to some class, let us pick a particular class, let

play12:02

us call it j right, or even make it more concrete, class 3 okay: whenever the input

play12:09

belongs to class 3 okay, y3 will be one; in the training

play12:19

data if you look at it, whenever the input  belongs to class 3 right y3 will be one.  

play12:25

So if you think about it if you look at the  expected output that you should get for a  

play12:33

particular x the expected output you should get  for particular x is okay how many what is the  

play12:43

average number of times it is going to be one  so I am going to see the x again and again and  

play12:48

again right whenever that the x belongs to class  3 the output will be 1 and the x does not belong  

play12:53

to class 3 the output will be 0 right I see many  times I see x okay, so what is the output I expect  

play13:02

it is the average of the outputs right, the  prediction should be the average of the outputs  

play13:06

does it make sense? So I have many x,x,x there  are different x they are the same x ok many times  

play13:17

I am getting x again and again so sometimes  it is class 3 sometimes it is not sometimes  

play13:21

it is not sometimes it is class three okay. So if you take the average of all of this outputs  

play13:28

what am I getting? probability that x is class 3  right if you take the average of those outputs I  

play13:42

am getting the probability that x is class 3 right  and we know that when I am trying to do the linear  

play13:48

regression what I am trying to predict is the  expected value right, ideally I should be trying  

play13:52

to predict the expected value of this but since it is linear you will not be able to get there but

play13:57

we are trying to do is probability of the class ah  that is a problem of using linear regression and  

play14:26

that is what I am coming to it right. So you cannot really interpret these as  

play14:29

probabilities because linear regression is not  constrained right, so we will come to that in  

play14:34

a minute, how to fix that in a minute, but this is what I am working up to; it is telling

play14:39

you the interpretation of what you want to do is  that it is a probability. So I really would like  

play14:44

to interpret this right, so the expected value  of yk given x you would ideally like it to be  

play15:10

probability that the output is K given the input  is x right, so this is ideally this is what you  

play15:17

want and the linear regression gives you hope of  getting there right and people sometimes still use  

play15:25

linear regression because it is easy to use. You do and other things we would have to think of  

play15:28

other ways of getting to it right I will come  to that in a minute, so before that I just want  

play15:33

to point out one other pitfall of using linear regression for classification right but is it

play15:48

clear, any questions? This is the same, remember the indicator variable thing, so it is either 0 or

play16:20

1 okay. What do you mean, how will the linear regression work? Right,

play16:27

so I can do linear regression, so I mean as  just as a method of using it right you can  

play16:32

see how it was going to work I am going to do  linear regression give me a minute huh.  

play16:38

What it means is we would ideally like it to  mean this it is not going to mean that okay,  

play16:44

I will just give just hold or I will do this  example and then you can come back and ask  

play16:47

me this question okay. So I am going to assume there is a single  

play17:02

input dimension, so let us say that okay so let us say that there are data points here that belong to

play17:30

one class okay data points here that belong to  another class right. So if you think about it  

play17:37

let us say this is this is encoded by pink right  so the training data right will look like this  

play17:45

right and 0 elsewhere for pink training data for  blue will look like this and 0 elsewhere.  

play18:03

So now if I try to fit a straight line to  this so what do you think will happen so  

play18:17

I will get a line that goes like that right I  will probably get a line that goes like that  

play18:32

right, so this is essentially what your outputs  will look like, so directly trying to interpret  

play18:40

this as probabilities is not a good idea obviously  right but you can see that wherever this is  

play18:47

greater than this okay that should probably  belong to class blue, where ever it is pink  

play18:52

is greater than blue it should belong to class  pink right. So at least this much you can conclude  

play18:57

from the output of the linear regression. So that is essentially how you would interpret  

play19:01

the output? So whenever one output is greater  than the other or greater than all the others you  

play19:05

will assume that it is the correct class directly  interpreting that as probability it is a problem,  

play19:10

so this is what you would like to do that  is what I said right but you do not want  

play19:14

to do this okay having said this let us see  how visible this color is, suppose I have  

play19:22

a 3 class problem okay, they are sitting in the middle like this okay, so the outputs for this will

play19:31

be somewhere there right. Now if I try to fit a straight line for this, what is going to happen?

play19:53

Remember the rest of the points are all sitting  here right they are a bunch of 0 here a bunch  

play19:58

of 0 here and a bunch of ones there, so I try to  do linear regression on this so I am going to get  

play20:04

that line like that I know what is the problem  with that blue and pink completely dominated,  

play20:12

there is no part in the input space, where brown  no part of the input space where brown actually  

play20:22

dominates right, the output of brown never  dominates anywhere yeah. So this will be right,  

play20:44

so this is essentially what your f1 f2  f3 will be so it turns out that for class  

play20:48

two, we will never output any input point as class two okay, so this problem is called

play21:02

this problem is called masking okay so this  is one thing which you have to be aware of  

play21:10

while you are doing linear regression for making your predictions okay. Is there any

play21:17

way to get over masking? Anything? So instead of looking at pairs, you just look

play21:29

at higher order basis transformations right  instead of regressing on x right that is what  

play21:35

we did here right. So instead of regressing on x, if I regress on x squared I am going to get different

play21:40

outputs okay today okay good return but next  time actual English not just coffee right.  

play21:57

So if I am going to do that essentially I  am going to get curves that look like that,  

play22:02

the interesting curve is this guy: how is this brown curve going to look okay,

play22:16

so these are the crossover points: anything to this side will be blue, anything to that side

play22:26

will be pink and anything in between will be brown  okay, you can see okay. So but remember the input  

play23:01

space is just on this line okay, so here this is  the output whatever is going up is the output, so  

play23:06

the input is only on this line okay, just a single dimensional input. So there is no region as such, it

play23:11

is only a line segment here so in this part of  the input space it will be blue this part it will  

play23:16

be brown, in this part it will be pink; that is almost ideal except there is a small error here.

play23:23

That is just a drawing error, so you can choose appropriate data points such

play23:31

that, with the quadratic transformation, if you regress on x squared, you can

play23:36

recover the actual boundaries okay so the rule of  thumb is if you have K classes in your input data  

play23:44

you need at least K - 1 basis transformations, so in fact with a lot of work you can show that even

play23:52

with x squared regression you will have masking  if you have four classes so in four classes you  

play23:59

have to regress on the cubic transformation  okay so that you can still get away with it.

Related Tags
Linear Classification, Machine Learning, Discriminant Functions, Regression Analysis, Data Modeling, Hyperplane Modeling, Perceptron Algorithm, Logistic Regression, Basis Transformation, Masking Effect, Probabilistic Interpretation