Week 3 Lecture 10 Subset Selection 1

Machine Learning- Balaraman Ravindran
4 Aug 2021 · 15:49

Summary

TL;DR: This lecture discusses linear regression, emphasizing its simplicity and efficiency while noting its drawbacks. It explores the trade-off between bias and variance and introduces subset selection as a way to improve model interpretability and prediction accuracy. The discussion covers best subset selection, forward step-wise selection, and the value of reducing model complexity for better generalization and understanding of the data.

Takeaways

  • 🔍 The script discusses linear regression and the assumption that the input data (X) can come from a set other than real numbers, and the output (Y) is from the real numbers.
  • 📊 The importance of encoding data and the input matrix X is highlighted, which can be of size N x (P+1) when an explicit intercept term is included, or N x P otherwise.
  • 📉 Linear regression aims to minimize squared error, and the script explains the trade-off between bias and variance in the model, especially when using least squares fit.
  • 🔧 The concept of subset selection in linear regression is introduced to reduce variance and potentially increase prediction accuracy by selecting a subset of input variables.
  • 📉 Subset selection can also improve model interpretability by simplifying the model to focus on a few important variables.
  • 🤖 The script mentions the computational aspect of subset selection, noting that while it can be computationally expensive to determine the subset, it can lead to more efficient model computation afterward.
  • 🔑 The script talks about the limitations of best subset selection due to its combinatorial nature and the lack of an inclusion property, which means one must redo the selection for each subset size.
  • 🛠 It mentions the use of QR decomposition as a method to speed up the process of subset selection for a moderate number of variables.
  • 🚀 The script introduces forward step-wise selection as a greedy method for feature selection, starting with the intercept and adding variables that maximize improvement in fit.
  • 🔄 The possibility of a hybrid approach is suggested, where variables can be added and removed to optimize the model performance.
  • 📚 The script notes that despite being greedy, forward step-wise selection can perform well in practice and is included in some statistical software packages.

Q & A

  • What is the assumption about the data in linear regression?

    -In linear regression, it is assumed that the input data (X) can come from any set, not necessarily the real numbers, and the output data (Y) comes from the real numbers. The input matrix X can be of size N x (P + 1) when an explicit intercept term is included, or N x P when it is not.

  • What is the significance of adding a column of ones to the input matrix X?

    -Adding a column of ones to the input matrix X is a way to include an intercept term in the linear regression model. This allows the model to fit lines that are not forced to pass through the origin.

  • Why is linear regression considered efficient and easy to solve?

    -Linear regression is efficient and easy to solve because it involves simple calculations and has a closed-form solution. It runs quickly and is computationally less intensive compared to more complex models.

  • What are the drawbacks of using linear regression?

    -Linear regression can have high variance and potential bias if the model is not a good fit for the data. It also assumes a linear relationship between variables, which may not always be the case.

  • What is the trade-off between bias and variance in linear regression?

    -By adding constraints or reducing the number of variables in the model, one can reduce the variance of the estimator but at the cost of introducing some bias. This is a trade-off where a less biased estimator might have higher variance and vice versa.

  • What is subset selection in the context of linear regression?

    -Subset selection refers to the process of choosing a subset of input variables to use for fitting the regression line. This can help to improve prediction accuracy and reduce variance by focusing on the most relevant variables.

  • Why is interpretability important in linear regression models?

    -Interpretability is important because it allows for better understanding of the data and the relationships between variables. It helps in identifying which variables are most influential in the model.

  • What is the difference between best subset selection and forward step-wise selection?

    -Best subset selection evaluates all possible combinations of variables to find the best subset of each size, while forward step-wise selection starts with just the intercept and adds variables one by one based on their contribution to the model's performance.

  • Why is forward step-wise selection considered a greedy algorithm?

    -Forward step-wise selection is considered greedy because it makes the locally optimal choice at each step by adding the variable that most improves the model's performance, without considering the global optimal solution.

  • What is the 'leaps and bounds' algorithm mentioned in the script?

    -The 'leaps and bounds' algorithm is an efficient method for performing subset selection in linear regression. It is designed to quickly narrow down the search for the best subset of variables.

  • How can one determine when to stop adding variables in forward step-wise selection?

    -One can stop adding variables in forward step-wise selection when the residual error does not change significantly, indicating that adding more variables does not improve the model's performance, or when a pre-set threshold for prediction accuracy is met.

Outlines

00:00

📊 Linear Regression and Model Complexity

This paragraph discusses the fundamental concepts of linear regression, including the assumption that the input data (X) may not necessarily come from real numbers and the output (Y) is from the real numbers. It explains the setup of the input matrix X, which could be of size N x (P + 1) when an intercept is included or N x P otherwise. The focus is on minimizing squared error and the trade-off between bias and variance in model fitting. The paragraph also introduces the idea of reducing the number of variables to improve model performance by setting some coefficients to zero, which increases bias but decreases variance.

05:04

🔍 Subset Selection for Model Interpretability and Efficiency

The second paragraph delves into subset selection as a method to enhance the interpretability and prediction accuracy of a linear regression model. It highlights the benefits of working with a smaller subset of variables, such as reduced computational complexity and the ability to identify the most influential variables. The paragraph also touches on the challenges of subset selection, including the lack of an inclusion property and the computational demands of combinatorial selection. It mentions the use of QR decomposition and the 'leaps and bounds' algorithm as methods to perform efficient subset selection.

10:09

🛠 Forward Step-wise Selection in Linear Regression

This paragraph introduces the concept of forward step-wise selection, a greedy approach to feature selection in linear regression. It starts with an intercept and iteratively adds the variable that provides the most significant improvement in model performance, measured by residual error. The process continues until no further improvement is observed or a predefined threshold is reached. The paragraph also discusses the potential for a hybrid approach that involves both adding and removing variables to optimize performance. It notes that despite being greedy, forward step-wise selection can be effective in many real-world datasets.

15:12

📚 Statistical Packages and Feature Selection Methods

The final paragraph comments on the prevalence of stage-wise feature selection methods in statistical packages like R, which often include forward step-wise selection. It contrasts this with machine learning tools such as Weka, which may not offer this specific feature selection approach but instead provide alternative methods. The paragraph emphasizes the practical effectiveness of stage-wise selection across various datasets, despite its simplicity and potential drawbacks.

Keywords

💡Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. In the video, it is the central topic, with discussions on how to encode data, interpret results, and the challenges of fitting models to real-world data.
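
For reference, the model and its least squares solution can be written as follows (standard results, with X taken here to be the N x (P + 1) design matrix that already includes the intercept column):

    \[
    y = X\beta + \varepsilon, \qquad
    \hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^2 = (X^\top X)^{-1} X^\top y .
    \]

The closed form requires X^T X to be invertible, which is exactly what fails when p > n, as noted later in the transcript.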

💡Encoding

Encoding in the context of the video refers to the process of converting data into a format that can be used for analysis, such as transforming categorical data into numerical form for linear regression models. The script mentions various ways to encode data, which is crucial for preparing input variables for regression analysis.

💡Input Matrix X

The input matrix X represents the set of independent variables in a regression model. It can be of dimensions N x (P + 1), where N is the number of observations and the extra column holds the intercept term. The script discusses how the structure of X can affect the regression analysis, particularly when considering the inclusion of an intercept.

💡Intercept

The intercept in a linear regression model is the value at which the line crosses the y-axis when the x-values are zero. The script explains that when fitting a model with an intercept, a column of ones is added to the input matrix X, which allows the model to account for any non-zero starting point on the y-axis.
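
A minimal illustrative sketch (not from the lecture; it assumes NumPy and synthetic data with made-up coefficients) of how the column of ones is prepended to the inputs before solving the least squares problem:

    import numpy as np

    rng = np.random.default_rng(0)
    N, P = 50, 3
    X = rng.normal(size=(N, P))                       # N x P inputs
    y = 2.0 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(scale=0.3, size=N)

    # Add a column of ones so the model can have an intercept (beta_0);
    # the design matrix becomes N x (P + 1).
    X1 = np.column_stack([np.ones(N), X])

    # Ordinary least squares fit; beta[0] is the intercept.
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

    # With only the intercept column, the fitted coefficient is mean(y),
    # as the lecture points out.
    beta0_only, *_ = np.linalg.lstsq(np.ones((N, 1)), y, rcond=None)
    assert np.isclose(beta0_only[0], y.mean())

Without the column of ones, the fitted line is forced through the origin, which is rarely what is wanted.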

💡Squared Error

Squared error is a measure of the difference between the actual and predicted values in a regression model. The script mentions minimizing squared error as the objective in linear regression, which is achieved by adjusting the model's coefficients to best fit the data.

💡Variance and Bias

Variance and bias are key concepts in understanding the accuracy of a regression model. Variance refers to the spread of the model's predictions around the true value, while bias is the systematic error introduced by approximating a real-world problem with a simple model. The script discusses the trade-off between reducing variance (improving prediction accuracy) and increasing bias (simplifying the model).
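
For concreteness, the trade-off rests on the standard decomposition of an estimator's mean squared error (a textbook identity, not derived in the lecture): for any estimator of a quantity θ,

    \[
    \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
    = \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2 + \operatorname{Var}(\hat{\theta}),
    \]

so constraining the model, for example by forcing some of the β to 0, can increase the squared-bias term while decreasing the variance term, and the sum may still go down.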

💡Subset Selection

Subset selection is a method for choosing a subset of input variables to use in a regression model. The script explains that by selecting a smaller subset of variables, one can reduce the model's variance and potentially improve interpretability and prediction accuracy, although this may introduce some bias.

💡Interpretability

Interpretability in the context of regression models refers to the ease with which the model's results can be understood and explained. The script highlights the importance of interpretability for understanding the data and making the model's insights accessible to others.

💡Forward Step-wise Selection

Forward step-wise selection is a greedy algorithm for feature selection in regression models. It starts with an empty model and iteratively adds the variable that most improves the model's performance until no further improvement is possible. The script mentions this method as a practical approach to subset selection.

💡Residual

A residual is the difference between the observed value and the predicted value from a regression model. The script discusses how residuals can be used to assess the fit of the model and to decide when to stop adding variables in step-wise selection: once the residual has no component along any of the remaining directions, adding them cannot improve the fit.

💡Orthogonality

Orthogonality in the context of regression models refers to the condition where the residuals are perpendicular to the space spanned by the model's variables. The script explains that at the end of a least squares fit the residual is orthogonal to every direction included in the model, which is what characterizes the least squares solution.
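
This is just the first-order condition of the least squares problem (standard algebra, with X denoting whichever design matrix is actually used in the fit):

    \[
    \hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^2
    \quad\Longrightarrow\quad
    X^\top \big(y - X\hat{\beta}\big) = 0,
    \]

so the residual has zero inner product with every column of X. In forward step-wise selection, once the residual also has no component along any column not yet included, adding further variables cannot reduce the squared error.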

Highlights

The inputs in linear regression need not come from the real numbers; other sets can be used with a suitable encoding.

The input matrix X can be of size N x (P + 1) when an explicit intercept is included, or N x P otherwise.

Linear regression is efficient and easy to solve but has drawbacks, such as high variance with least squares fit.

Least squares fit can provide a zero-bias solution if the linear model is appropriate.

By adding constraints, it's possible to trade off some bias for a significant reduction in variance.

Subset selection in linear regression can improve prediction accuracy by reducing the number of variables.

Fewer parameters in a model can lead to lower variance but may increase bias.

Subset selection can improve model interpretability by identifying important variables.

Subset selection can be computationally expensive due to the combinatorial nature of selecting variable subsets.

QR decomposition can be used to speed up subset selection in linear regression.

Leaps and bounds is an efficient algorithm for subset selection in linear regression.

Forward step-wise selection is a greedy method that adds variables one by one to improve the fit.

Forward step-wise selection can be used even when the number of variables exceeds the number of data points.

Greedy selection methods like forward step-wise can perform well in practice despite their simplicity.

Hybrid approaches to feature selection can combine adding and deleting variables for optimal performance.

Statistical packages like R often include step-wise feature selection due to its practical effectiveness.

Transcripts

(00:00) Right, so we were looking at linear regression. We are assuming that X comes from R^P, though I told you that it does not necessarily have to come from the real numbers, and we talked about various ways in which you can encode the data. We assume that Y comes from R. Depending on the circumstances, the input matrix X might be N x (P + 1) or N x P: it is N x (P + 1) when we have an explicit intercept, that is, a β0 term, in which case a column of ones is added to the data, and instead of thinking of the input as a P-dimensional vector we think of it as a (P + 1)-dimensional vector; when I do not have the intercept it is just a P-dimensional input. That is the basic setup, and we are essentially looking at minimizing some squared error. We looked at simple linear regression, we looked at the case where there were multiple inputs, and we looked at how you can interpret that in terms of single-variable regression; univariate regression is essentially what we looked at in the last class.

(01:59) In this class we will go on to look at slightly more complex things that you can do with linear regression. Linear regression is great because it is so easy to solve: it is very efficient and runs very quickly. But there are a couple of drawbacks to linear regression. The first one: if you remember, I was mentioning in the last class that by doing this least squares fit you are actually getting a fit that has very low variance, and how much bias? Did I say that? In fact I think people are hallucinating. What I did say is that if you do the least squares fit, and if linear happens to be the right choice, then least squares gives you the zero-bias solution. So the least squares fit gives you a zero-bias solution, but the problem is that the variance can be relatively high. It turns out that by not just doing the straightforward least squares fit, but by doing more tricks with the data and with the models that we have, we can trade off a little bit of bias.

(03:30) I am going to get a biased estimator for the line. I am still fitting straight lines, I have not done anything different, but the fit I get will no longer be a bias-free fit; in return I can reduce the variance a lot more by adding certain additional constraints to the problem. Essentially, what I would really like to do is reduce the number of variables I am trying to fit. Instead of trying to fit P + 1 variables, if you can somehow reduce the number of variables, that is equivalent to setting some of the β to 0. If I can somehow set some of the β to 0, it essentially means I have a lot fewer parameters to estimate, and the fewer parameters I estimate, the lower the variance. But because I am now restricting the class of models I am going to be looking at, since I have to set some coefficients to 0, my bias will go up slightly. This is assuming that straight lines are the right thing to fit; of course there is always the residual bias that comes from the language of straight lines not being powerful enough. That bias is still there, and we are not talking about that. I am saying that even assuming straight lines are the right thing to do, if I force certain coefficients to be 0 I am increasing the bias of the estimator, but the variance will go down because I have a lot fewer parameters to estimate. That is essentially what we are going to look at, and what I mean by subset selection here is that I am going to select a subset of the input variables to use for fitting the line.

(05:24) So one thing is that we can reduce the variance significantly, and that is how we can improve the prediction accuracy of the model; that is one of the reasons we would like to do subset selection. Is there any other reason you can think of for wanting to work with a smaller subset? Less computation, yes, although you have to do more computation to figure out what the subset is in the first place, so less computation is one answer, but there is another. I think I heard somebody say it: there is yet another major advantage. Related to this, there are variables which could potentially have high noise and so will end up with a small coefficient. But if I tell you, here is this model m and it has 135 coefficients, it becomes hard for you to even figure out what the model means. If instead I say, you gave me 135 variables but these are the 4 important variables that I need for doing a linear fit, it is much easier for you to interpret what is going on. Interpretation is a very big component of any kind of data analytics you want to do; ultimately what you are doing with machine learning is trying to understand the data, so one of the things you would like to have is interpretability. So the first motivation is interpretability and the other is prediction accuracy. There are many ways of doing this; the simplest kind of approach is essentially to take this literally and try to select from subsets of features. Why is that a simplistic approach? Come on, it is easy: it is combinatorial, exactly.

(08:18) In best subset selection I would essentially say: first pick subsets of size one, then subsets of size two, then subsets of size three, and so on. It turns out, and you can see this yourself if you start playing around with some linear regression tools, that the variable that goes into the best subset of size one need not figure in the best subset of size two, so it does not have a nice inclusion property. For it to have a nice inclusion property you need certain very nice conditions on the data set, and in general it does not hold, so you have to redo the whole search for every size. You cannot be greedy; I cannot say I will take the best subset, add the one best variable to it, and get the next one. So essentially you have to do a combinatorial selection. People have come up with very clever ways of organizing this, in fact using QR decomposition to do things more rapidly. I am not going to go into the details of that, but up to something like 30 or 40 variables it scales well, in the sense that you can hope to get answers in your lifetime. If you go to a much larger number of variables, as in many problems of interest in text or image domains, then there is no hope of doing an exhaustive search. Still, people have actually come up with algorithms which do this kind of subset selection efficiently; there is one very interestingly named algorithm called leaps and bounds which does pretty efficient subset selection, but that is more of an informational thing for you.
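
To make the combinatorial cost concrete, here is a minimal brute-force sketch of best subset selection (an illustration, not code from the lecture; it assumes NumPy and a design matrix X whose intercept column is added inside the helper). It fits every subset of each size and keeps the one with the lowest residual sum of squares, so the total number of fits grows as 2^p, which is why exhaustive search is only feasible for a few tens of variables:

    import numpy as np
    from itertools import combinations

    def rss(Xs, y):
        # Least squares fit on the chosen columns; return residual sum of squares.
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r = y - Xs @ beta
        return float(r @ r)

    def best_subsets(X, y):
        """Best variable subset of each size, by exhaustive search (O(2^p) fits)."""
        n, p = X.shape
        ones = np.ones((n, 1))                      # intercept column
        best = {}
        for k in range(1, p + 1):
            scored = ((rss(np.hstack([ones, X[:, list(s)]]), y), s)
                      for s in combinations(range(p), k))
            best[k] = min(scored)                   # (rss, subset) with smallest rss
        return best

Note that, as the lecture stresses, the best subset of size k need not contain the best subset of size k - 1, which is why the search has to be redone for every size.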

(10:14) The next thing is forward step-wise selection. Any guesses what that is? Exactly: it is greedy. It is just trying to do best subset selection by being greedy. So you start off with the one feature you for sure want to have: the column of all ones. You need the intercept, otherwise your line has to pass through the origin, so you start with the intercept. What should the coefficient for the intercept be? We already looked at it before: the mean of the y's. That will be the coefficient, so we have already fitted that. Now you add the next variable, some x_i, such that it gives you the best fit modulo the set you have already taken; you are not going to disturb the variables you have taken in earlier stages. If you have chosen some variables up to step k, then at step k + 1 you add the new variable that, among all the variables you could add, gives the maximum improvement in performance. How will I measure performance? Some kind of residual error on the test data. I keep doing this until a point where the error does not change much. Is there anything else I can say? Yes: the residual becomes orthogonal to any of the other directions I could add. We know that at the end of the least squares fit the residual will be orthogonal to the space spanned by the x's, which means that if I take any of the x's individually, the residual will be orthogonal to that direction as well. So when I find that none of the directions that are left have any component along the direction of the residual, I can stop. But that may take a long time, because it might happen only when I have the full least squares fit; so instead I can stop when the residual drops below a certain threshold at which I am happy with the prediction accuracy I am getting. So there are many ways of doing that.
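
A minimal sketch of the greedy procedure just described (an illustration, not the lecture's code; it assumes NumPy, a design matrix X without the intercept column, and a simple stopping rule based on how much the residual sum of squares improves). It starts from the intercept, whose coefficient is the mean of y, and repeatedly adds the variable that most reduces the residual error:

    import numpy as np

    def forward_stepwise(X, y, tol=1e-6):
        """Greedy forward selection; returns the chosen column indices in order."""
        n, p = X.shape
        cols = [np.ones(n)]                            # start with the intercept (all ones)
        chosen, remaining = [], set(range(p))
        best_rss = float(np.sum((y - y.mean()) ** 2))  # intercept-only fit: beta0 = mean(y)
        while remaining:
            trials = []
            for j in remaining:
                Xs = np.column_stack(cols + [X[:, j]])
                beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                r = y - Xs @ beta
                trials.append((float(r @ r), j))
            rss, j = min(trials)               # candidate giving the biggest improvement
            if best_rss - rss < tol:           # stop when the error barely changes
                break
            best_rss = rss
            chosen.append(j)
            cols.append(X[:, j])
            remaining.discard(j)
        return chosen

Because each step only solves a least squares problem on the columns chosen so far, this also runs when p > n, which is the point made next in the lecture.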

(12:53) The other way round: what would you do in that case? I would start with the fit that has all the variables and then keep dropping them one by one. One thing to note is that you can do this only if the number of data points is greater than the number of dimensions, because only then can you actually find the full fit. If p > n then, as people pointed out last time, the formula we are using is itself no longer valid, because the matrix will not be invertible, so we would have to think of other tricks for doing the fit. But forward step-wise selection you can do even if p is greater than n, because I am anyway fitting one direction at a time; I can keep adding one direction at a time until I reach n directions, at which point I should have a full least squares fit.
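
For contrast, a correspondingly minimal sketch of the backward variant just mentioned (again an illustration rather than the lecture's code, and it assumes n > p so that the full fit exists): start from all variables and repeatedly drop the one whose removal increases the residual error the least:

    import numpy as np

    def backward_elimination(X, y, k_min=1):
        """Drop variables one at a time until only k_min remain (assumes n > p)."""
        n, p = X.shape
        keep = list(range(p))

        def rss(idx):
            Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in idx])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            r = y - Xs @ beta
            return float(r @ r)

        while len(keep) > k_min:
            # Drop the variable whose removal hurts the fit the least.
            _, worst = min((rss([j for j in keep if j != d]), d) for d in keep)
            keep.remove(worst)
        return keep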

(14:17) People have actually come up with all kinds of variants on this. One thing you could do is some kind of hybrid approach where you keep adding and deleting directions. If you remember, we talked about how greedy is not always the best way to grow things: you can grow up to a certain level, and then maybe dropping one of the earlier dimensions will actually give you slightly better performance. That is one possibility, so you could do some kind of hybrid version of this. One thing I should point out here: you may think that forward step-wise selection, because it is greedy, is going to be much worse than best subset selection. It turns out that on many of the real data sets you work with, greedy selection is actually not a bad thing to do, so much so that many statistics packages like R have a bit of machinery to do this. You would not find this in many of the machine learning tools like Weka; they would not have this kind of forward feature selection, they have other ways of doing feature selection which we will talk about later. But statistical packages actually have this stage-wise addition of features because it seems to work well on a variety of data sets.


Related tags

Linear Regression, Model Selection, Feature Subset, Data Analysis, Bias-Variance, Machine Learning, Greedy Selection, Statistical Methods, Predictive Accuracy, Interpretability, Efficiency