Week 3 Lecture 10 Subset Selection 1
Summary
TLDR: This script discusses linear regression, emphasizing its simplicity and efficiency despite potential drawbacks. It explores the trade-off between bias and variance, and introduces subset selection as a way to improve model interpretability and prediction accuracy. The discussion covers best subset selection, forward step-wise selection, and the value of reducing model complexity for better generalization and understanding of the data.
Takeaways
- 🔍 The script discusses linear regression and the assumption that the input data (X) can come from a set other than the real numbers, while the output (Y) is from the real numbers.
- 📊 The importance of encoding data and the input matrix X is highlighted, which can be of size N x (P+1) when an explicit intercept term is included, or N x P otherwise.
- 📉 Linear regression aims to minimize squared error, and the script explains the trade-off between bias and variance in the model, especially when using the least squares fit.
- 🔧 The concept of subset selection in linear regression is introduced to reduce variance and potentially increase prediction accuracy by selecting a subset of input variables.
- 📉 Subset selection can also improve model interpretability by simplifying the model to focus on a few important variables.
- 🤖 The script mentions the computational aspect of subset selection, noting that while it can be computationally expensive to determine the subset, it can lead to more efficient model computation afterward.
- 🔑 The script talks about the limitations of best subset selection due to its combinatorial nature and the lack of an inclusion property, which means one must redo the selection for each subset size.
- 🛠 It mentions the use of QR decomposition as a method to speed up the process of subset selection for a moderate number of variables.
- 🚀 The script introduces forward step-wise selection as a greedy method for feature selection, starting with the intercept and adding variables that maximize improvement in fit.
- 🔄 The possibility of a hybrid approach is suggested, where variables can be added and removed to optimize the model performance.
- 📚 The script notes that despite being greedy, forward step-wise selection can perform well in practice and is included in some statistical software packages.
Q & A
What is the assumption about the data in linear regression?
-In linear regression, it is assumed that the output data (Y) comes from the real numbers, while the input data (X) can come from other sets, given a suitable encoding. The input matrix X is of size N x (P + 1) when an explicit intercept term is included, or N x P when it is not.
What is the significance of adding a column of ones to the input matrix X?
-Adding a column of ones to the input matrix X is a way to include an intercept term in the linear regression model. This allows the model to fit lines that are not forced to pass through the origin.
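A minimal numpy sketch of this, with made-up toy numbers for illustration:

```python
import numpy as np

# Toy data: N = 5 observations of a single predictor (P = 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Prepend a column of ones so the model can fit an intercept (beta_0):
# the design matrix becomes N x (P + 1).
X = np.column_stack([np.ones_like(x), x])

# Least squares fit: beta[0] is the intercept, beta[1] the slope.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [0.14, 1.96] for this toy data
```

Without the ones column, the fitted line would be forced through the origin.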
Why is linear regression considered efficient and easy to solve?
-Linear regression is efficient and easy to solve because it involves simple calculations and has a closed-form solution. It runs quickly and is computationally less intensive compared to more complex models.
What are the drawbacks of using linear regression?
-The least squares fit can have relatively high variance, and the model carries residual bias whenever the assumed linear relationship between variables is not actually the right choice for the data.
What is the trade-off between bias and variance in linear regression?
-By adding constraints or reducing the number of variables in the model, one can reduce the variance of the estimator but at the cost of introducing some bias. This is a trade-off where a less biased estimator might have higher variance and vice versa.
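A small simulation can make this concrete. This sketch is illustrative, not from the lecture: the true coefficients, n, p, and noise level are all invented, and since the chosen subset happens to contain every truly nonzero coefficient, only the variance side of the trade-off shows up; dropping a variable that actually matters would show the bias side instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 10
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]   # only 2 of the 10 variables matter

full_b1, sub_b1 = [], []
for _ in range(1000):
    X = rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(size=n)
    # Estimate of beta_1 from the full 10-variable fit ...
    full_b1.append(np.linalg.lstsq(X, y, rcond=None)[0][0])
    # ... and from the fit restricted to the first two columns.
    sub_b1.append(np.linalg.lstsq(X[:, :2], y, rcond=None)[0][0])

# Both estimators are unbiased here (the subset contains the true
# model), but the subset estimator has visibly lower variance.
print(np.var(full_b1), np.var(sub_b1))
```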
What is subset selection in the context of linear regression?
-Subset selection refers to the process of choosing a subset of input variables to use for fitting the regression line. This can help to improve prediction accuracy and reduce variance by focusing on the most relevant variables.
Why is interpretability important in linear regression models?
-Interpretability is important because it allows for better understanding of the data and the relationships between variables. It helps in identifying which variables are most influential in the model.
What is the difference between best subset selection and forward step-wise selection?
-Best subset selection involves evaluating all possible combinations of variables to find the best subset for the model, while forward step-wise selection starts with an empty model and adds variables one by one based on their contribution to the model's performance.
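A brute-force sketch of best subset selection for one fixed subset size k (the helper names are hypothetical, and an intercept column, if any, is assumed to already be part of X):

```python
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def best_subset(X, y, k):
    """Exhaustively try all size-k column subsets: O(p choose k) fits."""
    p = X.shape[1]
    return min(itertools.combinations(range(p), k),
               key=lambda cols: rss(X[:, list(cols)], y))
```

Note that, as the surrounding answers discuss, the best size-k subset must be recomputed from scratch for each k; there is no inclusion property to exploit.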
Why is forward step-wise selection considered a greedy algorithm?
-Forward step-wise selection is considered greedy because it makes the locally optimal choice at each step by adding the variable that most improves the model's performance, without considering the global optimal solution.
What is the 'leaps and bounds' algorithm mentioned in the script?
-The 'leaps and bounds' algorithm is an efficient method for performing subset selection in linear regression. It is designed to quickly narrow down the search for the best subset of variables.
How can one determine when to stop adding variables in forward step-wise selection?
-One can stop adding variables in forward step-wise selection when the residual error no longer changes significantly, when the residual is (numerically) orthogonal to all remaining directions, or when the residual drops below a pre-set threshold and the prediction accuracy is deemed acceptable.
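One possible sketch of the greedy loop; the RSS-improvement threshold `tol` is just one of the stopping rules mentioned above, and the names and defaults here are assumptions:

```python
import numpy as np

def forward_stepwise(X, y, tol=1e-8):
    """Greedy forward selection: start from the intercept alone and
    repeatedly add the column that most reduces the residual error."""
    n, p = X.shape
    ones = np.ones((n, 1))          # the intercept is always included

    def fit_rss(cols):
        A = np.hstack([ones, X[:, cols]]) if cols else ones
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return r @ r

    selected = []
    current = fit_rss(selected)
    while len(selected) < p:
        # Among the variables not yet taken, find the best one to add.
        best_j, best_rss = None, current
        for j in range(p):
            if j in selected:
                continue
            cand = fit_rss(selected + [j])
            if cand < best_rss:
                best_j, best_rss = j, cand
        # Stop when no remaining variable improves the fit enough.
        if best_j is None or current - best_rss < tol:
            break
        selected.append(best_j)
        current = best_rss
    return selected
```

`forward_stepwise(X, y)` returns the chosen column indices in the order they were added.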
Outlines
📊 Linear Regression and Model Complexity
This paragraph discusses the fundamental concepts of linear regression, including the assumption that the input data (X) need not come from the real numbers while the output (Y) does. It explains the setup of the input matrix X, which is of size N x (P + 1) when an intercept is included or N x P otherwise. The focus is on minimizing squared error and the trade-off between bias and variance in model fitting. The paragraph also introduces the idea of reducing the number of variables by setting some coefficients to zero, which increases bias slightly but decreases variance.
🔍 Subset Selection for Model Interpretability and Efficiency
The second paragraph delves into subset selection as a method to enhance the interpretability and prediction accuracy of a linear regression model. It highlights the benefits of working with a smaller subset of variables, such as reduced computational complexity and the ability to identify the most influential variables. The paragraph also touches on the challenges of subset selection, including the lack of an inclusion property and the computational demands of combinatorial selection. It mentions the use of QR decomposition and the 'leaps and bounds' algorithm as methods to perform efficient subset selection.
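The clever QR-based bookkeeping for subset selection is not spelled out in the lecture, but as background, QR factorization is a standard way to solve a single least squares problem; a minimal sketch:

```python
import numpy as np

def lstsq_qr(X, y):
    """Solve min ||y - X beta||^2 via the reduced QR factorization:
    with X = QR, the problem reduces to the triangular system
    R beta = Q^T y (back substitution)."""
    Q, R = np.linalg.qr(X)          # Q: n x p with orthonormal columns
    # np.linalg.solve does not exploit R's triangularity, but is exact.
    return np.linalg.solve(R, Q.T @ y)
```

The leaps-and-bounds refinement, which prunes whole branches of the subset search, is beyond this sketch.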
🛠 Forward Step-wise Selection in Linear Regression
This paragraph introduces the concept of forward step-wise selection, a greedy approach to feature selection in linear regression. It starts with an intercept and iteratively adds the variable that provides the most significant improvement in model performance, measured by residual error. The process continues until no further improvement is observed or a predefined threshold is reached. The paragraph also discusses the potential for a hybrid approach that involves both adding and removing variables to optimize performance. It notes that despite being greedy, forward step-wise selection can be effective in many real-world datasets.
📚 Statistical Packages and Feature Selection Methods
The final paragraph comments on the prevalence of step-wise feature selection methods in statistical packages like R, which often include forward step-wise selection. It contrasts this with machine learning tools such as Weka, which do not offer this specific approach but provide alternative feature selection methods. The paragraph emphasizes the practical effectiveness of step-wise selection across various datasets, despite its simplicity and greediness.
Keywords
💡Linear Regression
💡Encoding
💡Input Matrix X
💡Intercept
💡Squared Error
💡Variance and Bias
💡Subset Selection
💡Interpretability
💡Forward Step-wise Selection
💡Residual
💡Orthogonality
Highlights
The inputs in linear regression need not come from the real numbers; other input sets can be handled with a suitable encoding.
The input matrix X can be of size N x (P + 1) when an explicit intercept is included, or N x P otherwise.
Linear regression is efficient and easy to solve but has drawbacks, such as the relatively high variance of the least squares fit.
The least squares fit provides a zero-bias solution if the linear model is the right choice.
By adding constraints, it's possible to trade off some bias for a significant reduction in variance.
Subset selection in linear regression can improve prediction accuracy by reducing the number of variables.
Fewer parameters in a model can lead to lower variance but may increase bias.
Subset selection can improve model interpretability by identifying important variables.
Subset selection can be computationally expensive due to the combinatorial nature of selecting variable subsets.
QR decomposition can be used to speed up subset selection in linear regression.
Leaps and bounds is an efficient algorithm for subset selection in linear regression.
Forward step-wise selection is a greedy method that adds variables one by one to improve the fit.
Forward step-wise selection can be used even when the number of variables exceeds the number of data points.
Greedy selection methods like forward step-wise can perform well in practice despite their simplicity.
Hybrid approaches to feature selection can combine adding and deleting variables for optimal performance.
Statistical packages like R often include step-wise feature selection due to its practical effectiveness.
Transcripts
Right, so we were looking at linear regression. We are assuming that X comes from R^P, but I told you that it does not necessarily have to come from the real numbers, and we talked about various ways in which you can encode the data. We assume that Y comes from R. Depending on the circumstances, the input matrix X might be of size N x (P + 1) or N x P: it will be N x (P + 1) when we have an explicit intercept, because for the β0 term we add a column of ones to the data, so instead of thinking of the input as a P-dimensional vector we think of it as a (P + 1)-dimensional vector. When I do not have the intercept, it is just a P-dimensional input. That is the basic setup, and we are essentially looking at minimizing a squared error.
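In symbols, the setup just described is the standard least squares problem; its closed-form solution (which requires XᵀX to be invertible, a caveat that resurfaces later in the lecture when p > n) is:

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\,(y - X\beta)^{\top}(y - X\beta),
\qquad
\hat{\beta} \;=\; (X^{\top}X)^{-1}X^{\top}y .
```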
We looked at simple linear regression, and we looked at the case where there are multiple inputs and how you can interpret that in terms of single-variable, univariate, regression; that is essentially what we did in the last class. In this class we will go on to look at slightly more complex things that you can do with linear regression.
Linear regression is great because it is so easy to solve: it is very efficient, it runs very quickly, and so on. But there are a couple of drawbacks. The first one, if you remember, I was mentioning in the last class as well: by doing this least squares fit you are getting a fit that has very low, very low variance. But it also has how much bias? Did I say that? In fact, I think people are hallucinating. What I did say is that if you do the least squares fit, and if linear happens to be the right choice, then least squares gives you the zero-bias solution. The least squares fit gives you a zero-bias solution, but the problem is that the variance can be relatively high. And it turns out that by not just doing the straightforward least squares fit, by doing more tricks with the models that we have, we can trade off a little bit of bias.

I am going to get a biased estimator for the line. I am still fitting straight lines; I have not done anything different. But the fit I get will no longer be a bias-free fit, and in exchange I can reduce the variance a lot more, by adding certain additional constraints to the problem. Essentially, what I would really like to do is reduce the number of variables that I am trying to fit. Instead of fitting P + 1 coefficients, if you can somehow reduce the number of variables, that is equivalent to setting some of the β to 0.
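A quick illustration of that equivalence, with toy data made up here:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, 0.0, 2.0, 0.0]) + rng.normal(size=50)

subset = [0, 3]                     # an example choice of variables
b_sub, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)

# Fitting on a subset of columns is the same as fitting all 5
# coefficients under the constraint that those outside the subset are 0.
beta = np.zeros(5)
beta[subset] = b_sub
print(beta)                         # three of the five betas forced to 0
```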
So if I can somehow set some of the β to 0, then essentially it means I have a lot fewer parameters that I am estimating, and the fewer parameters I am estimating, the lower the variance. But because I am now restricting the class of models that I am going to be looking at, since I have to set some coefficients to 0, my bias will go up slightly. This is assuming that straight lines are the right thing to fit; of course, there is always the residual bias that arises when the language of straight lines is not powerful enough, but we are not talking about that here. I am saying that even assuming straight lines are the right thing to do, if I force certain coefficients to be 0, I am increasing the bias of the estimator; but the variance will go down, because I have a lot fewer parameters to estimate. That is essentially what we are going to look at. What I mean by subset selection here is that I am going to select a subset of the input variables to use for fitting the line.
One reason is that we can reduce the variance significantly, and that is how we can improve the prediction accuracy of the model. That is one of the reasons we would like to do subset selection. Is there any other reason you can think of for wanting to work with a smaller subset? Less computation? Yes, although there is a trade-off there: you first have to do more computation to figure out what the subset is. So less computation is one answer, but there is yet another major advantage.
Yes, related to this: there are variables which could potentially have high noise, and they will end up with small coefficients. But if I tell you, okay, here is this model M and it has 135 coefficients, it becomes hard for you to even figure out what the model means. If instead I say, you gave me 135 variables, but these are the 4 important variables that I need for doing a linear fit, it is much easier for you to interpret what is going on. Interpretation is a very big component of any kind of data analytics you want to do; ultimately, what you are doing with machine learning is trying to understand the data. So one of the things you would like to have is interpretability. So the two motivations are interpretability and prediction accuracy. Now, there are many ways of doing this. The simplest kind of approach is essentially to take the name literally and try to select from subsets of features. Why is that a simplistic approach? Oh come on, this one is easy. It is combinatorial, exactly.
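To see why it is combinatorial: the number of possible subsets of p variables is

```latex
\sum_{k=0}^{p} \binom{p}{k} = 2^{p},
\qquad\text{e.g. } 2^{30} \approx 10^{9},\quad 2^{40} \approx 10^{12},
```

which is why exhaustive search stops being feasible beyond a few dozen variables, as the lecture notes next.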
Best subset selection essentially says: okay, first pick subsets of size one, then subsets of size two, subsets of size three, and so on. And it turns out, as you can see for yourself if you start playing around with some linear regression tools, that the one best variable need not figure in the best subset of size two. Best subset selection does not have a nice inclusion property; for it to have one, you need certain very nice conditions on the data set, and in general those do not hold. So you have to redo the whole search for every subset size. It is not enough to do it once; you cannot be greedy and say, I will take the best subset so far and just add the one best variable to it to find the next one. Essentially, you have to do a combinatorial selection.

People have come up with very clever ways of organizing this, in fact using QR decomposition to do things more rapidly. I am not going to go into the details, but up to 30 or 40 variables it scales reasonably well, in the sense that you can hope to get answers in your lifetime. If you go to a much larger number of variables, as in many problems of interest in text or image domains and the like, then there is no hope of doing an exhaustive search. People have actually come up with algorithms that do this kind of subset selection efficiently; there is one very interestingly named algorithm called leaps and bounds, which does pretty efficient subset selection. But that is more of an informational aside for you.
Next is forward step-wise selection. Any guesses what that is? Exactly: it is greedy. It tries to do best subset selection by being greedy. So what is the one feature you are sure you want? The column of all ones. You need the intercept, otherwise your line is forced to pass through the origin. So you start with the intercept, and what should the coefficient for the intercept be? We already looked at this before: the mean of the y's. So we have already fitted that. Now, starting from that variable, you add the next one: some x_i, chosen so that it gives you the best fit modulo the set you have already taken. You are not going to disturb the variables taken in earlier stages; having chosen variables up to step k, at stage k + 1 you add the one new variable that, among all the variables you could add, gives the maximum improvement in performance.

How will I measure performance? Some kind of residual error, say on test data. I keep doing this until a point where the error does not change much. Is there anything else I can use? Yes, the residual: we know that at the end of the full least squares fit, the residual is orthogonal to the space spanned by the x's, which means that individually, if I take any of the x's, the residual is orthogonal to that direction as well. So when you find that none of the directions that are left have any component along the direction of the residual, you can stop. That may take a long time, though, because it might happen only when you have the full least squares fit. Alternatively, you can stop when the residual drops below a certain threshold where you can say, okay, I am happy with the prediction accuracy I am getting; I will stop here.
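The orthogonality-based stopping test, sketched in code; the tolerance and the normalization are assumptions, not from the lecture:

```python
import numpy as np

def residual_is_orthogonal(X, r, remaining, tol=1e-6):
    """True when the residual r has (numerically) no component along any
    column not yet in the model, i.e. no direction left can improve the fit."""
    return all(
        abs(X[:, j] @ r) < tol * np.linalg.norm(X[:, j]) * np.linalg.norm(r)
        for j in remaining
    )
```

In practice, the threshold rule the lecturer mentions is simpler, since exact orthogonality typically holds only at the full least squares fit.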
There are many other ways of doing this. For instance, what would you do in the other direction? I would start with the fit that has all the variables and then keep dropping them one by one. One thing to note is that you can do this only if the number of data points is greater than the number of dimensions, so that you can actually find the full fit. If p > n then, as people pointed out last time, the formula itself that we are using is no longer valid, because the matrix will not be invertible, and we would have to think of other tricks for doing the fit. But forward step-wise selection you can do even if p is greater than n, because I am anyway fitting one direction at a time: I can keep adding one direction at a time until I reach n directions, at which point I should have a full least squares fit.

People have come up with all kinds of variants on this. One thing you could do is some kind of hybrid approach where you keep adding and deleting directions. If you remember, we talked about how greedy is not always the best way to grow things: you can grow up to a certain level, and then dropping one of the earlier dimensions might actually give you slightly better performance. So that is one possibility, a hybrid version of this. One thing I should point out here: you may think that forward step-wise selection, because it is greedy, is going to be much worse than best subset selection. It turns out that on many of the real data sets you work with, greedy selection is actually not a bad thing to do. So much so that many statistics packages like R have tools to do this. You would not find this in many of the machine learning tools like Weka; they do not have this kind of forward feature selection, they have other ways of doing feature selection, which we will talk about later. But statistical packages do have this step-wise addition of features, because it seems to work well on a variety of data sets.