Week 3 Lecture 11 Subset Selection 2

Machine Learning - Balaraman Ravindran
4 Aug 2021 (23:43)

Summary

TLDR: The script discusses 'forward stage wise selection', a method for feature selection in regression models where variables are added one at a time, each chosen to predict the residual error left by the previous stage. It highlights the method's computational efficiency: each stage needs only a univariate regression and reuses the coefficients already fitted, although forward stepwise selection may converge in fewer stages. The script then transitions to 'shrinkage methods', emphasizing their mathematical soundness, and introduces ridge regression, which adds a penalty on coefficient size to reduce model variance. The explanation includes the rationale for not penalizing the intercept and the practice of centering the inputs to eliminate β0 from the optimization. The summary concludes with the benefits of ridge regression in ensuring numerical stability and solvability.

Takeaways

  • 🔍 The script discusses a method called 'forward stage wise selection' for variable selection in regression models, where at each stage a variable most correlated with the residual is added to the predictor.
  • 📉 The process involves starting with a single variable most correlated with the output, regressing the output on that variable, and then iteratively adding new variables that are most correlated with the current residual.
  • 🔧 The advantage of forward stage wise selection is that it requires only a univariate regression at each stage and reuses all earlier coefficients, making it computationally efficient, even though forward stepwise selection may converge in fewer stages.
  • 🔄 However, the coefficients obtained in stage wise selection may not be the same as those from a single multivariate regression on all the selected variables, which would typically give a somewhat better fit; stage wise selection is preferred because it saves computation.
  • 🏰 The script then introduces 'shrinkage methods' as an alternative to subset selection; these aim to shrink some parameters towards zero rather than setting them exactly to zero.
  • 🔬 Shrinkage methods are based on an optimization formulation that allows for reducing the coefficients of unnecessary variables, ideally to zero, to improve prediction accuracy and interpretability.
  • 📊 The concept of ridge regression is explained, which involves adding a penalty on the size of the coefficients to the usual objective function of minimizing the sum of squared errors.
  • 🎯 The purpose of the penalty in ridge regression is to reduce the variance of the model by constraining the size of the coefficients, preventing them from becoming very large and causing overfitting.
  • 📐 The script explains that ridge regression modifies the normal least squares problem by adding a squared (L2) norm penalty on the coefficients, which amounts to adding a 'ridge' of size λ along the diagonal of XᵀX.
  • 🔑 The script points out that not penalizing the intercept (β0) is important to ensure that simple shifts in the data do not change the fit, and suggests centering the data to handle this.
  • 🧩 The script concludes by highlighting that ridge regression makes the problem numerically well behaved by ensuring that XᵀX + λI is invertible, and also serves as a foundation for understanding a broader class of shrinkage problems.

Q & A

  • What is forward stage wise selection in the context of the script?

    -Forward stage wise selection is a method where at each stage the variable most correlated with the current residual is selected and the residual is regressed on that variable. This process continues, adding each new variable to the predictor so that it accounts for the error left by the previously selected variables.

  • What is the purpose of picking the variable most correlated with the residual in forward stage wise selection?

    -The purpose is to predict the unaccounted portion of the output, known as the residual. By selecting the variable most correlated with the residual, the model attempts to minimize the error and improve the prediction accuracy at each stage.

  • How does the predictor evolve in forward stage wise selection?

    -In forward stage wise selection, the predictor evolves by sequentially adding new variables that are most correlated with the current residual. Each new variable comes with a coefficient determined by regressing the residual on that variable, and this coefficient is used to update the predictor.
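    A minimal NumPy sketch of the stage wise loop described above, assuming the columns of X and the output y have already been centred; the function name forward_stagewise and the stopping rule are illustrative, not from the lecture:

        import numpy as np

        def forward_stagewise(X, y, n_steps=None):
            # At each stage: pick the unused column most correlated with the
            # current residual, regress the residual on that single column
            # (univariate fit), record the coefficient, update the residual.
            n, p = X.shape
            n_steps = p if n_steps is None else n_steps
            beta = np.zeros(p)
            residual = y.astype(float).copy()
            used = set()
            for _ in range(min(n_steps, p)):
                # correlation (up to scale) of each unused column with the residual
                scores = [abs(X[:, j] @ residual) / np.linalg.norm(X[:, j])
                          if j not in used else -np.inf for j in range(p)]
                j = int(np.argmax(scores))
                b_j = (X[:, j] @ residual) / (X[:, j] @ X[:, j])  # univariate coefficient
                beta[j] = b_j
                residual = residual - b_j * X[:, j]               # residual for the next stage
                used.add(j)
            return beta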

  • What is the advantage of forward stage wise selection over forward stepwise selection?

    -At each stage, forward stage wise selection performs only a univariate regression (of the current residual on one variable) and keeps the coefficients from previous stages intact, so each stage is computationally cheap. Forward stepwise selection, by contrast, must redo a multivariate regression over all selected variables every time a new variable is added; it may converge in fewer stages, but stage wise selection saves a great deal of computation.

  • Why might the coefficients from a stage wise selection process differ from those of a full linear regression with all variables?

    -The coefficients may differ because in stage wise selection, the model is built incrementally, and each variable's effect is considered in the context of the previously added variables. In a full linear regression, all variables are considered simultaneously, which can lead to a different distribution of influence among the variables and, consequently, different coefficients.
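    Continuing the forward_stagewise sketch above, with illustrative synthetic data (not from the lecture), the difference can be checked directly:

        import numpy as np
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 4)); X -= X.mean(axis=0)            # centred inputs
        y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=100)
        y -= y.mean()
        beta_stage = forward_stagewise(X, y)                          # from the sketch above
        beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)             # all variables at once
        # beta_stage and beta_full generally differ; the full regression fits slightly better.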

  • What is the primary goal of shrinkage methods in regression?

    -The primary goal of shrinkage methods is to reduce the size of the coefficients in a regression model. This is achieved by imposing a penalty on the coefficients, which encourages the model to shrink unnecessary coordinates towards zero, thereby improving prediction accuracy and reducing overfitting.

  • What is ridge regression, and how does it relate to shrinkage methods?

    -Ridge regression is a type of shrinkage method that introduces a penalty on the size of the coefficients in the regression model. It adds a term to the objective function that penalizes large coefficients, effectively shrinking them towards zero. This helps in reducing the variance of the model and improving its generalizability.
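    In standard notation (consistent with the lecture's use of β0, βj and λ, though the exact board notation is not captured in the transcript), the penalized objective is:

        \hat{\beta}^{\text{ridge}}
          = \arg\min_{\beta_0,\,\beta}\;
            \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
            + \lambda \sum_{j=1}^{p} \beta_j^{2}, \qquad \lambda \ge 0 .

    Note that the penalty sum runs from 1 to p, so β0 is not penalized.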

  • Why is β0 often not penalized in ridge regression?

    -β0, the intercept, is not penalized so that simple shifts (translations) of the data do not change the fit. If β0 were penalized, shifting all the outputs upward would force the intercept to stay small, which would change the slope of the fitted line instead of simply shifting it, an undesirable change to the fit itself.

  • How does centering the inputs and outputs affect ridge regression?

    -Centering the inputs and outputs allows for the elimination of β0 from the optimization problem in ridge regression. By subtracting the mean from the Y values and the columns of X, the model can be fit without an intercept, simplifying the process and ensuring the fit passes through the origin.
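    A minimal NumPy sketch of this recipe (centre, solve, then recover the intercept); the names ridge_fit and ridge_predict are illustrative, not from the lecture:

        import numpy as np

        def ridge_fit(X, y, lam):
            # Centre inputs and output, then solve (Xc'Xc + lam*I) beta = Xc'y.
            x_mean, y_mean = X.mean(axis=0), y.mean()
            Xc, yc = X - x_mean, y - y_mean
            p = Xc.shape[1]
            beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
            return y_mean, x_mean, beta        # beta0 is estimated as the mean of y

        def ridge_predict(X_new, beta0, x_mean, beta):
            # New inputs are centred with the training means before applying beta.
            return beta0 + (X_new - x_mean) @ beta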

  • What is the significance of the λ parameter in ridge regression?

    -The λ parameter in ridge regression determines the strength of the penalty on the coefficients. A larger λ value results in more shrinkage of the coefficients, while a smaller λ value allows for larger coefficients. The choice of λ is crucial as it affects the trade-off between bias and variance in the model.
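    One common (illustrative) way to choose λ is by validation error, reusing the ridge_fit/ridge_predict sketch above on synthetic data:

        import numpy as np
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 10))
        y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)
        X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]
        val_err = {}
        for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
            beta0, x_mean, beta = ridge_fit(X_tr, y_tr, lam)
            pred = ridge_predict(X_va, beta0, x_mean, beta)
            val_err[lam] = np.mean((y_va - pred) ** 2)
        best_lam = min(val_err, key=val_err.get)  # larger lambda: more shrinkage, more bias, less variance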

  • How does ridge regression address the issue of multicollinearity in the data?

    -Ridge regression addresses multicollinearity by penalizing the coefficients, which prevents large coefficients on highly correlated variables from cancelling each other out. In addition, the λI term added to XᵀX guarantees that XᵀX + λI is non-singular and well conditioned, improving the numerical stability of the regression.
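    A small NumPy illustration (not from the lecture) of how the λI term repairs an ill-conditioned XᵀX produced by two nearly collinear columns:

        import numpy as np
        rng = np.random.default_rng(1)
        x1 = rng.normal(size=100)
        X = np.column_stack([x1, x1 + 1e-8 * rng.normal(size=100)])    # almost identical columns
        print(np.linalg.cond(X.T @ X))                    # huge: inversion is numerically fragile
        print(np.linalg.cond(X.T @ X + 1.0 * np.eye(2)))  # modest: well behaved after adding lambda*I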

Outlines

00:00

🔍 Forward Stage Wise Selection Process

This paragraph introduces the concept of forward stage wise selection in regression analysis. It explains the iterative process of selecting variables based on their correlation with the residual error from the previous stage's prediction. The method starts with the variable most correlated with the output, regresses the output on it to obtain a residual, and then selects the next variable most correlated with this residual, so that each new variable predicts the error left by the previous ones. Forward stepwise selection may converge in fewer stages, but the advantage of this method is that it simplifies the computation at each stage by performing only univariate regressions, as opposed to recalculating a multivariate regression with each new variable added.

05:01

📉 Advantages of Stage Wise Selection and Introduction to Shrinkage Methods

The speaker discusses the advantages of stage wise selection, particularly its computational efficiency due to the univariate regressions performed at each stage, as opposed to the multivariate regressions required in stepwise selection. They also introduce shrinkage methods, which shrink the coefficients towards zero rather than setting them exactly to zero as in subset selection. Shrinkage methods are presented as a more mathematically sound approach, offering a balance between prediction accuracy and interpretability, despite potentially leaving many variables with small coefficients in the model.

10:30

📉 Ridge Regression and Its Objective

The paragraph delves into ridge regression, a shrinkage method that includes a penalty on the size of the coefficients in the objective function. The goal is to minimize the sum of squared errors while also reducing the magnitude of the coefficients to prevent any single coefficient from becoming too large. This approach is contrasted with subset selection methods and is shown to be a mathematically robust method for dealing with multicollinearity and overfitting. The paragraph also explains the process of converting the constrained optimization problem into an unconstrained one by introducing a Lagrange multiplier, leading to the formulation of the ridge regression solution.

15:33

🔍 The Role of λ in Ridge Regression and Its Effect on Model Variance

This section explains the importance of the regularization parameter λ in ridge regression. It discusses how imposing a constraint on the size of the coefficients can reduce the variance of the model by limiting the range of values the coefficients can take. The explanation includes the concept of correlated input variables and how large coefficients can cancel each other out, leading to overfitting. The paragraph also clarifies that β0, the intercept, is not penalized to ensure that simple shifts in the data do not affect the fit of the model.
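An illustrative NumPy check (not from the lecture) of the cancellation effect described here: with two highly correlated inputs, wildly different coefficient pairs give nearly the same predictions, which is exactly the ambiguity the size penalty removes.

    import numpy as np
    rng = np.random.default_rng(2)
    x1 = rng.normal(size=1000)
    x2 = x1 + 1e-6 * rng.normal(size=1000)          # x2 is almost identical to x1
    X = np.column_stack([x1, x2])
    pred_small = X @ np.array([0.5, 0.0])           # modest coefficients
    pred_large = X @ np.array([1000.5, -1000.0])    # huge, nearly cancelling coefficients
    print(np.max(np.abs(pred_small - pred_large)))  # tiny: the two fits are almost indistinguishable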

20:42

📉 Centering Data and Solving for Ridge Regression Coefficients

The final paragraph in the script describes the process of centering the input data to eliminate the need for the intercept β0 in ridge regression. By subtracting the mean from the Y values and from the columns of X, the data is centered, and the regression can be performed without β0. The paragraph explains that after obtaining the centered values, the ridge regression coefficients can be solved for, and β0 can be estimated as the average of the outputs. The script concludes with a note on the invertibility guaranteed by adding the λI term, and the original motivation for ridge regression: addressing numerical instability in inverting XᵀX.

Keywords

💡Forward Stage Wise Selection

Forward Stage Wise Selection is a method of feature selection in regression analysis where at each stage, a variable most correlated with the residual from the previous model is added. It is a step-by-step process that builds a model by sequentially adding variables that best predict the residual error from the previous stage. This method is highlighted in the script as a way to improve prediction accuracy by focusing on the error term of the model.

💡Residual

In the context of the script, a 'residual' refers to the difference between the actual observed values and the values predicted by a model. The concept is central to the explanation of the Forward Stage Wise Selection process, where the goal is to find variables that can predict these residuals, thereby improving the model's accuracy.

💡Regression

Regression is a statistical method used to model the relationship between variables. In the script, regression is used to find the relationship between the output and the variables, and to predict residuals. It is the fundamental process in the Forward Stage Wise Selection method, where at each stage, a new variable is used to regress the current residual.

💡Predictor

A 'predictor' in the script refers to the independent variables used in a regression model to predict the dependent variable. The predictor is built stage by stage, with each new variable added based on its correlation with the residual, as described in the Forward Stage Wise Selection process.

💡Coefficient

In the script, 'coefficient' represents the multiplier of a predictor in a regression model, indicating the strength and direction of the relationship between the predictor and the output variable. Coefficients are determined through the regression process and are key to understanding how each variable contributes to the prediction.

💡Univariate Regression

Univariate Regression is a type of regression analysis where only one predictor variable is used to model the relationship with the response variable. The script mentions univariate regression in the context of the Forward Stage Wise Selection method, where at each stage, a univariate regression is performed with the residual and a new variable.

💡Multivariate Regression

Multivariate Regression involves using multiple predictor variables to model the relationship with a single response variable. The script contrasts univariate regression with multivariate regression, noting that in Forward Stepwise Selection, a new variable requires redoing the regression with all variables, whereas in the Stage Wise Selection, only univariate regression is needed at each stage.

💡Shrinkage Methods

Shrinkage Methods are a class of statistical techniques used to reduce the impact of less important variables in a model by shrinking their coefficients towards zero. The script introduces shrinkage methods as an alternative to subset selection and stage wise selection, aiming to improve prediction accuracy and model interpretability.

💡Ridge Regression

Ridge Regression is a type of shrinkage method that adds a penalty term to the sum of squares in a regression model, discouraging large coefficients. The script explains that ridge regression is used to reduce the variance of the model and to prevent overfitting by shrinking coefficients, especially when dealing with multicollinearity.

💡L2 Norm

The L2 Norm, mentioned in the script, is a measure of the size or length of a vector. In the context of ridge regression, the L2 Norm is used as a constraint to limit the size of the coefficients, which helps in reducing the model's complexity and improving its generalizability.
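In the lecture's notation, the squared L2 norm used in the constraint is simply:

    \|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2 \le t .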

💡Intercept

The 'intercept' in a regression model is the point where the line crosses the y-axis. The script discusses the treatment of the intercept (β0) in ridge regression, noting that it is often not penalized to avoid issues with shifts in the data and to ensure that simple translations of the data do not change the fit of the model.

Highlights

Introduction to forward stage wise selection, a method for variable selection in regression analysis.

Explanation of the process in forward stage wise selection, emphasizing the correlation between variables and residuals.

Advantage of forward stage wise selection over forward stepwise selection in terms of computational efficiency.

Clarification on why coefficients in stage wise selection may differ from those in a full linear regression with all variables.

Introduction to shrinkage methods as an alternative to subset selection for variable importance.

The concept of shrinking coefficients towards zero in shrinkage methods to improve interpretability and prediction accuracy.

Ridge regression as a specific shrinkage method that penalizes the size of coefficients.

The rationale behind not penalizing the intercept (β0) in ridge regression to maintain the fit's consistency with data shifts.

The mathematical formulation of ridge regression, including the L2 norm constraint and its implications.

The practical approach of centering data to eliminate the need for β0 in ridge regression.

The transformation of the ridge regression problem into an unconstrained one by introducing λ.

The relationship between λ and T in ridge regression, and the practical approach of choosing λ.

The original motivation for ridge regression to address the issue of ill-conditioned matrices in regression.

The advantage of ridge regression in ensuring the invertibility of XᵀX + λI and improving numerical stability.

The broader implications of ridge regression in terms of shrinkage and its connection to other statistical properties.

Encouragement for students to read further on ridge regression and its connections to other statistical concepts.

Transcripts

00:00

Right, so it is called forward stage wise selection, where at each stage you do the following. Let me rephrase it: in the first stage you pick the variable that is most correlated with the output, regress the output on that variable, and find the residual. Then you pick the variable that is most correlated with the residual and regress the residual on that variable. Now add it to your predictor. So what is your predictor? You already had one variable, with a coefficient you got from the first regression. Now you have a second variable, with a coefficient you got by regressing the residual on that variable. Essentially, the first variable makes some prediction, and the second variable is going to try to predict what the error is, so now I am adding the error estimate to the prediction of the first variable. Did that make sense?

01:34

Say this is the true output that I want. The first variable makes a prediction, so this is the fitted value, and this is the residual. What I am trying to do with the second variable is to predict this gap. So when I add the second variable, with its coefficient, to the first variable, the first variable gives its output, the second variable makes some other prediction, and I add the two, so the new output is that much. Now I still have a residual left, so I pick a third variable which is maximally correlated with this residual, add the outputs of all three, and I get my new predictor. Does that make sense? So at every stage I find the residual, whatever has not been predicted correctly by the previous stages, and try to predict that using a new variable: I find the direction most correlated with this residual and fit to it. This is called forward stage wise selection.

02:54

So what is the advantage of stage wise selection? Come on, I asked a question. Can you think of any advantage of this? No, I was not randomly picking a variable in the previous methods either; I was picking greedily, which is not random. Even in the previous case I only picked variables that gave me better fits. In fact, I will tell you that forward stepwise selection will probably converge faster than forward stage wise. But there is another significant advantage here if you think about the process of fitting the coefficients: at every stage I do a univariate regression, I am just regressing the residual on one variable. In forward stepwise selection, every time I add a new variable I have to do a multivariate regression; I have to do the regression all over again and cannot reuse the coefficients from the previous step. When I add a new variable I now have k+1 variables and must run a new regression with k+1 variables, whereas here, at every stage, I just do a univariate regression and keep all the work I have done so far intact.

04:50

Since we are doing this one variable at a time, the coefficients I have for the k variables in the system might not be the same coefficients I would have gotten if I had started with those k variables and done a single linear regression on them. The coefficients could be different; if I take those k variables and do a linear regression I will get a better fit than this stage wise fit. But we prefer the stage wise approach because it saves us a lot of computation. Eventually everything catches up and we get the same kind of prediction at the end of it, though you might end up adding a few more variables in this approach, and that is fine. So the next class of methods we will look at are called shrinkage methods.

06:06

The idea is to shrink some of the parameters towards zero. In subset selection, essentially, all the variables we did not select have their coefficients set to zero. But instead of doing an arbitrary greedy search or stage wise selection and so on, in shrinkage methods we come up with a proper optimization formulation which allows us to shrink the unnecessary coordinates. Ideally you would like to shrink them all the way to zero, but there are problems in doing that, so we try to keep them as small as possible; you can do some post-processing and get rid of the really small coordinates. This is fine from the prediction accuracy point of view; from the interpretability point of view it still leaves a little to be desired, because you might have a lot of variables with very small coefficients left in the system. But mathematically this is a much sounder method than the things we have been talking about. And of course that one is the soundest, but also impossible.

07:26

So the first thing we look at is called ridge regression. The whole idea behind all of these shrinkage methods is that you have your usual objective function, the sum of squared errors, which you are trying to minimize. In addition, you impose a penalty on the size of the coefficients. You want to reduce the error, but not at the cost of making some coefficient very large. So your optimization procedure will try to find solutions whose coefficients are as small as possible while giving a similar minimization of the squared error objective.

08:49

It is okay, I will use that much of the board and write things here. So what is your normal objective function? That is the normal objective function for finding the β, and your β hat is essentially its minimizer. Now what I am saying is, let us not do this; let us do this with a constraint. What is the constraint? It is fairly straightforward: I have added a squared norm constraint. This is essentially the L2 norm of the coefficient vector; instead of taking the root I just leave it as a square, which does not matter. So it is like an L2 norm constraint on the coefficients, and I can make this into an unconstrained problem.
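[Board equations, reconstructed in standard notation: the constrained problem and its unconstrained Lagrangian form.]

    \min_{\beta_0,\,\beta}\ \sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^{2}
    \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^{2}\le t ,

    \min_{\beta_0,\,\beta}\ \sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^{2}
    \;+\;\lambda\sum_{j=1}^{p}\beta_j^{2}, \qquad \lambda\ge 0 .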

11:28

Because λ has to be greater than zero. Why do I want the βs to be small? That is a good question, actually. What we wanted to do was make sure we are reducing the variance of the model; that is essentially what we are trying to do. In subset selection we set coefficients to zero, so you have a lot fewer parameters to estimate. Now, by imposing a size constraint on the parameters, I am actually reducing the range over which these coefficients can move around. If you think about it, if I have correlated or anti-correlated input variables, say two variables x1 and x2 which are correlated, then I can have a large β1 and a large negative β2 that essentially cancel each other out in terms of the predictions I am making, because x1 and x2 are themselves correlated. I can make my β1 very large and my β2 very negative so that the actual effects of the two variables just cancel out; it is essentially only a combination of β1 and β2 that matters, not their actual values. In that case I can have a large class of models which give me exactly the same output. This makes my problem much harder to control and increases the difficulty of the estimation problem. But now we are saying, no, I cannot allow these coefficients to become very large, so I am restricting the class of models I am going to be looking at. That is the reason why decreasing the size of β helps; I did not explain this completely last time, so thanks for asking the question. We just have to make sure that λ is positive, since Lagrange multipliers have to be positive and so on. So now I can go ahead and minimize this.

13:51

A couple of things I want to point out now. One thing is, if you notice the penalty here, what do you notice about it? I am not including β0; the sum runs from 1 to p, not from 0 to p. Also note that I explicitly wrote out β0 here and did not squish it into a (p+1)-dimensional coefficient vector, because I am going to treat β0 specially. If I penalize β0, what happens if I move my data up? Say this is my X and Y axis and this is the data I had; it is a univariate regression problem, Y is my response and X is my input, and I have to fit a line. Now take the same data points and shift them up; shifting the data points up is hard to draw, so I will just shift the origin instead. If I shift the origin, what happens if I penalize β0? Penalizing β0 will try to keep the intercept small. Earlier, if you look at the fit, it passed very close to the origin and the intercept was close to zero. Now that I have shifted the data, the penalty still tries to keep the intercept small, so instead of the line simply shifting up, its slope will change; it is the same data, just shifted up a little, but the fitted line will tilt because I am penalizing β0. We do not want that to happen: simple shifts in the data should not change the fit. So we do not penalize β0. Does that make sense? And anyway we know what β0 should be; it should be the average of the outputs.

16:38

One way we can get rid of β0 from this optimization problem is to center the inputs. We subtract the average from the Yi's, and likewise we subtract the column averages from all the X's, so all the X variables are centered on zero. This gives me a centered input, and then I just do the regression on this centered input and there is no β0. From now on, when I write X it is an n×p matrix where the inputs have been centered. Essentially what I have done is take my data and translate it so that whatever fit I get will pass through the origin, and I will go back and add β0 later to recover the original fit. Does that make sense? Good.

18:40

In matrix form I write it like this; you can minimize it, take the derivative, set it to zero, and solve, and you will get this. Here both my X and Y are centered: I subtracted the mean from Y and the mean from each column of X. Once I have these centered values I can solve for the ridge estimates β hat for coefficients 1 to p, and I estimate β0 as Ȳ, which gives me the full solution. Is that fine?
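[Board equations, reconstructed in standard notation: the matrix form of the penalized criterion on centred X and y, and its solution.]

    \text{RSS}(\lambda) = (y - X\beta)^{\top}(y - X\beta) + \lambda\,\beta^{\top}\beta ,
    \qquad
    \hat{\beta}^{\text{ridge}} = (X^{\top}X + \lambda I)^{-1} X^{\top} y ,
    \qquad
    \hat{\beta}_0 = \bar{y} .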

19:58

One thing I forgot to point out earlier: remember I had this variable t, the upper bound, where I said the minimization is subject to the constraint that the squared norm should not be larger than t. The t has vanished, but you can show that this λ and t are related; for every choice of t there is a corresponding choice of λ. Typically you choose an appropriate λ and work with it, and you do not worry about the t formulation. Any questions on this?

20:32

So this tells you why it is called ridge regression: what you have essentially done is add a ridge to your data matrix. You take XᵀX and then add λ times the identity, which is like adding a ridge of size λ to the diagonal elements of XᵀX. That is why it is called ridge regression. So why are we doing this, and can you see one advantage of adding this λI term? The whole thing becomes invertible. As soon as I add λI, I am sure the matrix is non-singular; even if XᵀX was originally singular, adding λI makes it non-singular and invertible.

21:32

In fact, this was the original motivation for ridge regression. Back in, I forget, the 50s, when people came up with ridge regression, the original motivation was that XᵀX could be badly conditioned even if it is non-singular; we talked about this in the last class. Some variables may be so highly correlated that even if the matrix is invertible, numerically you will get into problems; I told you the residual might be very small, so when you try to fit the coefficients you run into trouble. Numerically the inversion might be a problem even if the matrix is non-singular, but by adding this λI term you make sure it is invertible, and by controlling the size of λ you can make sure the problem is also numerically well behaved. So the original motivation for ridge regression was essentially to make the problem solvable in the first place. But then people went back and understood ridge regression in terms of shrinkage, or variance reduction, and since that makes it convenient to talk about a whole class of shrinkage problems, we motivate ridge regression from the viewpoint of shrinkage rather than the inversion problem.

23:07

Any questions? I am going to encourage you to read the discussion that follows ridge regression in the book. It requires you to work out some things along with the book; you cannot just sit there and passively read it, but it draws a lot more connections from ridge regression to a variety of other statistical properties of the data which will be useful to know, and I will ask you questions on it later. So go and read the discussion. So, the next thing.

