Week 3 Lecture 11 Subset Selection 2

Machine Learning- Balaraman Ravindran

4 Aug 202123:43

Summary

TLDRThe script discusses 'forward stage wise selection', a method for feature selection in regression models where variables are added one at a time to predict the residual error from the previous stage. It highlights the method's efficiency and the advantage of faster convergence compared to forward stepwise selection. The script then transitions to 'shrinkage methods', emphasizing their mathematical soundness and the introduction of ridge regression, which includes a penalty on coefficient size to reduce model variance. The explanation includes the rationale behind not penalizing the intercept and the process of centering inputs to eliminate the need for β0 in the optimization. The summary concludes with the benefits of ridge regression in ensuring numerical stability and solvability.

Takeaways

🔍 The script discusses a method called 'forward stage wise selection' for variable selection in regression models, where at each stage a variable most correlated with the residual is added to the predictor.
📉 The process involves starting with a single variable most correlated with the output, regressing the output on that variable, and then iteratively adding new variables that are most correlated with the current residual.
🔧 The advantage of forward stage wise selection is that it potentially converges faster than forward stepwise selection and requires only univariate regressions at each stage, making it computationally efficient.
🔄 However, the coefficients obtained in stage wise selection may not be the same as those from a single multivariate regression with all variables, which could lead to a different fit.
🏰 The script then introduces 'shrinkage methods' as an alternative to subset selection, which aims to shrink some parameters towards zero rather than setting them to zero.
🔬 Shrinkage methods are based on an optimization formulation that allows for reducing the coefficients of unnecessary variables, ideally to zero, to improve prediction accuracy and interpretability.
📊 The concept of ridge regression is explained, which involves adding a penalty on the size of the coefficients to the usual objective function of minimizing the sum of squared errors.
🎯 The purpose of the penalty in ridge regression is to reduce the variance of the model by constraining the size of the coefficients, preventing them from becoming very large and causing overfitting.
📐 The script explains that the ridge regression modifies the normal least squares problem by adding a squared norm constraint, which is equivalent to adding a 'ridge' to the data matrix.
🔑 The script points out that not penalizing the intercept (β0) is important to ensure that simple shifts in the data do not change the fit, and suggests centering the data to handle this.
🧩 The script concludes by highlighting that ridge regression not only makes the problem numerically well-behaved by ensuring the invertibility of the data matrix but also serves as a foundation for understanding a broader class of shrinkage problems.

Q & A

What is forward stage wise selection in the context of the script?
-Forward stage wise selection is a method where at each stage, a variable most correlated with the residual from the previous stage is selected, and a regression is performed on that residual. This process continues, adding each new variable to the predictor to improve the prediction by accounting for the error of the previous variables.
What is the purpose of picking the variable most correlated with the residual in forward stage wise selection?
-The purpose is to predict the unaccounted portion of the output, known as the residual. By selecting the variable most correlated with the residual, the model attempts to minimize the error and improve the prediction accuracy at each stage.
How does the predictor evolve in forward stage wise selection?
-In forward stage wise selection, the predictor evolves by sequentially adding new variables that are most correlated with the current residual. Each new variable comes with a coefficient determined by regressing the residual on that variable, and this coefficient is used to update the predictor.
What is the advantage of forward stage wise selection over forward stepwise selection?
-Forward stage wise selection may converge faster than forward stepwise selection because at each stage, only a univariate regression is performed, and the coefficients from previous stages are reused. In contrast, forward stepwise selection requires redoing the multivariate regression each time a new variable is added, which can be computationally more intensive.
Why might the coefficients from a stage wise selection process differ from those of a full linear regression with all variables?
-The coefficients may differ because in stage wise selection, the model is built incrementally, and each variable's effect is considered in the context of the previously added variables. In a full linear regression, all variables are considered simultaneously, which can lead to a different distribution of influence among the variables and, consequently, different coefficients.
What is the primary goal of shrinkage methods in regression?
-The primary goal of shrinkage methods is to reduce the size of the coefficients in a regression model. This is achieved by imposing a penalty on the coefficients, which encourages the model to shrink unnecessary coordinates towards zero, thereby improving prediction accuracy and reducing overfitting.
What is ridge regression, and how does it relate to shrinkage methods?
-Ridge regression is a type of shrinkage method that introduces a penalty on the size of the coefficients in the regression model. It adds a term to the objective function that penalizes large coefficients, effectively shrinking them towards zero. This helps in reducing the variance of the model and improving its generalizability.
Why is β0 often not penalized in ridge regression?
-β0, the intercept, is often not penalized to ensure that shifts in the data do not affect the fit of the model. Penalizing β0 could cause the model to change its intercept in response to such shifts, which is undesirable as it would affect the model's ability to generalize across different datasets.
How does centering the inputs and outputs affect ridge regression?
-Centering the inputs and outputs allows for the elimination of β0 from the optimization problem in ridge regression. By subtracting the mean from the Y values and the columns of X, the model can be fit without an intercept, simplifying the process and ensuring the fit passes through the origin.
What is the significance of the λ parameter in ridge regression?
-The λ parameter in ridge regression determines the strength of the penalty on the coefficients. A larger λ value results in more shrinkage of the coefficients, while a smaller λ value allows for larger coefficients. The choice of λ is crucial as it affects the trade-off between bias and variance in the model.
How does ridge regression address the issue of multicollinearity in the data?
-Ridge regression addresses multicollinearity by imposing a penalty on the coefficients, which restricts their size. This reduces the impact of highly correlated variables on the model, making it more robust to multicollinearity and improving the numerical stability of the regression.