Week 3 Lecture 11 Subset Selection 2

Machine Learning - Balaraman Ravindran
4 Aug 2021 (23:43)

Summary

TLDR: The script discusses 'forward stage wise selection', a method for feature selection in regression models where variables are added one at a time, each chosen to predict the residual error left by the previous stage. It highlights the method's computational efficiency: each stage needs only a univariate regression and reuses the coefficients already fitted, although forward stepwise selection may converge in fewer stages. The script then transitions to 'shrinkage methods', emphasizing their mathematical soundness, and introduces ridge regression, which adds a penalty on coefficient size to reduce model variance. The explanation includes the rationale for not penalizing the intercept and the practice of centering the inputs to eliminate β0 from the optimization. The summary concludes with the benefits of ridge regression in ensuring numerical stability and solvability.

Takeaways

  • 🔍 The script discusses a method called 'forward stage wise selection' for variable selection in regression models, where at each stage a variable most correlated with the residual is added to the predictor.
  • 📉 The process involves starting with a single variable most correlated with the output, regressing the output on that variable, and then iteratively adding new variables that are most correlated with the current residual.
  • 🔧 The advantage of forward stage wise selection is that it requires only a univariate regression at each stage and reuses all earlier coefficients, making it computationally efficient, even though forward stepwise selection may converge in fewer stages.
  • 🔄 However, the coefficients obtained in stage wise selection may not be the same as those from a single multivariate regression on all the selected variables, which would typically give a somewhat better fit; stage wise selection is preferred because it saves computation.
  • 🏰 The script then introduces 'shrinkage methods' as an alternative to subset selection; these aim to shrink some parameters towards zero rather than setting them exactly to zero.
  • 🔬 Shrinkage methods are based on an optimization formulation that allows for reducing the coefficients of unnecessary variables, ideally to zero, to improve prediction accuracy and interpretability.
  • 📊 The concept of ridge regression is explained, which involves adding a penalty on the size of the coefficients to the usual objective function of minimizing the sum of squared errors.
  • 🎯 The purpose of the penalty in ridge regression is to reduce the variance of the model by constraining the size of the coefficients, preventing them from becoming very large and causing overfitting.
  • 📐 The script explains that ridge regression modifies the normal least squares problem by adding a squared (L2) norm penalty on the coefficients, which amounts to adding a 'ridge' of size λ along the diagonal of XᵀX.
  • 🔑 The script points out that not penalizing the intercept (β0) is important to ensure that simple shifts in the data do not change the fit, and suggests centering the data to handle this.
  • 🧩 The script concludes by highlighting that ridge regression makes the problem numerically well behaved by ensuring that XᵀX + λI is invertible, and also serves as a foundation for understanding a broader class of shrinkage problems.

Q & A

  • What is forward stage wise selection in the context of the script?

    -Forward stage wise selection is a method where at each stage the variable most correlated with the current residual is selected and the residual is regressed on that variable. This process continues, adding each new variable to the predictor so that it accounts for the error left by the previously selected variables.

  • What is the purpose of picking the variable most correlated with the residual in forward stage wise selection?

    -The purpose is to predict the unaccounted portion of the output, known as the residual. By selecting the variable most correlated with the residual, the model attempts to minimize the error and improve the prediction accuracy at each stage.

  • How does the predictor evolve in forward stage wise selection?

    -In forward stage wise selection, the predictor evolves by sequentially adding new variables that are most correlated with the current residual. Each new variable comes with a coefficient determined by regressing the residual on that variable, and this coefficient is used to update the predictor.
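    A minimal NumPy sketch of the stage wise loop described above, assuming the columns of X and the output y have already been centred; the function name forward_stagewise and the stopping rule are illustrative, not from the lecture:

        import numpy as np

        def forward_stagewise(X, y, n_steps=None):
            # At each stage: pick the unused column most correlated with the
            # current residual, regress the residual on that single column
            # (univariate fit), record the coefficient, update the residual.
            n, p = X.shape
            n_steps = p if n_steps is None else n_steps
            beta = np.zeros(p)
            residual = y.astype(float).copy()
            used = set()
            for _ in range(min(n_steps, p)):
                # correlation (up to scale) of each unused column with the residual
                scores = [abs(X[:, j] @ residual) / np.linalg.norm(X[:, j])
                          if j not in used else -np.inf for j in range(p)]
                j = int(np.argmax(scores))
                b_j = (X[:, j] @ residual) / (X[:, j] @ X[:, j])  # univariate coefficient
                beta[j] = b_j
                residual = residual - b_j * X[:, j]               # residual for the next stage
                used.add(j)
            return beta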

  • What is the advantage of forward stage wise selection over forward stepwise selection?

    -At each stage, forward stage wise selection performs only a univariate regression (of the current residual on one variable) and keeps the coefficients from previous stages intact, so each stage is computationally cheap. Forward stepwise selection, by contrast, must redo a multivariate regression over all selected variables every time a new variable is added; it may converge in fewer stages, but stage wise selection saves a great deal of computation.

  • Why might the coefficients from a stage wise selection process differ from those of a full linear regression with all variables?

    -The coefficients may differ because in stage wise selection, the model is built incrementally, and each variable's effect is considered in the context of the previously added variables. In a full linear regression, all variables are considered simultaneously, which can lead to a different distribution of influence among the variables and, consequently, different coefficients.
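    Continuing the forward_stagewise sketch above, with illustrative synthetic data (not from the lecture), the difference can be checked directly:

        import numpy as np
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 4)); X -= X.mean(axis=0)            # centred inputs
        y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=100)
        y -= y.mean()
        beta_stage = forward_stagewise(X, y)                          # from the sketch above
        beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)             # all variables at once
        # beta_stage and beta_full generally differ; the full regression fits slightly better.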

  • What is the primary goal of shrinkage methods in regression?

    -The primary goal of shrinkage methods is to reduce the size of the coefficients in a regression model. This is achieved by imposing a penalty on the coefficients, which encourages the model to shrink unnecessary coordinates towards zero, thereby improving prediction accuracy and reducing overfitting.

  • What is ridge regression, and how does it relate to shrinkage methods?

    -Ridge regression is a type of shrinkage method that introduces a penalty on the size of the coefficients in the regression model. It adds a term to the objective function that penalizes large coefficients, effectively shrinking them towards zero. This helps in reducing the variance of the model and improving its generalizability.
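    In standard notation (consistent with the lecture's use of β0, βj and λ, though the exact board notation is not captured in the transcript), the penalized objective is:

        \hat{\beta}^{\text{ridge}}
          = \arg\min_{\beta_0,\,\beta}\;
            \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
            + \lambda \sum_{j=1}^{p} \beta_j^{2}, \qquad \lambda \ge 0 .

    Note that the penalty sum runs from 1 to p, so β0 is not penalized.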

  • Why is β0 often not penalized in ridge regression?

    -β0, the intercept, is not penalized so that simple shifts (translations) of the data do not change the fit. If β0 were penalized, shifting all the outputs upward would force the intercept to stay small, which would change the slope of the fitted line instead of simply shifting it, an undesirable change to the fit itself.

  • How does centering the inputs and outputs affect ridge regression?

    -Centering the inputs and outputs allows for the elimination of β0 from the optimization problem in ridge regression. By subtracting the mean from the Y values and the columns of X, the model can be fit without an intercept, simplifying the process and ensuring the fit passes through the origin.
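    A minimal NumPy sketch of this recipe (centre, solve, then recover the intercept); the names ridge_fit and ridge_predict are illustrative, not from the lecture:

        import numpy as np

        def ridge_fit(X, y, lam):
            # Centre inputs and output, then solve (Xc'Xc + lam*I) beta = Xc'y.
            x_mean, y_mean = X.mean(axis=0), y.mean()
            Xc, yc = X - x_mean, y - y_mean
            p = Xc.shape[1]
            beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
            return y_mean, x_mean, beta        # beta0 is estimated as the mean of y

        def ridge_predict(X_new, beta0, x_mean, beta):
            # New inputs are centred with the training means before applying beta.
            return beta0 + (X_new - x_mean) @ beta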

  • What is the significance of the λ parameter in ridge regression?

    -The λ parameter in ridge regression determines the strength of the penalty on the coefficients. A larger λ value results in more shrinkage of the coefficients, while a smaller λ value allows for larger coefficients. The choice of λ is crucial as it affects the trade-off between bias and variance in the model.
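    One common (illustrative) way to choose λ is by validation error, reusing the ridge_fit/ridge_predict sketch above on synthetic data:

        import numpy as np
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 10))
        y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)
        X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]
        val_err = {}
        for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
            beta0, x_mean, beta = ridge_fit(X_tr, y_tr, lam)
            pred = ridge_predict(X_va, beta0, x_mean, beta)
            val_err[lam] = np.mean((y_va - pred) ** 2)
        best_lam = min(val_err, key=val_err.get)  # larger lambda: more shrinkage, more bias, less variance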

  • How does ridge regression address the issue of multicollinearity in the data?

    -Ridge regression addresses multicollinearity by penalizing the coefficients, which prevents large coefficients on highly correlated variables from cancelling each other out. In addition, the λI term added to XᵀX guarantees that XᵀX + λI is non-singular and well conditioned, improving the numerical stability of the regression.
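    A small NumPy illustration (not from the lecture) of how the λI term repairs an ill-conditioned XᵀX produced by two nearly collinear columns:

        import numpy as np
        rng = np.random.default_rng(1)
        x1 = rng.normal(size=100)
        X = np.column_stack([x1, x1 + 1e-8 * rng.normal(size=100)])    # almost identical columns
        print(np.linalg.cond(X.T @ X))                    # huge: inversion is numerically fragile
        print(np.linalg.cond(X.T @ X + 1.0 * np.eye(2)))  # modest: well behaved after adding lambda*I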

Outlines

00:00

🔍 Forward Stage Wise Selection Process

This paragraph introduces the concept of forward stage wise selection in regression analysis. It explains the iterative process of selecting variables based on their correlation with the residual error from the previous stage's prediction. The method starts with the variable most correlated with the output, regresses the output on it to obtain a residual, and then selects the next variable most correlated with this residual, so that each new variable predicts the error left by the previous ones. Forward stepwise selection may converge in fewer stages, but the advantage of this method is that it simplifies the computation at each stage by performing only univariate regressions, as opposed to recalculating a multivariate regression with each new variable added.

05:01

📉 Advantages of Stage Wise Selection and Introduction to Shrinkage Methods

The speaker discusses the advantages of stage wise selection, particularly its computational efficiency due to the univariate regressions performed at each stage, as opposed to the multivariate regressions required in stepwise selection. They also introduce shrinkage methods, which shrink the coefficients towards zero rather than setting them exactly to zero as in subset selection. Shrinkage methods are presented as a more mathematically sound approach, offering a balance between prediction accuracy and interpretability, despite potentially leaving many variables with small coefficients in the model.

10:30

📉 Ridge Regression and Its Objective

The paragraph delves into ridge regression, a shrinkage method that includes a penalty on the size of the coefficients in the objective function. The goal is to minimize the sum of squared errors while also reducing the magnitude of the coefficients to prevent any single coefficient from becoming too large. This approach is contrasted with subset selection methods and is shown to be a mathematically robust method for dealing with multicollinearity and overfitting. The paragraph also explains the process of converting the constrained optimization problem into an unconstrained one by introducing a Lagrange multiplier, leading to the formulation of the ridge regression solution.

15:33

🔍 The Role of λ in Ridge Regression and Its Effect on Model Variance

This section explains the importance of the regularization parameter λ in ridge regression. It discusses how imposing a constraint on the size of the coefficients can reduce the variance of the model by limiting the range of values the coefficients can take. The explanation includes the concept of correlated input variables and how large coefficients can cancel each other out, leading to overfitting. The paragraph also clarifies that β0, the intercept, is not penalized to ensure that simple shifts in the data do not affect the fit of the model.
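An illustrative NumPy check (not from the lecture) of the cancellation effect described here: with two highly correlated inputs, wildly different coefficient pairs give nearly the same predictions, which is exactly the ambiguity the size penalty removes.

    import numpy as np
    rng = np.random.default_rng(2)
    x1 = rng.normal(size=1000)
    x2 = x1 + 1e-6 * rng.normal(size=1000)          # x2 is almost identical to x1
    X = np.column_stack([x1, x2])
    pred_small = X @ np.array([0.5, 0.0])           # modest coefficients
    pred_large = X @ np.array([1000.5, -1000.0])    # huge, nearly cancelling coefficients
    print(np.max(np.abs(pred_small - pred_large)))  # tiny: the two fits are almost indistinguishable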

20:42

📉 Centering Data and Solving for Ridge Regression Coefficients

The final paragraph in the script describes the process of centering the input data to eliminate the need for the intercept β0 in ridge regression. By subtracting the mean from the Y values and from the columns of X, the data is centered, and the regression can be performed without β0. The paragraph explains that after obtaining the centered values, the ridge regression coefficients can be solved for, and β0 can be estimated as the average of the outputs. The script concludes with a note on the invertibility guaranteed by adding the λI term, and the original motivation for ridge regression: addressing numerical instability in inverting XᵀX.

Keywords

💡Forward Stage Wise Selection

Forward Stage Wise Selection is a method of feature selection in regression analysis where at each stage, a variable most correlated with the residual from the previous model is added. It is a step-by-step process that builds a model by sequentially adding variables that best predict the residual error from the previous stage. This method is highlighted in the script as a way to improve prediction accuracy by focusing on the error term of the model.

💡Residual

In the context of the script, a 'residual' refers to the difference between the actual observed values and the values predicted by a model. The concept is central to the explanation of the Forward Stage Wise Selection process, where the goal is to find variables that can predict these residuals, thereby improving the model's accuracy.

💡Regression

Regression is a statistical method used to model the relationship between variables. In the script, regression is used to find the relationship between the output and the variables, and to predict residuals. It is the fundamental process in the Forward Stage Wise Selection method, where at each stage, a new variable is used to regress the current residual.

💡Predictor

A 'predictor' in the script refers to the independent variables used in a regression model to predict the dependent variable. The predictor is built stage by stage, with each new variable added based on its correlation with the residual, as described in the Forward Stage Wise Selection process.

💡Coefficient

In the script, 'coefficient' represents the multiplier of a predictor in a regression model, indicating the strength and direction of the relationship between the predictor and the output variable. Coefficients are determined through the regression process and are key to understanding how each variable contributes to the prediction.

💡Univariate Regression

Univariate Regression is a type of regression analysis where only one predictor variable is used to model the relationship with the response variable. The script mentions univariate regression in the context of the Forward Stage Wise Selection method, where at each stage, a univariate regression is performed with the residual and a new variable.

💡Multivariate Regression

Multivariate Regression involves using multiple predictor variables to model the relationship with a single response variable. The script contrasts univariate regression with multivariate regression, noting that in Forward Stepwise Selection, a new variable requires redoing the regression with all variables, whereas in the Stage Wise Selection, only univariate regression is needed at each stage.

💡Shrinkage Methods

Shrinkage Methods are a class of statistical techniques used to reduce the impact of less important variables in a model by shrinking their coefficients towards zero. The script introduces shrinkage methods as an alternative to subset selection and stage wise selection, aiming to improve prediction accuracy and model interpretability.

💡Ridge Regression

Ridge Regression is a type of shrinkage method that adds a penalty term to the sum of squares in a regression model, discouraging large coefficients. The script explains that ridge regression is used to reduce the variance of the model and to prevent overfitting by shrinking coefficients, especially when dealing with multicollinearity.

💡L2 Norm

The L2 Norm, mentioned in the script, is a measure of the size or length of a vector. In the context of ridge regression, the L2 Norm is used as a constraint to limit the size of the coefficients, which helps in reducing the model's complexity and improving its generalizability.
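In the lecture's notation, the squared L2 norm used in the constraint is simply:

    \|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2 \le t .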

💡Intercept

The 'intercept' in a regression model is the point where the line crosses the y-axis. The script discusses the treatment of the intercept (β0) in ridge regression, noting that it is often not penalized to avoid issues with shifts in the data and to ensure that simple translations of the data do not change the fit of the model.

Highlights

Introduction to forward stage wise selection, a method for variable selection in regression analysis.

Explanation of the process in forward stage wise selection, emphasizing the correlation between variables and residuals.

Advantage of forward stage wise selection over forward stepwise selection in terms of computational efficiency.

Clarification on why coefficients in stage wise selection may differ from those in a full linear regression with all variables.

Introduction to shrinkage methods as an alternative to subset selection for variable importance.

The concept of shrinking coefficients towards zero in shrinkage methods to improve interpretability and prediction accuracy.

Ridge regression as a specific shrinkage method that penalizes the size of coefficients.

The rationale behind not penalizing the intercept (β0) in ridge regression to maintain the fit's consistency with data shifts.

The mathematical formulation of ridge regression, including the L2 norm constraint and its implications.

The practical approach of centering data to eliminate the need for β0 in ridge regression.

The transformation of the ridge regression problem into an unconstrained one by introducing λ.

The relationship between λ and T in ridge regression, and the practical approach of choosing λ.

The original motivation for ridge regression to address the issue of ill-conditioned matrices in regression.

The advantage of ridge regression in ensuring the invertibility of XᵀX + λI and improving numerical stability.

The broader implications of ridge regression in terms of shrinkage and its connection to other statistical properties.

Encouragement for students to read further on ridge regression and its connections to other statistical concepts.

Transcripts

00:00

Right, so it is called forward stage wise selection, where at each stage you do the following. Let me rephrase it: in the first stage you pick the variable that is most correlated with the output, regress the output on that variable, and find the residual. Then you pick the variable that is most correlated with the residual and regress the residual on that variable. Now add it to your predictor. So what is your predictor? You already had one variable, with a coefficient you got from the first regression. Now you have a second variable, with a coefficient you got by regressing the residual on that variable. Essentially, the first variable makes some prediction, and the second variable is going to try to predict what the error is, so now I am adding the error estimate to the prediction of the first variable. Did that make sense?

01:34

Say this is the true output that I want. The first variable makes a prediction, so this is the fitted value, and this is the residual. What I am trying to do with the second variable is to predict this gap. So when I add the second variable, with its coefficient, to the first variable, the first variable gives its output, the second variable makes some other prediction, and I add the two, so the new output is that much. Now I still have a residual left, so I pick a third variable which is maximally correlated with this residual, add the outputs of all three, and I get my new predictor. Does that make sense? So at every stage I find the residual, whatever has not been predicted correctly by the previous stages, and try to predict that using a new variable: I find the direction most correlated with this residual and fit to it. This is called forward stage wise selection.

02:54

So what is the advantage of stage wise selection? Come on, I asked a question. Can you think of any advantage of this? No, I was not randomly picking a variable in the previous methods either; I was picking greedily, which is not random. Even in the previous case I only picked variables that gave me better fits. In fact, I will tell you that forward stepwise selection will probably converge faster than forward stage wise. But there is another significant advantage here if you think about the process of fitting the coefficients: at every stage I do a univariate regression, I am just regressing the residual on one variable. In forward stepwise selection, every time I add a new variable I have to do a multivariate regression; I have to do the regression all over again and cannot reuse the coefficients from the previous step. When I add a new variable I now have k+1 variables and must run a new regression with k+1 variables, whereas here, at every stage, I just do a univariate regression and keep all the work I have done so far intact.

04:50

Since we are doing this one variable at a time, the coefficients I have for the k variables in the system might not be the same coefficients I would have gotten if I had started with those k variables and done a single linear regression on them. The coefficients could be different; if I take those k variables and do a linear regression I will get a better fit than this stage wise fit. But we prefer the stage wise approach because it saves us a lot of computation. Eventually everything catches up and we get the same kind of prediction at the end of it, though you might end up adding a few more variables in this approach, and that is fine. So the next class of methods we will look at are called shrinkage methods.

06:06

The idea is to shrink some of the parameters towards zero. In subset selection, essentially, all the variables we did not select have their coefficients set to zero. But instead of doing an arbitrary greedy search or stage wise selection and so on, in shrinkage methods we come up with a proper optimization formulation which allows us to shrink the unnecessary coordinates. Ideally you would like to shrink them all the way to zero, but there are problems in doing that, so we try to keep them as small as possible; you can do some post-processing and get rid of the really small coordinates. This is fine from the prediction accuracy point of view; from the interpretability point of view it still leaves a little to be desired, because you might have a lot of variables with very small coefficients left in the system. But mathematically this is a much sounder method than the things we have been talking about. And of course that one is the soundest, but also impossible.

07:26

So the first thing we look at is called ridge regression. The whole idea behind all of these shrinkage methods is that you have your usual objective function, the sum of squared errors, which you are trying to minimize. In addition, you impose a penalty on the size of the coefficients. You want to reduce the error, but not at the cost of making some coefficient very large. So your optimization procedure will try to find solutions whose coefficients are as small as possible while giving a similar minimization of the squared error objective.

08:49

It is okay, I will use that much of the board and write things here. So what is your normal objective function? That is the normal objective function for finding the β, and your β hat is essentially its minimizer. Now what I am saying is, let us not do this; let us do this with a constraint. What is the constraint? It is fairly straightforward: I have added a squared norm constraint. This is essentially the L2 norm of the coefficient vector; instead of taking the root I just leave it as a square, which does not matter. So it is like an L2 norm constraint on the coefficients, and I can make this into an unconstrained problem.
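[Board equations, reconstructed in standard notation: the constrained problem and its unconstrained Lagrangian form.]

    \min_{\beta_0,\,\beta}\ \sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^{2}
    \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^{2}\le t ,

    \min_{\beta_0,\,\beta}\ \sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^{2}
    \;+\;\lambda\sum_{j=1}^{p}\beta_j^{2}, \qquad \lambda\ge 0 .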

11:28

Because λ has to be greater than zero. Why do I want the βs to be small? That is a good question, actually. What we wanted to do was make sure we are reducing the variance of the model; that is essentially what we are trying to do. In subset selection we set coefficients to zero, so you have a lot fewer parameters to estimate. Now, by imposing a size constraint on the parameters, I am actually reducing the range over which these coefficients can move around. If you think about it, if I have correlated or anti-correlated input variables, say two variables x1 and x2 which are correlated, then I can have a large β1 and a large negative β2 that essentially cancel each other out in terms of the predictions I am making, because x1 and x2 are themselves correlated. I can make my β1 very large and my β2 very negative so that the actual effects of the two variables just cancel out; it is essentially only a combination of β1 and β2 that matters, not their actual values. In that case I can have a large class of models which give me exactly the same output. This makes my problem much harder to control and increases the difficulty of the estimation problem. But now we are saying, no, I cannot allow these coefficients to become very large, so I am restricting the class of models I am going to be looking at. That is the reason why decreasing the size of β helps; I did not explain this completely last time, so thanks for asking the question. We just have to make sure that λ is positive, since Lagrange multipliers have to be positive and so on. So now I can go ahead and minimize this.

13:51

A couple of things I want to point out now. One thing is, if you notice the penalty here, what do you notice about it? I am not including β0; the sum runs from 1 to p, not from 0 to p. Also note that I explicitly wrote out β0 here and did not squish it into a (p+1)-dimensional coefficient vector, because I am going to treat β0 specially. If I penalize β0, what happens if I move my data up? Say this is my X and Y axis and this is the data I had; it is a univariate regression problem, Y is my response and X is my input, and I have to fit a line. Now take the same data points and shift them up; shifting the data points up is hard to draw, so I will just shift the origin instead. If I shift the origin, what happens if I penalize β0? Penalizing β0 will try to keep the intercept small. Earlier, if you look at the fit, it passed very close to the origin and the intercept was close to zero. Now that I have shifted the data, the penalty still tries to keep the intercept small, so instead of the line simply shifting up, its slope will change; it is the same data, just shifted up a little, but the fitted line will tilt because I am penalizing β0. We do not want that to happen: simple shifts in the data should not change the fit. So we do not penalize β0. Does that make sense? And anyway we know what β0 should be; it should be the average of the outputs.

16:38

One way we can get rid of β0 from this optimization problem is to center the inputs. We subtract the average from the Yi's, and likewise we subtract the column averages from all the X's, so all the X variables are centered on zero. This gives me a centered input, and then I just do the regression on this centered input and there is no β0. From now on, when I write X it is an n×p matrix where the inputs have been centered. Essentially what I have done is take my data and translate it so that whatever fit I get will pass through the origin, and I will go back and add β0 later to recover the original fit. Does that make sense? Good.

18:40

In matrix form I write it like this; you can minimize it, take the derivative, set it to zero, and solve, and you will get this. Here both my X and Y are centered: I subtracted the mean from Y and the mean from each column of X. Once I have these centered values I can solve for the ridge estimates β hat for coefficients 1 to p, and I estimate β0 as Ȳ, which gives me the full solution. Is that fine?
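[Board equations, reconstructed in standard notation: the matrix form of the penalized criterion on centred X and y, and its solution.]

    \text{RSS}(\lambda) = (y - X\beta)^{\top}(y - X\beta) + \lambda\,\beta^{\top}\beta ,
    \qquad
    \hat{\beta}^{\text{ridge}} = (X^{\top}X + \lambda I)^{-1} X^{\top} y ,
    \qquad
    \hat{\beta}_0 = \bar{y} .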

19:58

One thing I forgot to point out earlier: remember I had this variable t, the upper bound, where I said the minimization is subject to the constraint that the squared norm should not be larger than t. The t has vanished, but you can show that this λ and t are related; for every choice of t there is a corresponding choice of λ. Typically you choose an appropriate λ and work with it, and you do not worry about the t formulation. Any questions on this?

20:32

So this tells you why it is called ridge regression: what you have essentially done is add a ridge to your data matrix. You take XᵀX and then add λ times the identity, which is like adding a ridge of size λ to the diagonal elements of XᵀX. That is why it is called ridge regression. So why are we doing this, and can you see one advantage of adding this λI term? The whole thing becomes invertible. As soon as I add λI, I am sure the matrix is non-singular; even if XᵀX was originally singular, adding λI makes it non-singular and invertible.

21:32

In fact, this was the original motivation for ridge regression. Back in, I forget, the 50s, when people came up with ridge regression, the original motivation was that XᵀX could be badly conditioned even if it is non-singular; we talked about this in the last class. Some variables may be so highly correlated that even if the matrix is invertible, numerically you will get into problems; I told you the residual might be very small, so when you try to fit the coefficients you run into trouble. Numerically the inversion might be a problem even if the matrix is non-singular, but by adding this λI term you make sure it is invertible, and by controlling the size of λ you can make sure the problem is also numerically well behaved. So the original motivation for ridge regression was essentially to make the problem solvable in the first place. But then people went back and understood ridge regression in terms of shrinkage, or variance reduction, and since that makes it convenient to talk about a whole class of shrinkage problems, we motivate ridge regression from the viewpoint of shrinkage rather than the inversion problem.

23:07

Any questions? I am going to encourage you to read the discussion that follows ridge regression in the book. It requires you to work out some things along with the book; you cannot just sit there and passively read it, but it draws a lot more connections from ridge regression to a variety of other statistical properties of the data which will be useful to know, and I will ask you questions on it later. So go and read the discussion. So, the next thing.

