Week 3 Lecture 14 Partial Least Squares

Machine Learning - Balaraman Ravindran
4 Aug 2021 · 14:34

Summary

TL;DR: This lecture continues the discussion of regression methods, focusing on linear regression techniques such as subset selection and shrinkage methods, including ridge regression and the lasso. It then introduces derived directions with principal component regression and partial least squares (PLS), emphasizing that PLS, unlike PCR, uses both the input and the output data. The lecture explains how the PLS directions are constructed, why their orthogonality reduces the fit to a sequence of univariate regressions, and concludes with how PLS is used for prediction and how it relates to the original least squares fit, especially when the inputs are orthogonal.

Takeaways

  • 📚 The lecture continues the discussion on linear regression methods, focusing on subset selection, shrinkage methods, and derived directions.
  • 🔍 Subset selection methods include forward selection, backward selection, and stepwise selection, which involve choosing subsets of explanatory variables.
  • 🔧 Shrinkage methods such as ridge regression and the lasso address overfitting by shrinking coefficient estimates toward zero (the lasso can set some exactly to zero).
  • 🌐 Derived directions encompass principal component regression (PCR) and partial least squares (PLS), which are methods to find new directions for regression analysis.
  • 🤔 The motivation for PLS is to address the limitation of PCR, which does not consider the relationship between input data and output data.
  • ⚖️ Before applying PLS, it's assumed that the input data (X) is standardized and the output data (Y) is centered, ensuring no variable dominates due to its scale.
  • 📈 The first derived direction (z1) in PLS is found by summing the individual contributions of each variable in explaining the output variable Y.
  • 🔄 The process of PLS involves orthogonalization: each new direction (z) is made orthogonal to the previous ones, so each coefficient can be found by a simple univariate regression (a code sketch of the full procedure follows this list).
  • 🔢 The derived directions (Z1, Z2, Z3, etc.) balance high variance in the input space with high correlation with the output variable Y.
  • 🔮 Once the derived directions and their coefficients (θ) are determined, predictions can be made without directly constructing the Z directions for new data.
  • 🔄 The final model in PLS can be derived from the θ coefficients, allowing for direct coefficients for the original variables (X) to be computed.
  • 📉 If the original variables (X) are orthogonal, PLS will stop after the first step, as there will be no additional information to extract from the residuals.
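
To make the takeaways above concrete, here is a minimal NumPy sketch of the PLS construction described in the lecture: standardize X, center Y, build each derived direction by summing univariate projections, regress Y on it, and orthogonalize the inputs before repeating. The function name, arguments, and stopping check are my own choices for illustration, not code from the lecture.

```python
import numpy as np

def pls_directions(X, y, n_components):
    """Sketch of PLS: returns the derived directions Z, their
    coefficients theta, and the fitted values y_hat."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize inputs: zero mean, unit variance
    y = y - y.mean()                               # center the output
    Xm = X.copy()                                  # current, progressively orthogonalized inputs
    y_hat = np.zeros_like(y)
    Z, thetas = [], []
    for _ in range(n_components):
        phi = Xm.T @ y                             # phi_j = <x_j, y>: univariate contribution of each input
        z = Xm @ phi                               # z_m = sum_j phi_j * x_j: next derived direction
        if np.allclose(z, 0.0):                    # nothing left to extract (e.g. orthogonal inputs)
            break
        theta = (z @ y) / (z @ z)                  # univariate regression of y on z_m
        y_hat = y_hat + theta * z
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))  # orthogonalize every x_j with respect to z_m
        Z.append(z)
        thetas.append(theta)
    return np.column_stack(Z), np.array(thetas), y_hat
```

If n_components equals p, the number of inputs, y_hat coincides with the ordinary least squares fit on the standardized inputs, which is the point made near the end of the lecture; fewer components give a different fit.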

Q & A

  • What are the three classes of methods discussed in the script for linear regression?

    -The three classes of methods discussed are subset selection, shrinkage methods, and derived directions.

  • What is the main difference between principal component regression (PCR) and partial least squares (PLS)?

    -The main difference is that PCR only considers the input data (X) and its variance, while PLS also takes into account the output data (Y) and the correlation with the input data.

  • What assumptions are made about the data before applying partial least squares?

    -It is assumed that the output variable Y is centered and the input variables are standardized, meaning each column has a 0 mean and unit variance.

  • How is the first derived direction (z1) in PLS computed?

    -The first derived direction (z1) is computed by projecting Y onto each input Xj separately (giving a component along that Xj direction) and then summing all of these projections into a single direction.

  • What is the purpose of orthogonalization in the context of PLS?

    -Orthogonalization regresses each current input xj on the latest derived direction and subtracts the fit, producing new inputs (xj2) that are orthogonal to z1, z2, ..., so that each subsequent coefficient can be found by a simple univariate regression, without adjusting for previously used variables.

  • How does PLS balance the variance in the input space and the correlation with the output variable?

    -PLS finds directions in X that have high variance and also high correlation with Y, effectively balancing both through an objective function.

  • What happens when you perform PLS on data where the original variables (X) are already orthogonal?

    -If the original variables are orthogonal, PLS stops after one step: once each Xj is orthogonalized against z1, no residual correlation with Y remains, so the first step already recovers the least squares fit. (A small numerical check of this appears right after this Q&A section.)

  • How many derived directions (Z) can be obtained from PLS, and what does this imply for the fit of the data?

    -You can keep deriving directions up to p, the number of inputs. If all p PLS directions are used, the fit is exactly as good as the original least squares fit; using fewer directions gives a different fit.

  • How can the coefficients for the original variables (X) be derived from the coefficients of the derived directions (θ) in PLS?

    -Each derived direction Z is a linear combination of the original variables, so the fitted model (a weighted sum of the Zs with weights θ) is itself linear in the Xs. Unwinding these linear combinations yields coefficients for the original variables directly, which means new data can be predicted without constructing the Z directions.

  • What is the process of constructing derived directions in PLS, and how does it differ from PCR?

    -In PLS, derived directions are constructed by summing the projections of Y on each Xj, which yields directions that balance high variance in the input space with high correlation with the output. PCR, by contrast, maximizes variance in the input space alone, without considering the output.
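
As a small numerical check of the last two answers (my own illustration, not taken from the lecture), we can build an X whose columns are exactly orthonormal via a QR decomposition. After the first PLS step the orthogonalized inputs retain no projection of Y, so the attempted second direction collapses to zero and the procedure stops; standardization is skipped here because only the orthogonality of the columns matters for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(100, 4)))   # inputs with orthonormal (hence orthogonal) columns
y = Q @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)
y = y - y.mean()                                 # centered output

z1 = Q @ (Q.T @ y)                               # first derived direction: sum of projections of y on each column
Q2 = Q - np.outer(z1, (z1 @ Q) / (z1 @ z1))      # orthogonalize each column with respect to z1
z2 = Q2 @ (Q2.T @ y)                             # attempted second direction
print(np.allclose(z2, 0.0))                      # True: PLS stops after one step when the inputs are orthogonal
```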

Outlines

00:00

📊 Introduction to Linear Regression Techniques

The speaker continues discussing linear regression, focusing on different methods such as subset selection, shrinkage methods, and derived directions. The discussion revisits subset selection, including forward, backward, and stage-wise selection, then moves to shrinkage methods like ridge regression and lasso. The speaker then introduces derived directions, specifically principal component regression (PCR), and explains the motivation behind partial least squares (PLS) as it considers both input and output data, unlike PCR.

05:06

🔄 Projection and Derived Directions in 3D

The speaker explains the projection of the output variable Y on multiple input variables (X1, X2) to derive directions in the context of partial least squares (PLS). The challenge of visualizing this in a 3D space is acknowledged. The speaker contrasts PLS with principal component regression (PCR), noting that while PCR finds directions in X with the highest variance, PLS finds directions in X that are more aligned with the output variable Y. PLS balances variance in the input space with correlation to the output variable.

10:09

🔍 Orthogonalization and Prediction with PLS

The process of orthogonalizing directions in partial least squares (PLS) is discussed, where each derived direction is orthogonal to the previous ones, simplifying univariate regression. The speaker explains how coefficients for the original variables X can be derived from the PLS directions for prediction. If all directions are derived, PLS achieves a fit equivalent to least squares. A thought experiment is presented: if the input variables X are orthogonal initially, PLS would immediately yield the least squares fit after the first direction.

Keywords

💡Linear Regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. In the video, linear regression serves as the foundational concept for more advanced methods like subset selection and shrinkage methods, which are discussed as extensions of this basic technique.

💡Subset Selection

Subset selection in the context of regression analysis refers to the process of choosing a subset of available variables for a regression model. The video mentions methods such as forward selection, backward selection, and stepwise selection, which are all strategies for determining which variables to include in the model to best predict the dependent variable.

💡Shrinkage Methods

Shrinkage methods are techniques used in regression analysis to reduce the variance of parameter estimates by 'shrinking' them towards a predetermined value. The video discusses ridge regression and lasso as examples of shrinkage methods, which aim to prevent overfitting by penalizing large coefficients.

💡Principal Component Regression (PCR)

Principal Component Regression is a technique that combines principal component analysis with linear regression. It involves projecting the original variables into a new set of orthogonal variables (principal components) that capture the most variance in the data. The video points out that PCR only considers the input data (X) and not the output (Y), which can sometimes lead to counterintuitive results.

💡Partial Least Squares (PLS)

Partial Least Squares is a regression technique that finds relationships between the predictors (X) and the response (Y) by extracting components that explain the maximum covariation between X and Y. Unlike PCR, PLS takes into account both the input and output data, making it a more robust method when the relationship between X and Y is important.

💡Orthogonalization

Orthogonalization is the process of making vectors orthogonal to each other, which means they are at right angles and have no correlation. In the video, orthogonalization is used in the context of deriving new directions for regression analysis, ensuring that each new component (like Z1, Z2, etc.) is uncorrelated with the previous ones.
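
In code this step is a single regress-and-subtract. The following NumPy sketch is illustrative (the function name is mine, not from the video); it regresses each column of X on a direction z and removes the fitted part:

```python
import numpy as np

def orthogonalize(X, z):
    """Regress each column of X on the direction z and subtract the fit,
    so every returned column is orthogonal to z."""
    return X - np.outer(z, (z @ X) / (z @ z))
```

Applying this after each derived direction is what lets the later coefficients be found by simple univariate regressions.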

💡Centering and Standardizing

Centering involves adjusting the data so that the mean of each variable is zero, while standardizing adjusts the data so that each variable has a mean of zero and a standard deviation of one. The video emphasizes the importance of centering Y and standardizing the inputs for both PCA and PLS to ensure unbiased and scale-invariant results.
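
A quick sketch of this preprocessing on toy data (the array names and random data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(50, 3))   # toy inputs on an arbitrary scale
y = rng.normal(size=50)                            # toy response

X_std = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize: each column gets zero mean, unit variance
y_ctr = y - y.mean()                               # center the response
```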

💡Projection

In the context of the video, projection refers to the process of finding how much of the response variable (Y) can be explained by each predictor variable (Xj) individually. This is done by projecting Y onto each Xj and then summing these projections to create a derived direction, which is a key step in the PLS method.
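
Assuming X is already standardized and Y centered, the first derived direction is just the sum of these per-variable projections; a minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardized inputs
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=50)
y = y - y.mean()                                   # centered output

phi = X.T @ y    # phi_j = <x_j, y>: how much of y each input explains on its own
z1 = X @ phi     # z1 = sum_j phi_j * x_j: the first derived direction
```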

💡Univariate Regression

Univariate regression is a type of regression analysis that involves only one predictor variable. The video discusses how, after orthogonalizing the predictors, univariate regression can be used to find the relationship between Y and each derived direction (Z) individually, simplifying the analysis.
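
Because each derived direction is orthogonal to the earlier ones, its coefficient reduces to a single inner-product ratio; a tiny helper (the name is mine) makes this explicit:

```python
import numpy as np

def univariate_coef(z, y):
    """Coefficient theta obtained by regressing y on a single direction z."""
    return (z @ y) / (z @ z)
```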

💡Coefficients

In regression analysis, coefficients are the numerical values that multiply the predictor variables to determine the predicted value of the response variable. The video explains how coefficients for the derived directions (Z1, Z2, etc.) are calculated and how they can be used to derive coefficients for the original variables (X) in PLS.
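
One way to carry out the "linear computations" the lecture alludes to is to track, for each working input, its expression in the original (standardized) X coordinates; the accumulated coefficients then apply to X directly. This is a hedged sketch under the assumption that X is already standardized and y centered, not the lecturer's own derivation:

```python
import numpy as np

def pls_beta(X, y, n_components):
    """Recover coefficients beta for the original standardized inputs
    from the PLS directions and their theta coefficients."""
    p = X.shape[1]
    Xm = X.copy()                    # current, progressively orthogonalized inputs
    C = np.eye(p)                    # column j expresses the current x_j in original X coordinates
    beta = np.zeros(p)
    for _ in range(n_components):
        phi = Xm.T @ y
        z = Xm @ phi                 # derived direction z_m
        if np.allclose(z, 0.0):
            break
        r = C @ phi                  # z_m = X @ r in the original coordinates
        theta = (z @ y) / (z @ z)    # univariate regression of y on z_m
        beta += theta * r            # accumulate beta_hat = sum_m theta_m * r_m
        proj = (z @ Xm) / (z @ z)    # regression of each current x_j on z_m
        Xm -= np.outer(z, proj)      # orthogonalize the inputs ...
        C -= np.outer(r, proj)       # ... and keep their original-coordinate expressions in sync
    return beta
```

New (standardized) data can then be predicted as X_new @ beta without constructing the Z directions again; with n_components equal to p this beta matches the ordinary least squares coefficients.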

💡Overfitting

Overfitting occurs when a statistical model is too complex and captures noise in the training data, leading to poor generalization to new data. The video mentions shrinkage methods like ridge regression and lasso as ways to prevent overfitting by penalizing large coefficients and simplifying the model.

Highlights

Continuation of the discussion on linear regression methods.

Introduction to subset selection methods including forward, backward, and stepwise selection.

Exploration of shrinkage methods like ridge regression and lasso.

Introduction to derived directions starting with principal component regression.

The limitation of principal component regression in not considering the output data.

Assumption of centered Y and standardized inputs for both PCA and partial least squares.

Process of creating derived directions by projecting Y on Xj and summing the projections.

Explanation of how to find the first derived direction z1 by summing univariate contributions.

The concept of using Y in the regression to find derived directions.

Distinguishing partial least squares from PCR by considering the output variable Y.

Demonstration of how to orthogonalize by regressing xj on z1.

Iterative process of finding new directions xj2 and their corresponding z directions.

Orthogonality of derived directions ensuring univariate regression can be performed.

Derivation of coefficients for the original variables X from the derived directions Z.

The equivalence of the fit obtained by using p PLS directions to the original least squares fit.

Implication of orthogonal X variables on the PLS method, potentially stopping after one step.

Concluding the discussion on regression methods with insights into partial least squares.

Transcripts

play00:00

 Okay so we will continue from where we left off  

play00:19

as I promised right so we are looking at linear  regression and we looked at subset selection and  

play00:34

then we looked at the shrinkage methods and then,  finally we came to derived directions all right I  

play00:42

said there are three classes of methods so we  are looking at a couple of examples of each of  

play00:47

those classes of methods the first one we looked  at was subset selection so we looked at forward,  

play00:51

backward selection and stagewise selection and stepwise selection and all that and then we looked

play00:55

at shrinkage methods where we looked at ridge  regression and lasso and then we started looking  

play01:00

at derived directions right where we looked at  principal component regression I said the next one  

play01:08

we look at is partial least squares and I gave you the motivation for looking at partial least

play01:12

squares it is mainly because principal component  regression only looks at the input data okay,  

play01:18

does not pay attention to the output right and  therefore you might sometimes come up with really  

play01:25

counterintuitive directions like an example  I showed you with the +1 and -1 outputs okay,  

play01:30

so the basic idea here is that we are  going to use the Y also right.  

play01:36

Just like the usual case I am going to assume  that Y is centered right. And I am also going  

play01:47

to assume that the inputs are standardized. This  is something which you have to do for both PCA and  

play01:57

partial least squares essentially assume that each column right it is going to have 0 mean unit

play02:04

variance right on the data that is given to you  make it 0 mean unit variance, so that you are not  

play02:10

having any magnitude related effects on the output  okay, so what I am going to do is the following if  

play02:21

you remember how we did orthogonalization earlier  something very similar so I am going to look at  

play02:40

so I am going to look at the projection of Y  on Xj right then I am going to create a derived  

play02:56

direction which essentially sums up all of these  projections right I have computed basically I am  

play03:17

computing the projection of Y on xj right,  so this is essentially the direction is a  

play03:22

vectorized version of it then I am going to sum  all of this up so essentially what I am doing  

play03:26

here is I am looking at each variable in turn  I take each Xj in turn okay I am seeing what is  

play03:34

the effect on Y right, so how much of Y I am able  to explain just by taking Xj alone and I am using  

play03:44

all of that I am combining that and making  that as my single direction so individually  

play03:49

taking each one of these all by itself okay. Individually taking each direction by itself how

play03:54

much of Y can I explain and that becomes my  first derived direction that is my z1 okay.  

play04:20

So that is the coefficient for z1 in my  regression fit eventual regression fit  

play04:25

okay that is the coefficient for that one you  can see what it is like so I have taken Y and  

play04:29

regressed it on z1 and that essentially gives me the coefficient for z1 right so how do I go

play04:37

on to find it okay, so I am looking at how much of Y is along each direction Xj right so in some sense

play04:50

you can think of it as if I have one variable  Xj right how much of Y can be explained with  

play04:56

that one variable xj okay I am looking at that and  then my first direction z1 is essentially summing  

play05:05

those univariate contributions over all my input directions. Suppose I have two input directions

play05:23

Unfortunately I have to do this in 3d, suppose I have two input directions so what I am going

play05:31

to do is I am going to take my Y right, so  project it on x1 alone first right project  

play05:45

it on x1 alone and on x2 alone right we redo  that this is tricky to do this in 3d but any  

play06:20

way right. No it is going to be hard to do it on the  

play06:27

board pictorially for you okay I am not going to do this, so I really need to, actually I have to

play06:31

plot a function Y right, I cannot just do it with single data points of Y, that does not make sense,

play06:36

so I actually have to get to a surface Y on x1,  x2 and then talk about the projection so that  

play06:41

is going to be hard right, but the basic idea is  I take Y right I find the projection of Y along  

play06:48

x1 okay then I find the projection of Y along x2  okay now I am going to take the sum of these two  

play06:54

okay and whatever is the resulting direction and  I am going to use that as my first direction.  

play06:59

Yes, you see, in PCR what we did was we first found directions in X which had the highest variance; here

play07:17

we are not finding directions in X with the  highest variance but we are finding directions  

play07:20

in X right, in some sense components of X which have more in the direction of the output variable

play07:27

Y right, so eventually you can show that which  you are not going to do but you can show that the  

play07:33

directions you pick, that Z1, Z2, Z3 that you pick, are those which have high variance in the

play07:41

input space. But also have a high correlation with Y right it is actually an objective

play07:48

function which tries to balance correlation with Y and variance in the input space, while PCA, that

play07:57

is only variance; PCR does only variance in the input space, does not worry about the correlation

play08:02

but partial least squares you can show that it  actually worries about the correlation as well  

play08:08

right. We find the first coordinate, now what do you do, you orthogonalize, so what should I do now

play08:40

I should regress x1, so what should I be doing now, I should regress x1, like each xj, on z1 right.

play08:57

This is how we did the orthogonalization earlier  right, so you find one direction okay then you  

play09:01

regress everything else on that direction then  subtract from it that gives you the orthogonal  

play09:07

direction right, so essentially that is what  you are doing here the expressions look big  

play09:33

but then if you have been following the material  from the previous classes then it is essentially  

play09:39

that we are just reusing the univariate regression construction we had earlier right.

play09:43

So now I have a new set of directions which I call xj2 right, xj1 was the original xj

play09:51

I started off with, now I have a new set of directions which we will call xj2 and then

play09:58

I can keep repeating the whole process, I can take Y projected along xj2 right and then

play10:08

combine that and get Z2 and then regress Y  on Z2 to get θ2 right, so I can keep doing  

play10:17

this until I get as many directions as I want  all right so what is the nice thing about Z1,  

play10:25

Z2 and the other Zs, they themselves will be orthogonal because they are being constructed by individual

play10:32

vectors which are orthogonal with respect to all the previous Zs that we have right.

play10:38

Each one will be orthogonal and therefore I can  essentially do univariate regression so I do not  

play10:42

have to worry about accommodating the previous variable, so when I want to fit the ZK

play10:47

I can just do a univariate regression of Y on that ZK and I will get the coefficient θK okay is

play10:53

it fine, great. So once I get these θ1 to θK how do I use it for prediction, can I just do

play11:08

like Xβ, like in Xβ, can I do Xθ? And no, what should I do, well so I can do Zθ, I mean

play11:30

I am sorry I can do θZ and predict it but then I do not really want to construct these Z directions

play11:38

for every vector that I am going to get so I do not want to project it along those Z directions,

play11:43

so instead of that what I can do if you think  about it each of those Zs is actually composed  

play11:48

of the original variables X right. So I can  compute the θ and then I can just go back  

play11:58

and derive coefficients for the Xs directly  because all of these are linear computations  

play12:04

all I need to do is essentially figure out how I am going to stack all the thetas so that I can

play12:11

derive the coefficients for the Xs okay think  about it you can do it as a short exercise but  

play12:16

I can eventually come up and write it, right. So here I can derive these coefficients β hat

play12:35

from these θs right so I will derive θ1, θ2, θ3  so on so forth I can just go back and do this  

play12:43

computation so you will have to think about it  very easy you can work it out and figure out  

play12:47

what the number should be right and what is the ‘m’ there, that is the number of directions I

play12:57

actually derive the number of directions I derive  from the PLS right so here the first direction I  

play13:04

can keep going suppose I derive p directions what  can you tell me about the fit for the data if I  

play13:25

get p PLS directions it essentially means that  I will get as good a fit as the original least  

play13:31

squares fit right so I essentially get the same  fit as least squares fit okay so anything lesser  

play13:37

than that is going to give me something different  from the least squares fit okay here is a thought  

play13:42

question: if my X are originally orthogonal to begin with, if the X were actually orthogonal to begin

play13:49

with what will happen with PLS Z will be the  same as Xs right and what will happen to Z2 can  

play14:11

I do the Z2 no right PLS will stop after one  step because there will be no residuals after  

play14:19

that right so I will essentially get my least  squares fit in the first attempt itself okay  

play14:24

so that is essentially what will happen right  so we will stop with regression methods.


Related Tags
Linear Regression, Subset Selection, Shrinkage Methods, Principal Component, Partial Least Squares, Regression Analysis, Data Science, Machine Learning, Statistical Techniques, Predictive Modeling