Week 3 Lecture 14 Partial Least Squares
Summary
TLDR: This video script delves into regression methods, focusing on linear regression techniques such as subset selection and shrinkage methods, including ridge regression and lasso. It introduces derived directions with principal component regression and partial least squares (PLS), emphasizing PLS's unique approach that considers both input and output data. The script explains the process of constructing PLS directions and their orthogonality, leading to univariate regressions without multicollinearity issues. It concludes with the implications of using PLS for prediction and how it relates to the original least squares fit, especially when dealing with orthogonal inputs.
Takeaways
- 📚 The lecture continues the discussion on linear regression methods, focusing on subset selection, shrinkage methods, and derived directions.
- 🔍 Subset selection methods include forward selection, backward selection, and stepwise selection, which involve choosing subsets of explanatory variables.
- 🔧 Shrinkage methods such as ridge regression and lasso are introduced to address the issue of overfitting by shrinking the coefficients of less important variables.
- 🌐 Derived directions encompass principal component regression (PCR) and partial least squares (PLS), which are methods to find new directions for regression analysis.
- 🤔 The motivation for PLS is to address the limitation of PCR, which does not consider the relationship between input data and output data.
- ⚖️ Before applying PLS, it's assumed that the input data (X) is standardized and the output data (Y) is centered, ensuring no variable dominates due to its scale.
- 📈 The first derived direction (z1) in PLS is found by summing the individual contributions of each variable in explaining the output variable Y.
- 🔄 The process of PLS involves orthogonalization, where each new direction (z) is made orthogonal to the previous ones, allowing for univariate regression (a code sketch of this construction follows this list).
- 🔢 The derived directions (Z1, Z2, Z3, etc.) are constructed to balance high variance in the input space with high correlation with the output variable Y.
- 🔮 Once the derived directions and their coefficients (θ) are determined, predictions can be made without directly constructing the Z directions for new data.
- 🔄 The final model in PLS can be derived from the θ coefficients, allowing for direct coefficients for the original variables (X) to be computed.
- 📉 If the original variables (X) are orthogonal, PLS will stop after the first step, as there will be no additional information to extract from the residuals.
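As a reading aid for the takeaways above, here is a minimal NumPy sketch of the PLS construction being described (an illustration based on the standard algorithm, not code shown in the lecture; the name pls_directions and the argument n_directions are made up for the example). It assumes the columns of X are already standardized and y is centered:

```python
import numpy as np

def pls_directions(X, y, n_directions):
    """Sketch of the PLS construction: derived directions z_m and coefficients theta_m.

    Assumes the columns of X are standardized and y is centered.
    """
    X_work = X.astype(float).copy()      # working copy; columns get orthogonalized as we go
    Z, thetas = [], []
    for _ in range(n_directions):
        phi = X_work.T @ y               # phi_j = <x_j, y>: univariate contribution of each x_j
        z = X_work @ phi                 # z_m = sum_j phi_j * x_j: sum of the projections
        if np.allclose(z, 0):            # nothing left to extract from the remaining columns
            break
        theta = (z @ y) / (z @ z)        # univariate regression of y on z_m
        Z.append(z)
        thetas.append(theta)
        # make every remaining column orthogonal to z_m before the next pass
        X_work -= np.outer(z, (z @ X_work) / (z @ z))
    return np.column_stack(Z), np.array(thetas)
```

Because each new z_m is built only from columns that have already been orthogonalized against the earlier directions, the z_m come out mutually orthogonal, which is why a separate univariate regression of y on each z_m is enough.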
Q & A
What are the three classes of methods discussed in the script for linear regression?
-The three classes of methods discussed are subset selection, shrinkage methods, and derived directions.
What is the main difference between principal component regression (PCR) and partial least squares (PLS)?
-The main difference is that PCR only considers the input data (X) and its variance, while PLS also takes into account the output data (Y) and the correlation with the input data.
What assumptions are made about the data before applying partial least squares?
-It is assumed that the output variable Y is centered and the input variables are standardized, meaning each column has a 0 mean and unit variance.
How is the first derived direction (z1) in PLS computed?
-The first derived direction (z1) is computed by regressing Y on each Xj separately, taking the resulting projection of Y along each Xj as a vector, and summing these vectors into a single direction (z1 = Σj ⟨xj, y⟩ xj).
What is the purpose of orthogonalization in the context of PLS?
-Orthogonalization is used to find new directions (xj2) that are orthogonal to the previous derived directions (z1, z2, ...), allowing for univariate regression without considering the influence of previous variables.
How does PLS balance the variance in the input space and the correlation with the output variable?
-PLS finds directions in X that have high variance and also high correlation with Y, effectively balancing both through an objective function.
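In the standard textbook treatment of PLS (Hastie, Tibshirani and Friedman), which the lecture appears to follow, this balancing act can be written as an explicit criterion; the formula below is a reconstruction from that reference rather than something displayed in the video. The m-th PLS weight vector solves

```latex
\hat{\varphi}_m \;=\; \arg\max_{\alpha}\; \operatorname{Corr}^2(\mathbf{y}, \mathbf{X}\alpha)\,\operatorname{Var}(\mathbf{X}\alpha)
\qquad \text{subject to } \|\alpha\| = 1,\;\; \alpha^{\top}\mathbf{S}\,\hat{\varphi}_{\ell} = 0,\;\; \ell = 1,\dots,m-1,
```

where S is the sample covariance matrix of the inputs. PCR instead maximizes Var(Xα) alone, which is exactly why it can ignore the output.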
What happens when you perform PLS on data where the original variables (X) are already orthogonal?
-If the original variables are orthogonal, PLS stops after one step: the first derived direction already reproduces the least squares fit, the residual correlations with Y are zero, and so no further directions can be extracted.
How many derived directions (Z) can be obtained from PLS, and what does this imply for the fit of the data?
-You can obtain up to p derived directions from PLS, one per input variable. If you use all p PLS directions, you get exactly as good a fit as the original least squares fit; using fewer gives something different from the least squares fit.
How can the coefficients for the original variables (X) be derived from the coefficients of the derived directions (θ) in PLS?
-The coefficients for the original variables (X) can be derived from the coefficients of the derived directions (θ) by linear computations: each derived direction Z is itself a linear combination of the original X columns, so the θs can be folded back into a single coefficient vector for X.
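The answer above leaves the bookkeeping implicit, so here is one hedged way to carry it out in NumPy (an illustration, not necessarily the stacking the lecturer has in mind; pls_coefficients is a made-up name). The idea is to track, for each derived direction, a weight vector w_m with z_m = X w_m on the original standardized columns, so that the final coefficients are β̂ = Σ_m θ_m w_m:

```python
import numpy as np

def pls_coefficients(X, y, n_directions):
    """Return coefficients on the original standardized columns of X.

    Tracks each derived direction as z_m = X @ w_m, so the final coefficient
    vector is simply beta = sum_m theta_m * w_m.
    """
    p = X.shape[1]
    R = np.eye(p)                    # invariant: X_work == X @ R
    beta = np.zeros(p)
    X_work = X.astype(float).copy()
    for _ in range(n_directions):
        phi = X_work.T @ y           # univariate contributions on the current columns
        w = R @ phi                  # weights of z on the *original* columns
        z = X @ w                    # same vector as X_work @ phi
        if np.allclose(z, 0):
            break
        theta = (z @ y) / (z @ z)    # regress y on z
        beta += theta * w            # fold this direction into the X coefficients
        c = (z @ X_work) / (z @ z)
        X_work -= np.outer(z, c)     # orthogonalize the working columns against z
        R -= np.outer(w, c)          # keep the invariant X_work == X @ R
    return beta
```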
What is the process of constructing derived directions in PLS, and how does it differ from PCR?
-In PLS, derived directions are constructed by summing the projections of Y on each Xj, giving directions that balance high variance in the input space with high correlation with the output. This differs from PCR, which only maximizes variance in the input space without considering the output.
Outlines
📊 Introduction to Linear Regression Techniques
The speaker continues discussing linear regression, focusing on different methods such as subset selection, shrinkage methods, and derived directions. The discussion revisits subset selection, including forward, backward, and stage-wise selection, then moves to shrinkage methods like ridge regression and lasso. The speaker then introduces derived directions, specifically principal component regression (PCR), and explains the motivation behind partial least squares (PLS) as it considers both input and output data, unlike PCR.
🔄 Projection and Derived Directions in 3D
The speaker explains the projection of the output variable Y on multiple input variables (X1, X2) to derive directions in the context of partial least squares (PLS). The challenge of visualizing this in a 3D space is acknowledged. The speaker contrasts PLS with principal component regression (PCR), noting that while PCR finds directions in X with the highest variance, PLS finds directions in X that are more aligned with the output variable Y. PLS balances variance in the input space with correlation to the output variable.
🔍 Orthogonalization and Prediction with PLS
The process of orthogonalizing directions in partial least squares (PLS) is discussed, where each derived direction is orthogonal to the previous ones, simplifying univariate regression. The speaker explains how coefficients for the original variables X can be derived from the PLS directions for prediction. If all directions are derived, PLS achieves a fit equivalent to least squares. A thought experiment is presented: if the input variables X are orthogonal initially, PLS would immediately yield the least squares fit after the first direction.
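To make the prediction step in this outline concrete, here is a small hedged usage example (it reuses the pls_coefficients sketch given earlier on this page; the synthetic data and the choice of two directions are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 4))
y_raw = X_raw @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

x_mean, x_std = X_raw.mean(axis=0), X_raw.std(axis=0)
X = (X_raw - x_mean) / x_std                   # standardize the inputs
y_mean = y_raw.mean()
y = y_raw - y_mean                             # center the output

beta = pls_coefficients(X, y, n_directions=2)  # from the sketch above
X_new = (rng.normal(size=(5, 4)) - x_mean) / x_std
y_pred = y_mean + X_new @ beta                 # predict without forming any Z directions
```

Because the coefficients live on the original standardized variables, new observations only need to be standardized with the training means and scales before applying them.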
Keywords
💡Linear Regression
💡Subset Selection
💡Shrinkage Methods
💡Principal Component Regression (PCR)
💡Partial Least Squares (PLS)
💡Orthogonalization
💡Centering and Standardizing
💡Projection
💡Univariate Regression
💡Coefficients
💡Overfitting
Highlights
Continuation of the discussion on linear regression methods.
Introduction to subset selection methods including forward, backward, and stepwise selection.
Exploration of shrinkage methods like ridge regression and lasso.
Introduction to derived directions starting with principal component regression.
The limitation of principal component regression in not considering the output data.
Assumption of centered Y and standardized inputs for both PCA and partial least squares.
Process of creating derived directions by projecting Y on Xj and summing the projections.
Explanation of how to find the first derived direction z1 by summing univariate contributions.
The concept of using Y in the regression to find derived directions.
Distinguishing partial least squares from PCR by considering the output variable Y.
Demonstration of how to orthogonalize by regressing xj on z1.
Iterative process of finding new directions xj2 and their corresponding z directions.
Orthogonality of derived directions ensuring univariate regression can be performed.
Derivation of coefficients for the original variables X from the derived directions Z.
The equivalence of the fit obtained by using p PLS directions to the original least squares fit.
Implication of orthogonal X variables on the PLS method, potentially stopping after one step.
Concluding the discussion on regression methods with insights into partial least squares.
Transcripts
Okay, so we will continue from where we left off, as I promised. We are looking at linear regression: we looked at subset selection, then we looked at the shrinkage methods, and then finally we came to derived directions. I said there are three classes of methods, so we are looking at a couple of examples of each of those classes. The first one we looked at was subset selection, where we looked at forward selection, backward selection, stagewise selection, stepwise selection and all that. Then we looked at shrinkage methods, where we looked at ridge regression and lasso, and then we started looking at derived directions, where we looked at principal component regression. I said the next one we look at is partial least squares, and I gave the motivation for looking at partial least squares: it is mainly because principal component regression only looks at the input data and does not pay attention to the output, and therefore you might sometimes come up with really counterintuitive directions, like the example I showed you with the +1 and -1 outputs. So the basic idea here is that we are going to use the Y also.
Just like in the usual case, I am going to assume that Y is centered, and I am also going to assume that the inputs are standardized. This is something you have to do for both PCA and partial least squares: essentially assume that each column has zero mean and unit variance. On the data that is given to you, make it zero mean and unit variance, so that you are not having any magnitude-related effects on the output. So what I am going to do is the following. If you remember how we did orthogonalization earlier, this is something very similar. I am going to look at the projection of Y on each Xj, and then I am going to create a derived direction which essentially sums up all of these projections. I am computing the projection of Y on each xj, so the direction is essentially a vectorized version of it, and then I am going to sum all of this up. So essentially what I am doing here is looking at each variable in turn: I take each Xj in turn and see what the effect on Y is, how much of Y I am able to explain just by taking Xj alone, and I combine all of that into my single direction. Individually taking each direction by itself, how much of Y can I explain? That becomes my first derived direction; that is my z1.
So that is the coefficient for z1 in my eventual regression fit. You can see what it is like: I have taken Y and regressed it on z1, and that essentially gives me the coefficient for z1. So how do I go on to find it? I am looking at how much of Y is along each direction Xj. In some sense you can think of it as: if I have one variable Xj, how much of Y can be explained with that one variable alone? I am looking at that, and then my first direction z1 is essentially summing those univariate contributions over all my input directions. Suppose I have two input directions; unfortunately I then have to draw this in 3D. What I am going to do is take my Y and project it on x1 alone first, and then on x2 alone. This is tricky to do in 3D, but anyway. It is going to be hard to do it on the board pictorially for you, so I am not going to do this: I would actually have to plot a function Y, I cannot just do it with single data points because that does not make sense, so I would actually have to draw a surface of Y over x1 and x2 and then talk about the projection, and that is going to be hard. But the basic idea is: I take Y, I find the projection of Y along x1, then I find the projection of Y along x2, and now I take the sum of these two, and whatever the resulting direction is, I am going to use that as my first direction.
In PCR, what we did was first find the directions in X which had the highest variance. Here we are not finding directions in X with the highest variance; we are finding directions in X, in some sense components of X, which have more in the direction of the output variable Y. Eventually you can show, which we are not going to do, that the directions you pick, the Z1, Z2, Z3, are those which have high variance in the input space but also have a high correlation with Y. It is actually an objective function which tries to balance correlation with Y and variance in the input space, whereas PCR does only variance in the input space and does not worry about the correlation; for partial least squares you can show that it actually worries about the correlation as well. We found the first coordinate; now what do you do? You orthogonalize. So what should I be doing now? I should regress x1, or rather each xj, on z1. This is how we did the orthogonalization earlier: you find one direction, then you regress everything else on that direction and subtract, and that gives you the orthogonal direction. Essentially that is what you are doing here. The expressions look big, but if you have been following the material from the previous classes, then we are essentially just reusing the univariate regression construction we had earlier.
So now I have a new set of directions which I call xj2; the xj1 were the original xjs I started off with, and now I have a new set of directions which we will call xj2. Then I can keep repeating the whole process: I can take Y projected along each xj2, combine that to get Z2, and then regress Y on Z2 to get θ2. I can keep doing this until I get as many directions as I want. So what is the nice thing about Z1, Z2 and so on? They themselves will be orthogonal, because they are being constructed from individual vectors which are orthogonal with respect to all the previous Zs that we have. Each one will be orthogonal, and therefore I can essentially do univariate regression; I do not have to worry about accommodating the previous variables. When I want to fit on Zk, I can just do a univariate regression of Y on that Zk, and I will get the coefficient θk. Is it fine? Great. So once I get these θ1 to θK, how do I use them for prediction? Can I just do something like Xβ, can I do Xθ? No. What should I do? I can do Zθ and predict with it, but then I do not really want to construct these Z directions for every new vector that I am going to get; I do not want to project it along those Z directions. Instead, if you think about it, each of those Zs is actually composed of the original variables X. So I can compute the θs and then just go back and derive coefficients for the Xs directly, because all of these are linear computations; all I need to do is essentially figure out how I am going to stack all the θs so that I can derive the coefficients for the Xs. Think about it, you can do it as a short exercise. So I can derive these coefficients β hat from these θs: I derive θ1, θ2, θ3 and so on, and I can just go back and do this computation. You will have to think about it; it is very easy, you can work it out and figure out what the number should be. And what is the 'm' there? That is the number of directions I actually derive from the PLS. Here is the first direction, and I can keep going. Suppose I derive p directions: what can you tell me about the fit for the data? If I get p PLS directions, it essentially means that I will get as good a fit as the original least squares fit; I essentially get the same fit as the least squares fit. Anything less than that is going to give me something different from the least squares fit. Here is a thought question: if my Xs were actually orthogonal to begin with, what will happen with PLS? Z will be the same as the Xs, and what will happen to Z2? Can I do Z2? No: PLS will stop after one step, because there will be no residuals after that, so I will essentially get my least squares fit in the first attempt itself. That is essentially what will happen. So we will stop with regression methods.
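As a closing illustration of that thought experiment (an addition to this page, not part of the lecture), here is a small NumPy check that with orthogonal standardized inputs the first PLS step already reproduces the least squares fit and nothing is left for a second direction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
M = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Q, _ = np.linalg.qr(M)
X = Q[:, 1:] * np.sqrt(n)            # orthogonal columns with zero mean and unit variance
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.05, size=n)
y = y - y.mean()                     # center the output

z1 = X @ (X.T @ y)                   # first derived direction: sum of univariate projections
theta1 = (z1 @ y) / (z1 @ z1)        # regress y on z1
fit_pls1 = theta1 * z1               # PLS fit after one step

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(fit_pls1, X @ beta_ls))      # True: one step recovers the least squares fit

X1 = X - np.outer(z1, (z1 @ X) / (z1 @ z1))    # orthogonalize the inputs against z1
print(np.allclose(X1.T @ y, 0))                # True: z2 would be zero, so PLS stops here
```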