Week 3 Lecture 14 Partial Least Squares

Machine Learning - Balaraman Ravindran
4 Aug 2021 (14:34)

Summary

TL;DR: This lecture continues the treatment of linear regression, recapping subset selection and shrinkage methods (ridge regression and the lasso) before turning to derived directions: principal component regression and partial least squares (PLS). Unlike PCR, PLS uses both the input and the output data. The lecture explains how PLS directions are constructed and orthogonalized so that only univariate regressions are needed, how the resulting coefficients are used for prediction, how the fit relates to the original least squares fit, and what happens when the inputs are orthogonal.

Takeaways

  • The lecture continues the discussion on linear regression methods, focusing on subset selection, shrinkage methods, and derived directions.
  • Subset selection methods include forward selection, backward selection, and stepwise selection, which involve choosing subsets of explanatory variables.
  • Shrinkage methods such as ridge regression and the lasso address overfitting by shrinking regression coefficients.
  • Derived directions encompass principal component regression (PCR) and partial least squares (PLS), which are methods that find new directions for regression analysis.
  • The motivation for PLS is to address the limitation of PCR, which does not consider the relationship between the input data and the output data.
  • Before applying PLS, it is assumed that the input data (X) is standardized and the output data (Y) is centered, ensuring no variable dominates due to its scale.
  • The first derived direction (z1) in PLS is found by summing, over all inputs Xj, the univariate contribution of each Xj in explaining the output variable Y (a short code sketch follows this list).
  • PLS proceeds by orthogonalization: the inputs are regressed on each new direction z and replaced by their residuals, so successive directions are orthogonal and only univariate regressions are needed.
  • The derived directions (Z1, Z2, Z3, ...) balance high variance in the input space with high correlation with the output variable Y.
  • Once the derived directions and their coefficients (θ) are determined, predictions can be made without explicitly constructing the Z directions for new data.
  • The final PLS model can be expressed through the θ coefficients, from which coefficients for the original variables (X) can be computed directly.
  • If the original variables (X) are orthogonal, PLS stops after the first step, as there is no additional information to extract from the residuals.
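
As a companion to the takeaways above, here is a minimal NumPy sketch of the first PLS step, assuming X has already been standardized and y centered; the function and variable names are illustrative, not taken from the lecture.

```python
import numpy as np

def first_pls_direction(X, y):
    """First derived direction z1 and its coefficient theta1 (PLS step 1).

    Assumes the columns of X are standardized (zero mean, unit variance)
    and y is centered, as the lecture requires.
    """
    weights = X.T @ y               # univariate inner products <x_j, y>
    z1 = X @ weights                # z1 = sum_j <x_j, y> x_j
    theta1 = (z1 @ y) / (z1 @ z1)   # univariate regression of y on z1
    return z1, theta1

# Tiny illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
y = rng.standard_normal(50)
y = y - y.mean()                           # center the response
z1, theta1 = first_pls_direction(X, y)
```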

Q & A

  • What are the three classes of methods discussed in the script for linear regression?

    -The three classes of methods discussed are subset selection, shrinkage methods, and derived directions.

  • What is the main difference between principal component regression (PCR) and partial least squares (PLS)?

    -The main difference is that PCR only considers the input data (X) and its variance, while PLS also takes into account the output data (Y) and the correlation with the input data.

  • What assumptions are made about the data before applying partial least squares?

    -It is assumed that the output variable Y is centered and the input variables are standardized, meaning each column has a 0 mean and unit variance.

  • How is the first derived direction (z1) in PLS computed?

    -The first derived direction (z1) is computed by projecting Y onto each Xj individually and summing these univariate projections (z1 ∝ Σj ⟨Xj, Y⟩ Xj), giving a single direction. (A code sketch illustrating this and the later steps follows this Q&A list.)

  • What is the purpose of orthogonalization in the context of PLS?

    -Orthogonalization produces new input directions (xj2) that are orthogonal to the derived directions already extracted (z1, z2, ...), so each subsequent step reduces to a univariate regression that does not need to account for the earlier directions.

  • How does PLS balance the variance in the input space and the correlation with the output variable?

    -PLS finds directions in X that have high variance and also high correlation with Y, effectively balancing both through an objective function.

  • What happens when you perform PLS on data where the original variables (X) are already orthogonal?

    -If the original variables are orthogonal, PLS stops after one step: once z1 has been extracted, nothing correlated with Y remains in the inputs, so the first derived direction already gives the least squares fit.

  • How many derived directions (Z) can be obtained from PLS, and what does this imply for the fit of the data?

    -You can extract up to p derived directions (Z), where p is the number of inputs. If you use all p PLS directions, the fit is exactly the original least squares fit; using fewer directions gives a different fit.

  • How can the coefficients for the original variables (X) be derived from the coefficients of the derived directions (θ) in PLS?

    -Each derived direction Z is a linear combination of the original X columns, so the θs can be mapped back linearly to coefficients β̂ on the Xs. Predictions for new data can then be made directly from X, without constructing the Z directions.

  • What is the process of constructing derived directions in PLS, and how does it differ from PCR?

    -In PLS, derived directions are constructed by summing the projections of Y on each Xj, creating directions that maximize the variance in the input space and the correlation with the output. This differs from PCR, which only maximizes variance in the input space without considering the output.
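
Tying the answers above together, the following NumPy sketch runs the full procedure as described: build each derived direction, regress Y on it to get θ, orthogonalize the inputs, and finally translate the θs back into coefficients β̂ for the original (standardized) X. It is a sketch under the stated assumptions (standardized X, centered y, full-rank inputs), not the lecture's own code; the closing check illustrates that using all p directions reproduces the least squares fit.

```python
import numpy as np

def pls1(X, y, n_components):
    """Partial least squares with a single response (PLS1), as described above.

    Assumes the columns of X are standardized and y is centered.
    Returns coefficients beta for the original (standardized) columns,
    obtained by mapping the theta coefficients of the derived directions
    z_1, ..., z_M back onto the X's.
    """
    n, p = X.shape
    Xc = X.copy()        # working copy; its columns get orthogonalized to each z_m
    R = np.eye(p)        # bookkeeping: Xc == X @ R, so z_m = X @ (R @ w_m)
    beta = np.zeros(p)
    for _ in range(n_components):
        w = Xc.T @ y                     # <x_j^(m-1), y> for every column
        z = Xc @ w                       # derived direction z_m
        zz = z @ z
        if zz < 1e-12:                   # nothing correlated with y is left
            break
        theta = (z @ y) / zz             # univariate regression of y on z_m
        beta += theta * (R @ w)          # z_m = X @ (R @ w), so this adds theta_m * z_m to the fit
        proj = (z @ Xc) / zz             # regression of each column on z_m
        Xc = Xc - np.outer(z, proj)      # orthogonalize the columns against z_m
        R = R - np.outer(R @ w, proj)    # keep the identity Xc == X @ R
    return beta

# Illustrative check: with all p directions, the fit matches least squares.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = rng.standard_normal(100)
y = y - y.mean()

beta_pls = pls1(X, y, n_components=X.shape[1])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(X @ beta_pls, X @ beta_ols))   # expected: True
```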

Outlines

00:00

Introduction to Linear Regression Techniques

The speaker continues discussing linear regression, focusing on different methods such as subset selection, shrinkage methods, and derived directions. The discussion revisits subset selection, including forward, backward, and stage-wise selection, then moves to shrinkage methods like ridge regression and lasso. The speaker then introduces derived directions, specifically principal component regression (PCR), and explains the motivation behind partial least squares (PLS) as it considers both input and output data, unlike PCR.

05:06

Projection and Derived Directions in 3D

The speaker explains the projection of the output variable Y on multiple input variables (X1, X2) to derive directions in the context of partial least squares (PLS). The challenge of visualizing this in a 3D space is acknowledged. The speaker contrasts PLS with principal component regression (PCR), noting that while PCR finds directions in X with the highest variance, PLS finds directions in X that are more aligned with the output variable Y. PLS balances variance in the input space with correlation to the output variable.

10:09

šŸ” Orthogonalization and Prediction with PLS

The process of orthogonalizing directions in partial least squares (PLS) is discussed, where each derived direction is orthogonal to the previous ones, simplifying univariate regression. The speaker explains how coefficients for the original variables X can be derived from the PLS directions for prediction. If all directions are derived, PLS achieves a fit equivalent to least squares. A thought experiment is presented: if the input variables X are orthogonal initially, PLS would immediately yield the least squares fit after the first direction.

Keywords

Linear Regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. In the video, linear regression serves as the foundational concept for more advanced methods like subset selection and shrinkage methods, which are discussed as extensions of this basic technique.

Subset Selection

Subset selection in the context of regression analysis refers to the process of choosing a subset of available variables for a regression model. The video mentions methods such as forward selection, backward selection, and stepwise selection, which are all strategies for determining which variables to include in the model to best predict the dependent variable.

Shrinkage Methods

Shrinkage methods are techniques used in regression analysis to reduce the variance of parameter estimates by 'shrinking' them towards a predetermined value. The video discusses ridge regression and lasso as examples of shrinkage methods, which aim to prevent overfitting by penalizing large coefficients.

Principal Component Regression (PCR)

Principal Component Regression is a technique that combines principal component analysis with linear regression. It involves projecting the original variables into a new set of orthogonal variables (principal components) that capture the most variance in the data. The video points out that PCR only considers the input data (X) and not the output (Y), which can sometimes lead to counterintuitive results.

Partial Least Squares (PLS)

Partial Least Squares is a regression technique that finds relationships between the predictors (X) and the response (Y) by extracting components that explain the maximum covariation between X and Y. Unlike PCR, PLS takes into account both the input and output data, making it a more robust method when the relationship between X and Y is important.

Orthogonalization

Orthogonalization is the process of making vectors orthogonal to each other, which means they are at right angles and have no correlation. In the video, orthogonalization is used in the context of deriving new directions for regression analysis, ensuring that each new component (like Z1, Z2, etc.) is uncorrelated with the previous ones.

Centering and Standardizing

Centering involves adjusting the data so that the mean of each variable is zero, while standardizing additionally scales each variable to unit standard deviation. The video emphasizes centering Y and standardizing the inputs for both PCA and PLS so that no variable dominates the result simply because of its measurement scale.

Projection

In the context of the video, projection refers to the process of finding how much of the response variable (Y) can be explained by each predictor variable (Xj) individually. This is done by projecting Y onto each Xj and then summing these projections to create a derived direction, which is a key step in the PLS method.

Univariate Regression

Univariate regression is a type of regression analysis that involves only one predictor variable. The video discusses how, after orthogonalizing the predictors, univariate regression can be used to find the relationship between Y and each derived direction (Z) individually, simplifying the analysis.

Coefficients

In regression analysis, coefficients are the numerical values that multiply the predictor variables to determine the predicted value of the response variable. The video explains how coefficients for the derived directions (Z1, Z2, etc.) are calculated and how they can be used to derive coefficients for the original variables (X) in PLS.

Overfitting

Overfitting occurs when a statistical model is too complex and captures noise in the training data, leading to poor generalization to new data. The video mentions shrinkage methods like ridge regression and lasso as ways to prevent overfitting by penalizing large coefficients and simplifying the model.

Highlights

Continuation of the discussion on linear regression methods.

Introduction to subset selection methods including forward, backward, and stepwise selection.

Exploration of shrinkage methods like ridge regression and lasso.

Introduction to derived directions starting with principal component regression.

The limitation of principal component regression in not considering the output data.

Assumption of centered Y and standardized inputs for both PCA and partial least squares.

Process of creating derived directions by projecting Y on Xj and summing the projections.

Explanation of how to find the first derived direction z1 by summing univariate contributions.

The concept of using Y in the regression to find derived directions.

Distinguishing partial least squares from PCR by considering the output variable Y.

Demonstration of how to orthogonalize by regressing xj on z1.

Iterative process of finding new directions xj2 and their corresponding z directions.

Orthogonality of derived directions ensuring univariate regression can be performed.

Derivation of coefficients for the original variables X from the derived directions Z.

The equivalence of the fit obtained by using p PLS directions to the original least squares fit.

Implication of orthogonal X variables on the PLS method, potentially stopping after one step.

Concluding the discussion on regression methods with insights into partial least squares.

Transcripts

00:00

Okay, so we will continue from where we left off, as I promised. We are looking at linear regression, and I said there are three classes of methods, of which we are looking at a couple of examples each. The first class we looked at was subset selection, where we covered forward selection, backward selection, stage-wise selection, stepwise selection and so on. Then we looked at shrinkage methods, where we covered ridge regression and the lasso. Then we started looking at derived directions, where we looked at principal component regression, and I said the next one we would look at is partial least squares. The motivation for partial least squares is mainly that principal component regression only looks at the input data and pays no attention to the output, so you might sometimes come up with really counterintuitive directions, like the example I showed you with the +1 and -1 outputs. So the basic idea here is that we are going to use the Y as well.

01:36

Just as in the usual case, I am going to assume that Y is centered, and I am also going to assume that the inputs are standardized. This is something you have to do for both PCA and partial least squares: each column of the data given to you should be made zero mean and unit variance, so that you do not get any magnitude-related effects on the output. What I am going to do is the following; if you remember how we did orthogonalization earlier, it is something very similar. I am going to look at the projection of Y on each Xj, and then I am going to create a derived direction which essentially sums up all of these projections. I am computing the projection of Y on Xj, so the direction is essentially a vectorized version of that, and then I sum all of these up. So essentially I am taking each variable Xj in turn and seeing what its effect on Y is, that is, how much of Y I am able to explain just by taking Xj alone, and I am combining all of that into a single direction. Individually taking each direction by itself, how much of Y can I explain? That becomes my first derived direction, my z1. And the coefficient for z1 in my eventual regression fit, you can see what it is: I take Y and regress it on z1, and that gives me the coefficient θ1 for z1. So, again, I am looking at how much of Y is along each direction Xj; in some sense, if I had only the one variable Xj, how much of Y could be explained with that one variable? My first direction z1 is essentially the sum of those univariate contributions over all my input directions.

05:23

Suppose I have two input directions; unfortunately I would have to draw this in 3D. What I am going to do is take my Y and project it on x1 alone first, and then on x2 alone. This is tricky to draw; it is going to be hard to do pictorially on the board, so I am not going to do it. I would actually have to plot Y as a function, I cannot just do it with single data points, that would not make sense; I would have to draw Y as a surface over x1 and x2 and then talk about the projection, and that is going to be hard. But the basic idea is this: I take Y, I find the projection of Y along x1, then I find the projection of Y along x2, I take the sum of these two, and whatever the resulting direction is, I use that as my first direction.

06:59

In PCR, what we did was first find the directions in X that had the highest variance. Here we are not finding the directions in X with the highest variance; we are finding directions in X, in some sense components of X, that point more in the direction of the output variable Y. Eventually you can show, though we are not going to do it here, that the directions Z1, Z2, Z3 that you pick are those which have high variance in the input space but also a high correlation with Y; it is actually an objective function that tries to balance correlation with Y against variance in the input space. PCR uses only the variance in the input space and does not worry about the correlation, but for partial least squares you can show that it takes the correlation into account as well.
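
The lecture appeals to this objective without writing it down. For reference, one standard way to state it (see, e.g., Hastie, Tibshirani and Friedman, The Elements of Statistical Learning) is that the m-th PLS direction z_m = Xφ_m solves

$$
\hat{\varphi}_m = \arg\max_{\alpha} \ \operatorname{Corr}^2(\mathbf{y}, \mathbf{X}\alpha)\,\operatorname{Var}(\mathbf{X}\alpha)
\quad \text{subject to} \quad \lVert \alpha \rVert = 1, \ \ \alpha^{\mathsf{T}} \mathbf{S}\, \hat{\varphi}_\ell = 0 \ \text{ for } \ell = 1, \dots, m-1,
$$

where S is the sample covariance matrix of the standardized inputs; PCR drops the correlation term and maximizes Var(Xα) alone.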

08:08

We have found the first coordinate; now what do you do? You orthogonalize. So what should I be doing now? I should regress each xj on z1. This is how we did orthogonalization earlier: you find one direction, then you regress everything else on that direction and subtract, and that gives you the orthogonal directions. That is essentially what you are doing here; the expressions look big, but if you have been following the material from the previous classes, we are just reusing the univariate regression construction we had earlier.

09:43

So now I have a new set of directions, which I will call xj2; xj1 was the original xj I started off with. With this new set of directions xj2 I can keep repeating the whole process: I take Y, project it along each xj2, combine those to get Z2, and then regress Y on Z2 to get θ2. I can keep doing this until I get as many directions as I want. The nice thing about Z1, Z2 and so on is that they themselves will be orthogonal, because each is constructed from vectors that are orthogonal to all the previous Zs. Since each one is orthogonal, I can essentially do univariate regression and do not have to worry about accommodating the previous variables: when I want to fit Zk, I just do a univariate regression of Y on Zk and I get the coefficient θk.

10:53

Is that fine? Great. So once I have θ1 through θk, how do I use them for prediction? Can I just do something like Xβ, that is, Xθ? What I can do is compute the fit from the Zs and the θs and predict with that, but then I do not really want to construct these Z directions for every new vector that comes in; I do not want to project every new point along the Z directions. Instead, if you think about it, each of those Zs is actually composed of the original variables X. So I can compute the θs and then just go back and derive coefficients for the Xs directly, because all of these are linear computations; all I need to do is figure out how to stack the θs so that I can derive the coefficients for the Xs. Think about it, you can do it as a short exercise, but I can eventually write down the coefficients β̂ for the original variables in terms of these θs: I derive θ1, θ2, θ3 and so on, and then go back and do this computation. It is very easy, you can work out what the expression should be. Here m is the number of directions I actually derive from the PLS; starting from the first direction, I can keep going.

13:04

Suppose I derive p directions: what can you tell me about the fit to the data? If I get p PLS directions, it essentially means I will get as good a fit as the original least squares fit, so I get the same fit as least squares; anything fewer than that is going to give me something different from the least squares fit. Here is a thought question: if my Xs were actually orthogonal to begin with, what would happen with PLS? The Z would be the same as the Xs, and what would happen to Z2, can I even do a Z2? No: PLS will stop after one step, because there will be no residuals left after that, so I will essentially get my least squares fit in the first attempt itself. That is what will happen. So we will stop here with regression methods.
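
A quick numerical illustration of the closing thought experiment, under the assumption that the columns of X are exactly orthonormal (built here with a QR factorization; everything else is synthetic). One PLS step already matches the least squares fit, and the weights for a second direction come out as zero, which is why PLS stops.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 3

# Construct an X with centered, exactly orthonormal columns via QR.
A = rng.standard_normal((n, p))
A = A - A.mean(axis=0)
X, _ = np.linalg.qr(A)       # orthonormal columns, still mean zero

y = rng.standard_normal(n)
y = y - y.mean()

# One PLS step: z1 = sum_j <x_j, y> x_j, then regress y on z1.
w1 = X.T @ y
z1 = X @ w1
theta1 = (z1 @ y) / (z1 @ z1)
yhat_pls = theta1 * z1

# Ordinary least squares fit for comparison.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(yhat_pls, X @ beta_ols))   # expected: True

# Orthogonalize the columns against z1 and try to form a second direction.
X2 = X - np.outer(z1, (z1 @ X) / (z1 @ z1))
w2 = X2.T @ y
print(np.allclose(w2, 0.0))                  # expected: True, so PLS stops here
```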

Related Tags
Linear Regression, Subset Selection, Shrinkage Methods, Principal Component, Partial Least Squares, Regression Analysis, Data Science, Machine Learning, Statistical Techniques, Predictive Modeling