Week 3 Lecture 13 Principal Components Regression
Summary
TLDR: This script delves into Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) in the context of data analysis. It explains how SVD factorizes a matrix into a diagonal matrix of singular values and orthogonal matrices of singular vectors, and how PCA uses these components to find the directions of maximum variance for projecting the data. The script highlights the benefits of PCA, such as reducing dimensionality while retaining most of the variance, and discusses potential drawbacks, like ignoring the output variable when selecting principal components. It also touches on the importance of considering both the input data and the output variable for optimal feature selection in regression and classification tasks.
Takeaways
- 📊 The script discusses the concept of Singular Value Decomposition (SVD), where D is a diagonal matrix of singular values, V is a P x P matrix whose columns are eigenvectors of XᵀX, and U is an N x P matrix whose columns span the column space of X.
- 🔍 Principal Component Analysis (PCA) is mentioned as a method that involves finding the covariance matrix of centered data and then performing an eigendecomposition to find principal component directions.
- 🌐 The principal component directions are orthogonal and each explains a certain amount of variance in the data, with the first principal component (V1) having the maximum variance.
- 📈 The script explains that projecting data onto the first principal component direction (V1) results in the highest variance among all possible projection directions, indicating the spread of data.
- 🔄 The process of PCA involves selecting principal components to perform regression, where each component is orthogonal and can be used independently in regression models.
- 📉 The script highlights that the first principal component direction minimizes the reconstruction error when only one coordinate is used to summarize the data.
- 📝 Principal Component Regression (PCR) is introduced as a method to perform regression using the selected principal components, with the regression coefficients being determined by regressing Y on the principal components.
- 🚫 A drawback of PCR is that it only considers the input data and not the output, which might lead to suboptimal directions if the output has specific characteristics.
- 📊 The script contrasts ideal PCA directions with those that might be more suitable for classification tasks, where considering the output can lead to better separating surfaces.
- 📉 The importance of considering both the input data and the output when deriving directions for tasks like classification is emphasized, as it can lead to more effective models.
- 🔑 The takeaways from the script highlight the mathematical foundations of PCA and its applications in feature selection and regression, as well as the importance of considering the output in certain contexts.
Q & A
What is the singular value decomposition (SVD) mentioned in the script?
-Singular Value Decomposition (SVD) is a factorization of a matrix X into three matrices, X = U D Vᵀ: D, a diagonal matrix with the singular values on its diagonal; V, a P x P matrix whose columns are eigenvectors of XᵀX; and U, an N x P matrix whose columns span the column space of X.
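For concreteness, here is a minimal numpy sketch of that factorization; the data matrix and its sizes (N = 100, P = 5) are made up for illustration.

```python
import numpy as np

# Hypothetical data matrix: N = 100 samples, P = 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Thin (economy) SVD: X = U @ diag(d) @ Vt
U, d, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape)     # (100, 5): columns span the column space of X
print(d)           # singular values, in decreasing order
print(Vt.T.shape)  # (5, 5):   columns of V are the right singular vectors

# Sanity check: the three factors reconstruct X
assert np.allclose(X, U @ np.diag(d) @ Vt)
```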
How is the covariance matrix related to the principal component analysis (PCA)?
-In PCA, the covariance matrix is derived from the centered data. The eigendecomposition of this covariance matrix provides the principal components, which are the directions of maximum variance in the data.
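A small numpy sketch of this relationship, assuming the sample covariance uses the 1/(N-1) scaling; the synthetic X is illustrative, and each eigenvector can differ from the matching column of V only in sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # illustrative data

# Center the data and form the sample covariance S = Xc^T Xc / (N - 1)
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (Xc.shape[0] - 1)

# Eigendecomposition of the symmetric covariance matrix, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The same directions come from the SVD of the centered data:
# Xc = U D V^T  implies  Xc^T Xc = V D^2 V^T
_, d, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(eigvals, d**2 / (Xc.shape[0] - 1)))  # True
# Compare absolute values because the sign of each direction is arbitrary
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))      # True
```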
What are the principal component directions of X?
-The principal component directions of X are the columns of the V matrix obtained from the eigendecomposition of the covariance matrix of the centered data.
Why is the first principal component direction considered the best for projecting data?
-The first principal component direction is considered the best for projecting data because it captures the highest variance among all possible directions, resulting in the most spread-out data projection.
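A quick numerical check of this claim, using a made-up correlated two-dimensional dataset: the variance of the projection onto V1 should dominate the variance along any randomly drawn unit direction.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated two-dimensional data, so one direction clearly dominates
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=500)
Xc = X - X.mean(axis=0)

# First principal component direction = first right singular vector of the centered data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]

var_pc1 = np.var(Xc @ v1)

# Variance of the projection onto many random unit directions
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
var_random = np.var(Xc @ dirs.T, axis=0)

print(var_pc1 >= var_random.max())  # True: no direction beats V1
```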
What is the significance of the orthogonality of principal components?
-Orthogonality of principal components means they are uncorrelated, allowing for independent regression analysis along each principal component direction, which can be useful for feature selection and data reconstruction.
How does the script relate the concept of variance to the principal components?
-The script explains that each principal component captures the maximum variance in its respective orthogonal space. The first principal component has the highest variance, followed by the second, and so on, with each subsequent component capturing the highest variance in the space orthogonal to the previously selected components.
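One way to see this ordering numerically is through the explained-variance ratios, which are proportional to the squared singular values of the centered data; the synthetic dataset below is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated features
Xc = X - X.mean(axis=0)

_, d, _ = np.linalg.svd(Xc, full_matrices=False)
explained = d**2 / np.sum(d**2)
print(explained)             # decreasing: V1 explains the most variance, then V2, ...
print(np.cumsum(explained))  # variance retained by the first m components together
```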
What is the role of the intercept in principal component regression?
-In principal component regression the data are centered, so the intercept is simply the mean of the dependent variable (y-bar), which is added back to the regression model to account for the central tendency of the data.
How does the script discuss the limitation of principal component regression?
-The script points out that principal component regression only considers the input data (X) and not the output, which might lead to suboptimal directions for regression if the output's characteristics are not taken into account.
What is the script's example illustrating the importance of considering the output in regression?
-The script uses a classification example where projecting data in the direction of maximum variance might mix different classes, leading to poor predictive performance. Instead, a direction that separates the classes clearly, even if it has lower variance, might be more beneficial for regression or classification.
How does the script suggest improving the selection of principal components for regression?
-The script suggests that considering the output data along with the input data might help in selecting better directions for regression, especially in cases where the output data has specific characteristics that could influence the choice of principal components.
What is the general process of feature selection using principal components?
-The general process involves selecting the first M principal components, performing univariate regression on each, and adding the outputs until the residual error becomes small enough, indicating a satisfactory fit.
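A compact sketch of that procedure, assuming centered inputs and a hypothetical stopping tolerance `tol`; the helper name `pcr_fit` is not from the lecture.

```python
import numpy as np

def pcr_fit(X, y, tol=1e-3):
    """Principal component regression, one orthogonal component at a time.

    Hypothetical helper: regress y on each principal component score z_m
    (a univariate regression, since the scores are orthogonal) and stop
    once the residual sum of squares improves by less than `tol`.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)

    beta = np.zeros(X.shape[1])
    residual = yc.copy()
    for v in Vt:                       # directions in order of decreasing variance
        z = Xc @ v                     # scores along this principal direction
        theta = (z @ yc) / (z @ z)     # univariate least-squares coefficient
        new_residual = residual - theta * z
        if np.sum(residual**2) - np.sum(new_residual**2) < tol:
            break                      # this component adds too little; stop here
        residual = new_residual
        beta += theta * v              # accumulate coefficients on the original inputs
    return y.mean(), beta              # intercept is y-bar because the data were centered
```

Predictions would then be formed as intercept + (X_new - X.mean(axis=0)) @ beta, since the coefficients are expressed on the centered inputs.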
Outlines
📊 Introduction to Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)
This paragraph introduces the concept of singular value decomposition (SVD), where a matrix X is decomposed into three components: a diagonal matrix D of singular values, a P x P matrix V containing eigenvectors, and an N x P matrix U that spans the column space of X. It then relates SVD to principal component analysis (PCA), explaining that PCA involves finding the covariance matrix of the centered data and then performing an eigendecomposition on it. The paragraph emphasizes that the eigenvectors (the columns of V) from this decomposition are the principal component directions of X, which are significant for feature selection and dimensionality reduction.
🔍 Exploring Principal Component Directions and Variance Maximization
The second paragraph delves into the properties of principal component directions derived from PCA. It explains how projecting the data onto the first eigenvector (V1) results in a vector (Z1) that captures the maximum variance among all possible projection directions. The paragraph uses a two-dimensional example to illustrate how projecting data onto different directions can yield varying degrees of spread, with the principal component direction showing the most spread, indicating higher variance. It also touches on the concept of reconstruction error, stating that the first principal component direction minimizes this error, and discusses the orthogonal nature of the principal components, each capturing the highest variance in their respective subspaces.
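The reconstruction-error claim can also be checked numerically: summarizing each point by its coordinate along V1 should leave a smaller squared error than any other single direction. The two-dimensional data below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=300)
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]

def reconstruction_error(Xc, v):
    """Squared error when each point is summarized by its coordinate along v."""
    X_hat = np.outer(Xc @ v, v)        # project onto v, then map back to the plane
    return np.sum((Xc - X_hat) ** 2)

err_pc1 = reconstruction_error(Xc, v1)
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
errs = [reconstruction_error(Xc, u) for u in dirs]
print(err_pc1 <= min(errs))            # True: V1 gives the smallest reconstruction error
```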
📉 Principal Component Regression and Its Limitations
This paragraph discusses the application of principal components in regression analysis, known as principal component regression. It describes how to incorporate the principal components into a regression model by regressing Y on the transformed data Zm, where Zm is a function of the principal components. The paragraph also points out a limitation of principal component regression: it focuses solely on the input data (X) and does not consider the output (Y), which might lead to suboptimal directions for projection if the output's characteristics are not aligned with the variance captured by the principal components. The discussion includes a hypothetical example comparing the effectiveness of different projection directions in a classification context, highlighting the importance of considering the output when selecting directions for analysis.
Keywords
💡Diagonal Matrix
💡Eigenvalues
💡Singular Value Decomposition (SVD)
💡Eigenvectors
💡Covariance Matrix
💡Principal Component Analysis (PCA)
💡Principal Component Directions
💡Variance
💡Projection
💡Reconstruction Error
💡Orthogonal Directions
💡Feature Selection
💡Residual
Highlights
Introduction to singular value decomposition (SVD) and its components: D (diagonal matrix of singular values), V (matrix of eigenvectors), and U (matrix spanning the column space of X).
Linking SVD to principal component analysis (PCA) and the concept of covariance matrix S.
Explanation of how PCA involves eigendecomposition of the covariance matrix of centered data.
Principal component directions as columns of the V matrix derived from PCA.
PCA's utility in feature selection and its implications for regression and classification.
The concept of projecting data X onto the first eigenvector direction to achieve maximum variance.
Illustration of how variance is maximized in the direction of the first principal component.
Discussion on the reconstruction of data using principal components and the associated error.
The first principal component direction minimizes reconstruction error.
Orthogonality of principal components and their role in explaining variation in data.
The method of independently performing regression on orthogonal dimensions.
Principal component regression and its limitation of focusing only on input data.
The process of adding principal components to a regression model until residuals are sufficiently small.
The drawback of PCA in cases where output data should also be considered for optimal direction selection.
Example of how PCA might not be optimal for classification if the principal direction does not separate classes well.
The importance of considering both input data and output labels when deriving directions for regression or classification.
The potential for better prediction accuracy when considering the output in the selection of principal components.
Transcripts
Right, so D is a diagonal matrix where the diagonal entries are your eigenvalues, or otherwise known as singular values. V is a P x P matrix which has your eigenvectors, and U is an N x P matrix which spans the same column space as X. So this is essentially the singular value decomposition that we talked about.
So if you look at the singular value decomposition, or at what is called the principal component analysis literature, you will find that they talk about the covariance matrix S. What is the covariance matrix? If you think of whatever we have been doing so far, it is built from the centered data: I take the centered data Xc, and then I find the eigendecomposition of the covariance matrix.
So I can essentially write this with the same V and D that I wrote here. If I take Xc, I am basically going to get the same thing; it is essentially like doing the singular value decomposition and retrieving the V matrix. I am essentially taking XᵀX of the centered data, which is the covariance matrix, and finding the eigenvalue decomposition of that, so V contains the eigenvectors of that matrix and D² the eigenvalues. This is standard stuff you should know. So the columns of V,
they are called the principal component directions of X. There are a couple of nice things about the principal component directions; we will talk about just one here. I will actually come back to PCA slightly later, when I talk more generally about feature selection, not just in the context of regression, and show you why PCA is good. Right now I will just tell you why PCA is good; later I will come back and show you why.
So suppose I take the projection Z1 = X V1, where V1 is the eigenvector corresponding to the first eigenvalue. Essentially what this means is that I am projecting my data X onto the first eigenvector direction, and the resulting vector Z1 will have the highest variance among all possible directions onto which I can project X.
So what does that mean? Suppose this is X; this is not x and y, it is a two-dimensional X. I am claiming that V1 will be such that when I project X onto V1 I get the maximum variance; in this case it will be some direction like this, and projecting X onto it you can see that the data is pretty spread out, it goes from here to here. On the other hand, if I had taken a direction that looks like that and I look at the projection of the data, the spread is a lot less in that direction than in the first direction I projected onto. I know the picture looks pretty confusing, but you can get my point: in the original direction the data was a lot more spread out, as opposed to this direction where the data is a lot more compact when I project onto it.
So that is essentially what I am saying: Z1, the data projected onto that direction, has the highest variance among all the directions onto which I can project the data. Consequently, you can also show things like this: if I am looking to reconstruct the original data and I say that you can only give me one coordinate, so you have to summarize the data in a single coordinate, then I am going to measure the error in reconstruction. If you look at it, the error in reconstruction would be these bars along which I did the projection.
That would be the error in reconstruction: I have the original data, now you give me only these coordinates, and I have to reconstruct the data, so essentially these will be the errors. The first principal component direction is the one that has the smallest reconstruction error. We can show a lot of nice properties like this, and I will actually come back and do this later
when we talk about general feature selection. But here, the first thing you can see is that V1 to VP will be orthogonal, so I have my orthogonal directions, and the thing to notice is that a lot of the variation in the data is explained by V1, that is, V1 has the maximum variance. Likewise, if you take out V1, your data now lies in some kind of a p - 1 dimensional space, and the direction in that space which has the highest variance is V2. So V1 has the highest variance over the data, in the space orthogonal to V1, V2 has the highest variance, in the space orthogonal to V1 and V2, V3 has the highest variance, and so on. So essentially what you can do now is take these directions one at a time and do the regression; because each is orthogonal, I can do the regressions independently, add the outputs, and keep adding dimensions until my residual becomes small enough. At that point I stop. So this is essentially the idea behind principal component regression.
Remember we are working with the centered data, so you automatically add in your intercept, which is y bar; the coefficient is y bar. Then, if you choose to take the first M principal components, the fit is a sum of terms θm Zm, where Zm is the projection of the data onto the m-th principal component direction and θm is obtained by regressing Y on Zm, a univariate regression expression that we know well by now. So this gives you the principal component regression fit. One of the drawbacks of doing principal component regression is that I am only looking at the data, the input; I am not looking at the output.
So it could very well be that once I consider what the output is, I might want to change the directions a little bit. I can give you an example; it is easier for me to draw if I think of classification.
Let us say this is the data. What would the principal component direction be? You would want to choose something like this; that would be the ideal direction to choose, and the data will get projected like this. But suppose I tell you
that these three points are in a different class, and if you want to think of it in terms of regression, let us assume that these three have an output of -1 and these four have an output of +1. Now if you think of this direction, the +1 and -1 are hopelessly mixed up, and I cannot give a smooth prediction of which will be +1 and which will be -1. On the other hand, if you project onto a direction like this, I agree the variance is much smaller, but if you think about it, all the -1 go to one side and all the +1 go to one side. Now if I want to do a prediction, it will be like: this side is -1 and that side is +1, and I can essentially do a fit like this which will give me a lot less error than in the other case. So in cases where an output is already specified for you, it might be beneficial to look at the output also when trying to derive directions, as opposed to just looking at the input data. In classification you can see it clearly: this will be, say, class 1 and this will be class 2, and having this direction allows you to have a separating surface somewhere here; we talked about classification in the first class. Having a separating surface here would be great, but if I am projecting onto the other direction, coming up with a linear separating surface is going to be hard, because everything gets completely mixed up.
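To mirror this classification example numerically, here is an illustrative sketch; the data layout is invented, not the lecture's exact figure. The maximum-variance projection mixes the two classes, while a low-variance direction separates them almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two classes spread along the same long axis but offset along a short axis.
n = 200
class_a = np.column_stack([rng.normal(0, 5, n), rng.normal(+1.0, 0.3, n)])
class_b = np.column_stack([rng.normal(0, 5, n), rng.normal(-1.0, 0.3, n)])
X = np.vstack([class_a, class_b])
y = np.concatenate([np.ones(n), -np.ones(n)])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
z1 = Xc @ Vt[0]   # projection onto the maximum-variance direction
z2 = Xc @ Vt[1]   # projection onto the low-variance direction

def sign_accuracy(z, y):
    """Accuracy of the better of the two sign-threshold rules on a 1-D projection."""
    return max(np.mean(np.sign(z) == y), np.mean(np.sign(-z) == y))

print(np.var(z1), sign_accuracy(z1, y))  # high variance, classes hopelessly mixed (~0.5)
print(np.var(z2), sign_accuracy(z2, y))  # low variance, classes cleanly separated (~1.0)
```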