Stanford CS224W: ML with Graphs | 2021 | Lecture 4.4 - Matrix Factorization and Node Embeddings
Summary
TL;DR: The lecture discusses node embeddings in graphs, focusing on maximizing the dot product between embeddings of similar nodes. It explores defining similarity through random walks and through simpler edge connectivity, leading to matrix factorization. The lecture also covers limitations of current embedding methods, such as the inability to capture structural similarity and the lack of feature utilization, and points to graph neural networks as a solution.
Takeaways
- 📈 The lecture discusses the connection between embeddings and matrix factorization in graph theory.
- 🧠 The embedding matrix Z has one row per embedding dimension and one column per node in the graph.
- 🔍 The goal is to maximize the dot product between similar nodes' embeddings within the encoder-decoder framework.
- 🤔 Similarity is defined by nodes co-appearing in the same random walk or being directly connected by an edge.
- 🔗 The adjacency matrix 'A' can be factorized into Z transposed times Z, aiming for an approximation where connected nodes' dot product is 1, and non-connected is 0.
- 📊 The Frobenius norm is used to measure the similarity between the original adjacency matrix and its factorized form.
- 📚 The lecture connects previous discussions on node embeddings and random walks to matrix factorization.
- 🚫 Limitations of node embeddings include inability to obtain embeddings for new nodes not in the training set and lack of capturing structural similarity.
- 🔄 The approach does not utilize node, edge, or graph-level features, separating them from the embedding process.
- 🌐 Graph neural networks are introduced as a solution to overcome these limitations, allowing the fusion of feature information with structured information.
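The edge-connectivity objective above can be sketched numerically. Below is a minimal NumPy example (all names are my own, not from the lecture) that learns Z by gradient descent on the squared Frobenius error between A and Z transposed times Z, for a toy 4-node graph:

```python
import numpy as np

# Toy 4-node cycle graph: edges (0,1), (1,2), (2,3), (3,0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

d = 2                                   # embedding dimension, d << number of nodes
rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(d, 4))  # columns are node embeddings

# Gradient descent on ||A - Z^T Z||_F^2; for symmetric A the gradient is 4 Z (Z^T Z - A)
lr = 0.01
for _ in range(3000):
    E = Z.T @ Z - A
    Z -= lr * 4 * Z @ E

loss = np.linalg.norm(A - Z.T @ Z, ord="fro") ** 2
```

Because d is much smaller than the number of nodes (and Z transposed times Z is positive semidefinite), the reconstruction can only be approximate; the loss plateaus above zero rather than reaching it.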
Q & A
What is the main objective of the encoder-decoder framework discussed in the lecture?
-The main objective of the encoder-decoder framework is to maximize the dot product between pairs of similar nodes, ensuring that if two nodes are similar, the dot product of their embeddings in the matrix Z should be high.
How is node similarity defined in the context of the lecture?
-Node similarity is defined through two methods: 1) If two nodes co-appear in the same random walk starting at a given node, and 2) If two nodes are connected by an edge.
What is matrix factorization in the context of this lecture?
-Matrix factorization, in this context, refers to the process of representing the adjacency matrix A as a product of two matrices, Z and Z transposed, where Z is the embedding matrix.
Why is an exact approximation of A as Z transpose times Z generally not possible?
-An exact approximation is generally not possible because the embedding dimension D (the number of rows in Z) is much smaller than the number of nodes, which means the matrix is wide but not deep, lacking enough representation power to capture each edge perfectly.
What is the Frobenius norm and how is it used in this context?
-The Frobenius norm is a matrix norm that is the square root of the sum of the absolute squares of its elements. In this context, it is used to measure the similarity between the adjacency matrix A and the factorized form Z transpose times Z by summing the square differences between corresponding entries.
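As a quick illustration (a hypothetical NumPy snippet, not from the lecture), the squared Frobenius norm of the reconstruction error is just the sum of squared entry-wise differences:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])    # tiny 2-node adjacency matrix
Z = np.array([[0.9, 0.1]])    # 1-dimensional embeddings, one column per node

diff = A - Z.T @ Z
fro_sq = np.linalg.norm(diff, ord="fro") ** 2  # same as (diff ** 2).sum()
```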
How is the adjacency matrix transformed in the case of random walk-based similarity?
-The adjacency matrix is transformed by forming the random-walk transition matrix (the inverse of the diagonal degree matrix times the adjacency matrix), summing its powers for r from 1 to T (the context window length), scaling by the volume of the graph, and taking an elementwise log, minus the log of the number of negative samples used.
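In symbols, following the NetMF paper the lecture cites ("Network Embedding as Matrix Factorization"), the matrix that DeepWalk implicitly factorizes can be written as (log taken elementwise):

```latex
Z^{\top} Z \;\approx\; \log\!\left( \frac{\operatorname{vol}(G)}{T}
  \left( \sum_{r=1}^{T} \left( D^{-1} A \right)^{r} \right) D^{-1} \right) - \log b
```

where vol(G) is the sum of the entries of A (twice the number of edges), D is the diagonal degree matrix, T the context window length, and b the number of negative samples.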
What are the limitations of node embeddings via matrix factorization or random walks?
-The limitations include: 1) Inability to obtain embeddings for nodes not in the training set, meaning new nodes require re-computation of all embeddings. 2) Failure to capture structural similarity, as the embeddings are based on immediate neighbors rather than overall structure. 3) Inability to utilize node, edge, or graph-level features in the embedding process.
What is the solution proposed for the limitations of node embeddings?
-The solution proposed is deep representation learning and graph neural networks, which will be discussed in the next lecture. These methods aim to resolve the limitations by fusing feature information with structured information.
How is PageRank discussed in the lecture?
-PageRank is discussed in three ways: 1) as a flow formulation, 2) as the stationary distribution of a random walk process, and 3) from a linear algebra perspective, by computing the eigenvector of a transformed graph adjacency matrix.
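The linear algebra view can be sketched in a few lines of NumPy (a hypothetical example, not from the lecture): power iteration on the teleportation-adjusted column-stochastic matrix converges to the PageRank vector, its leading eigenvector:

```python
import numpy as np

# Column-stochastic link matrix for a toy 3-node directed graph:
# M[j, i] = 1 / out-degree(i) if there is an edge i -> j
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

beta = 0.85                    # damping / teleportation factor
n = M.shape[0]
G = beta * M + (1 - beta) / n  # teleportation-adjusted "Google" matrix

r = np.full(n, 1.0 / n)        # start from the uniform distribution
for _ in range(100):           # power iteration
    r = G @ r
```

The resulting r sums to 1 and satisfies G r = r, i.e., it is the eigenvector of the transformed matrix with eigenvalue 1.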
What are the different ways of viewing graphs mathematically as mentioned in the lecture?
-Graphs can be viewed mathematically as matrices, where the adjacency matrix is factorized, eigenvectors and eigenvalues are computed, and connectivity information is extracted using linear algebra tools.
What is the significance of the lecture's conclusion about viewing graphs as matrices?
-The conclusion emphasizes that viewing graphs as matrices plays a key role in various algorithms, including matrix factorization, eigenvector computation, and connectivity analysis, which are all crucial for understanding and working with graph structures.
Outlines
📊 Matrix Factorization and Embeddings
The lecture concludes with a discussion of the connection between embeddings and matrix factorization. The lecturer revisits the embedding matrix Z, where each column stores the embedding of a node in the graph. The goal is to maximize the dot product between similar nodes' embeddings. Similarity is first defined by co-appearance in random walks. The lecturer then explores a simpler definition: nodes are similar if they are directly connected by an edge, aiming for a dot product of 1 if connected and 0 if not. This leads to the matrix factorization problem where the adjacency matrix A is approximated by the product of Z transposed and Z. The limitations of this approach are discussed, including the inability to capture structural similarity and the inability to include node or edge features.
🔍 DeepWalk and Node2vec as Matrix Factorization
The lecturer explains that the more complex definition of similarity used in algorithms like DeepWalk and node2vec can also be viewed as a form of matrix factorization, but of a more complex transformed adjacency matrix. The transformation involves considering the lengths of random walks and the number of negative samples used in optimization. The lecturer highlights that the process of simulating random walks or directly factorizing the transformed adjacency matrix yields the same result. The limitations of these methods are also discussed, including the inability to obtain embeddings for new nodes not present in the training set and the inability to capture structural similarity.
🚀 Graph Neural Networks as a Solution
The final part of the lecture addresses the limitations of node embeddings via matrix factorization and random walks, such as the inability to capture structural similarity and the inability to utilize node or edge features. The lecturer introduces graph neural networks as a solution to these limitations, which will be discussed in the following lectures. The lecture summarizes by highlighting the importance of viewing graphs as matrices and the various ways this can be done, including matrix factorization, eigenvector computation, and connectivity extraction using linear algebra.
Keywords
💡Embeddings
💡Matrix Factorization
💡Adjacency Matrix
💡Random Walk
💡Frobenius Norm
💡DeepWalk
💡Node2vec
💡Graph Neural Networks
💡Eigenvector
💡PageRank
Highlights
The lecture discusses the connection between node embeddings and matrix factorization.
An embedding matrix Z is introduced where rows represent the embedding dimensions and columns represent nodes in the graph.
The goal in the encoder-decoder framework is to maximize the dot product between similar node pairs.
Similarity is defined through co-appearance in the same random walk.
A simpler notion of similarity is proposed where nodes are similar if they are directly connected by an edge.
Matrix factorization is used to approximate the adjacency matrix 'A' as the product of two matrices, Z transposed and Z.
The embedding dimension 'D' is typically much smaller than the number of nodes, leading to an approximation rather than an exact match.
Frobenius norm is used to measure the similarity between the original adjacency matrix and its factorization.
The inner product decoder with node similarity defined by edge connectivity is equivalent to the factorization of the adjacency matrix.
Random walk-based similarity can also be viewed as a form of matrix factorization, even with a more complex adjacency matrix transformation.
The limitations of node embeddings via matrix factorization or random walks include the inability to obtain embeddings for nodes not in the training set.
Embeddings cannot capture structural similarity, as they focus more on the identities of neighbors rather than local network structure.
These approaches cannot utilize node, edge, or graph-level features, as they create node embeddings separately from any features the nodes might have.
Graph neural networks are introduced as a solution to the limitations of node embeddings.
Graph neural networks allow for the fusion of feature information with structured information.
The lecture summarizes the discussion on PageRank, random walk extensions, and node embeddings based on random walks.
Viewing graphs as matrices plays a key role in various algorithms, including matrix factorization, eigenvector computation, and connectivity extraction.
Transcripts
So this now concludes,
uh, the first two parts of the lecture.
And what I wanna do in the, uh,
remaining, uh, few minutes,
10 minutes or so, I wanna talk about
the connection between node embeddings and matrix factorization.
So, uh, let me refresh you what we talked,
uh, what we talked, uh, on Tuesday.
So we talked about embeddings and we talked
about how we have this embedding matrix z that, you know,
the number of rows is the e- embedding dimension,
and the number of columns is the number of nodes,
uh, in the graph.
And this means that every column of this matrix z
will store an embedding for that given node.
And our objective in this encoder-decoder framework was that we wanna maximize,
um, the- the- the dot product between pairs of nodes,
uh, that are similar.
So if two nodes are similar, uh,
then their dot product of their embeddings of
their columns in this matrix z, uh, has to be high.
And that was the- that was the idea.
Then of course, how did you define this notion of similarity?
We said two nodes are similar if they co-appear in the same random walk,
uh, starting at the given node.
So now, you know,
how- how can we,
uh, think about this more broadly, right?
You could say we define the notion of
similarity through random walks in the previous lecture.
What if we define an even simpler notion of, uh, similarity?
What if we say two nodes are similar if they are connected by an edge?
Then what this means is we say, "Oh,
I want to approximate this matrix, say the entry (u, v) of the matrix,
which is either 0 or 1, by this dot product."
Like I'm saying, if two nodes u and v are connected,
then I want their dot product to be 1.
And if they are not connected,
I want their dot product to be 0, right?
Um, of course, I can write it this way at the level of a single entry,
but if I write it in matrix form,
I'm basically saying,
this is Z transposed times Z equals A.
So this is now called matrix factorization because I take my matrix A and I factorize it,
I represent it as a product of,
um- of, uh, two matrices.
Um, Z- Z transposed and Z.
So essentially I'm saying I take my adjacency matrix, uh, A.
Here is, for example, one entry of it.
I want- because there is an edge, I want, you know,
the- the- the dot product of this row and that column to be value 1.
For example, for, uh, for this entry,
I would want the product of this,
uh, row and this particular column to be 0.
So this means that we take matrix A and factorize it as a product of Z transpose times Z.
Of course, because the embedding dimension D, the number of rows in Z,
is much, much smaller than the number of, uh, nodes, right?
So the matrix is very wide,
but not too deep.
This means that this exact approximation saying
A equals Z transpose times Z is generally not possible, right?
We don't have enough representation power
to really capture each edge perfectly.
So we can learn this matrix Z approximately.
So what we can say is,
let's find the matrix Z such that, you know, the, uh,
Z transpose times Z, uh,
the values of it are as similar to the values of A as possible.
How do I measure similarity now between
the values? By using what is called the Frobenius norm,
which is simply take an entry in A,
subtract an entry in this,
uh- in this matrix Z transpose times Z,
um, and take the square value of it and sum it up.
So Frobenius norm is simply, uh,
a sum of the square differences, uh,
between corresponding entries into, uh- into matrices.
Right? So this is now very similar to what we were
also doing in the previous, uh, lecture, right?
In the node embeddings lecture.
In the node embeddings lecture,
we were not using the L2 norm to- to define
the discrepancy between A and its factorization,
we used the softmax,
uh, uh, function instead of the L2 norm.
But the goal of approximating A with Z transpose times Z was the same.
So the conclusion is that the inner product decoder with node similarity defined by
edge connectivity is equivalent to
the factorization of the adjacency matrix, uh, A, right?
So we say we wanna directly approximate A by
the embeddings of the nodes such that if two nodes are linked,
then I want their dot product to be equal to 1.
And if they are not linked,
I want their dot product to be equal to 0.
So that's the, uh- the simplest way to define node similarity.
Now, in random walk-based similarity,
it turns out that when we have this more,
um, nuanced, more complex definition of, uh,
similarity, then, um, it is still- the entire process is still equivalent to
matrix factorization of just- of a more complex or transformed, uh, adjacency matrix.
So, um, here's the equation,
let me explain it, uh, on the next slide, what does it mean.
So it means that what we are really trying to do is
we are trying to factorize this particular,
um, transformed graph adjacency matrix.
So what- what is the transformation?
The transformation is- here is the, um, adjacency matrix.
Here is the, um, one over the- the diagonal matrix,
these are the node degrees.
Um, this is now, uh,
raised to the power r,
where r goes between 1 and T, where, uh,
capital T, the context window length, is actually the length of the random walks that,
uh, we are simulating in,
uh- in DeepWalk or, uh, node2vec.
Um, volume of G is simply the sum of the entries of the adjacency matrix,
which is twice the number of edges.
And the log b is a factor that corresponds to
the number of negative samples we are using in the optimization problem.
So basically, what this means is that you can compute,
um- in this case,
DeepWalk either by what we talked about last time, by simulating random walks
and defining the gradients and doing gradient descent or taking the adjacency matrix,
A, of your graph,
transforming it according to this equation where
we both take into account the lengths of the random walks,
as well as the number of negative samples we use.
And if we factorize this, uh,
transform matrix, what I mean by this is if we go and, uh,
replace this A here with the transform matrix and then we try to solve,
uh, this- this equation,
uh, then the, uh, the solution to it,
this matrix Z, will be exactly the same as what,
uh- what we- what, uh,
our approach that we discussed in the previous lecture arrived to.
So, um, basically there is a very- very nice paper,
um, if you wanna know more about this, called Network Embedding as
Matrix Factorization, that basically unifies DeepWalk, LINE,
um, and a lot of other algorithms, including node2vec, in this,
uh, one mathematical, uh, framework.
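The transformed matrix the lecturer describes can be sketched as follows. This is a minimal NumPy sketch assuming the NetMF closed form for DeepWalk; the function name and the small clipping constant are my own, and NetMF itself truncates the log slightly differently:

```python
import numpy as np

def deepwalk_matrix(A, T=3, b=1):
    """Matrix whose elementwise-log factorization DeepWalk implicitly performs,
    assuming the NetMF form: log( vol(G)/(bT) * sum_r (D^-1 A)^r * D^-1 )."""
    vol = A.sum()                        # volume of G = twice the number of edges
    D_inv = np.diag(1.0 / A.sum(axis=1))
    P = D_inv @ A                        # random-walk transition matrix
    S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1))
    M = (vol / (b * T)) * S @ D_inv
    return np.log(np.maximum(M, 1e-12))  # clip to keep the elementwise log finite

A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])          # triangle graph
M = deepwalk_matrix(A)
```

Factorizing this M with the Frobenius objective (instead of simulating walks and running gradient descent) recovers the same embeddings, which is the equivalence the lecturer points out.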
So, um, as I said,
this random walk-based similarity can also be thought of as, um, matrix factorization.
The equation I show here is for DeepWalk.
Um, you can derive similar type of matrix transformation for, uh, node2vec.
Just the matrix is even more complex because the random walk process of node2vec is,
uh- is more complex,
is more, uh- more nuanced.
So to conclude the lecture today,
I wanna talk a bit about the limitations and kind of
motivate what are we going to talk about next week.
So the limitations of node embeddings via this matrix factorization or this random walks,
uh, like- like I discussed in terms of, um,
node2vec, DeepWalk for multidimensional embeddings,
you can even think of PageRank as a single-dimensional embedding is that,
um, we cannot obtain embeddings for nodes not in the training set.
So this means if our graph is evolving,
if new nodes are appearing over time, then, uh,
the nodes that are not in the graph, uh,
at the time when we are computing embeddings won't have an embedding.
So if a newly added node 5,
for example, arrives, let's say, at test time,
or is a new user in the social network,
we can- we cannot compute its embedding,
we have to redo everything from scratch.
We have to recompute all embeddings of all nodes in the network.
So this is very limiting.
The second important thing is that this, uh,
embeddings cannot capture structural similarity.
Uh, the reason being is, for example,
if I have the graph like I show here and I consider nodes,
uh, 1 and 11.
Even though they are in very different parts of the graph,
their local network structure, uh,
looks- looks quite similar.
Um, and DeepWalk and node2vec will come up with
very different embeddings for 11 and- and 1 because,
you know, 11 is neighbor with 12 and 13,
while 1 is neighbor with 2 and 3.
So this means that these types of embeddings
won't be able to capture this notion of local structural similarity,
but will rather capture the identities of the neighbors,
uh, next, uh- next to a given starting node.
Of course, if you were to define these over anonymous walks,
then, um, that would allow us to capture the structure because, uh,
the identities of the nodes, like 2 and 3,
um, uh, are forgotten, and then we would be able to,
uh, solve this, uh, problem.
And then the last limitation I wanna talk about is that
these approaches cannot utilize node-, edge-, and graph-level features,
meaning, you know, feature vectors attached to nodes,
uh, edges, and graphs cannot naturally be incorporated in this framework.
Right? We basically create node embedding separately
from the features that these nodes might have.
So for example, you know, a user in
a social network may have a set of properties, features,
attributes, whereas a protein has a set of properties,
features, that you would want to use in creating, uh, embeddings.
And really, what are we going to talk next is,
um, you know, what are the solutions to these limitations?
The solution for- to these limitations is deep representation learning and,
uh, graph neural networks.
And in the next week,
we are going to move, uh,
to the topic of, uh,
graph neural networks that will allow us to
resolve these limitations that I have just, uh,
discussed, and will allow us to fuse the feature
information together with the structured information.
So to summarize today's lecture,
we talked about PageRank, which measures the importance of nodes in a graph.
Uh, we talked about it in three different ways,
we talked about it in terms of flow formulation,
in terms of links, and nodes.
We talked about it in terms of
random walk and stationary distribution of a random walk process.
And we also talked about it from the linear algebra point of view, uh,
by basically computing the eigenvector, uh,
of a particularly transformed,
uh, graph adjacency matrix.
Then we talked about, um,
extensions of PageRank, in particular random walk
with restarts and personalized PageRank, where
basically the difference is in terms of changing the teleportation set.
Um, and then the last part of the lecture,
we talked about node embeddings based on random walks and
how they can be expressed as a form of matrix factorization.
So this means that viewing graphs as matrices plays
a key role in all of the above algorithms,
where we can think of this in many different ways.
But mathematically, at the end,
it's all about the matrix representation of the graph: factorizing this matrix,
computing eigenvectors and eigenvalues, and extracting connectivity information,
uh, out of it with the tools of, uh, linear algebra.
So, um, thank you very much,
everyone, for the lecture.