Stanford CS224W: ML with Graphs | 2021 | Lecture 4.4 - Matrix Factorization and Node Embeddings

Stanford Online
22 Apr 2021 · 12:47

Summary

TLDR: The lecture discusses the concept of node embeddings in graph theory, focusing on maximizing dot products between similar nodes. It explores defining similarity through random walks and simpler edge connectivity, leading to matrix factorization. The lecture also touches on limitations of current embedding methods, such as inability to capture structural similarity and lack of feature utilization, and hints at graph neural networks as a potential solution.

Takeaways

  • 📈 The lecture discusses the connection between embeddings and matrix factorization in graph theory.
  • 🧠 Embedding matrix 'z' is defined with rows as embedding dimensions and columns as nodes in the graph.
  • 🔍 The goal is to maximize the dot product between similar nodes' embeddings within the encoder-decoder framework.
  • đŸ€” Similarity is defined by nodes co-appearing in the same random walk or being directly connected by an edge.
  • 🔗 The adjacency matrix 'A' can be factorized into Z transposed times Z, aiming for an approximation where connected nodes' dot product is 1, and non-connected is 0.
  • 📊 The Frobenius norm is used to measure the similarity between the original adjacency matrix and its factorized form.
  • 📚 The lecture connects previous discussions on node embeddings and random walks to matrix factorization.
  • đŸš« Limitations of node embeddings include inability to obtain embeddings for new nodes not in the training set and lack of capturing structural similarity.
  • 🔄 The approach does not utilize node, edge, or graph-level features, separating them from the embedding process.
  • 🌐 Graph neural networks are introduced as a solution to overcome these limitations, allowing the fusion of feature information with structured information.

Q & A

  • What is the main objective of the encoder-decoder framework discussed in the lecture?

    -The main objective of the encoder-decoder framework is to maximize the dot product between pairs of similar nodes, ensuring that if two nodes are similar, the dot product of their embeddings in the matrix Z should be high.
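
    As a minimal sketch of this decoder (assuming NumPy; the matrix Z and node indices here are hypothetical, not from the lecture), the decoded similarity of two nodes is just the dot product of their embedding columns:

      import numpy as np

      # Hypothetical embedding matrix: d rows (embedding dimension), n columns (one per node).
      d, n = 16, 100
      rng = np.random.default_rng(0)
      Z = rng.normal(size=(d, n))

      def decode(Z, u, v):
          """Decoder score for nodes u and v: dot product of their embedding columns."""
          return Z[:, u] @ Z[:, v]

      # Training pushes this score up for similar node pairs and down for dissimilar ones.
      print(decode(Z, 3, 7))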

  • How is node similarity defined in the context of the lecture?

    -Node similarity is defined through two methods: 1) If two nodes co-appear in the same random walk starting at a given node, and 2) If two nodes are connected by an edge.

  • What is matrix factorization in the context of this lecture?

    -Matrix factorization, in this context, refers to representing the adjacency matrix A as the product Z transposed times Z, where Z is the embedding matrix.

  • Why is an exact approximation of A as Z transpose times Z generally not possible?

    -An exact approximation is generally not possible because the embedding dimension D (the number of rows in Z) is much smaller than the number of nodes, which means the matrix is wide but not deep, lacking enough representation power to capture each edge perfectly.

  • What is the Frobenius norm and how is it used in this context?

    -The Frobenius norm is a matrix norm equal to the square root of the sum of the squares of a matrix's entries. In this context, it is used to measure the discrepancy between the adjacency matrix A and the factorized form Z transpose times Z by summing the squared differences between corresponding entries.
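
    In symbols, following the lecture's description (a restatement, with z_u denoting the column of Z for node u):

      \min_{Z \in \mathbb{R}^{d \times n}} \left\lVert A - Z^{\top} Z \right\rVert_F^2
      \;=\; \sum_{u,v} \left( A_{uv} - z_u^{\top} z_v \right)^2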

  • How is the adjacency matrix transformed in the case of random walk-based similarity?

    -The transformed matrix sums, over r = 1 to T (where T is the context window length, i.e., the length of the random walks), the r-th powers of D^-1 A (the inverse degree matrix times the adjacency matrix), multiplies by D^-1 and by the volume of the graph (the sum of the adjacency matrix entries, which is twice the number of edges), takes an elementwise logarithm, and subtracts the log of the number of negative samples used in the optimization.
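
    As a rough sketch of building that transformed matrix (following the NetMF formulation of Qiu et al. referenced in the lecture; the exact constants and the numerical clamp below are assumptions, assuming NumPy):

      import numpy as np

      def deepwalk_matrix(A, T=5, b=1):
          """Sketch of the matrix that DeepWalk implicitly factorizes (NetMF-style).

          A: adjacency matrix, T: context window length, b: number of negative samples.
          """
          deg = A.sum(axis=1)
          D_inv = np.diag(1.0 / deg)           # assumes no isolated nodes
          P = D_inv @ A                        # random-walk transition matrix
          S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1)) / T
          vol = A.sum()                        # volume of the graph = twice the number of edges
          M = vol * (S @ D_inv)
          return np.log(np.maximum(M, 1e-12)) - np.log(b)   # elementwise log, minus log(#negative samples)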

  • What are the limitations of node embeddings via matrix factorization or random walks?

    -The limitations include: 1) Inability to obtain embeddings for nodes not in the training set, meaning new nodes require re-computation of all embeddings. 2) Failure to capture structural similarity, as the embeddings are based on immediate neighbors rather than overall structure. 3) Inability to utilize node, edge, or graph-level features in the embedding process.

  • What is the solution proposed for the limitations of node embeddings?

    -The solution proposed is deep representation learning and graph neural networks, which will be discussed in the next lecture. These methods aim to resolve the limitations by fusing feature information with structured information.

  • How is PageRank discussed in the lecture?

    -PageRank is discussed in three ways: 1) As a flow formulation in terms of links and nodes, 2) As the stationary distribution of a random walk process, and 3) From a linear algebra perspective, by computing the leading eigenvector of a transformed graph adjacency matrix.

  • What are the different ways of viewing graphs mathematically as mentioned in the lecture?

    -Graphs can be viewed mathematically as matrices, where the adjacency matrix is factorized, eigenvectors and eigenvalues are computed, and connectivity information is extracted using linear algebra tools.

  • What is the significance of the lecture's conclusion about viewing graphs as matrices?

    -The conclusion emphasizes that viewing graphs as matrices plays a key role in various algorithms, including matrix factorization, eigenvector computation, and connectivity analysis, which are all crucial for understanding and working with graph structures.

Outlines

00:00

📊 Matrix Factorization and Embeddings

The lecture concludes with a discussion on the connection between embeddings and matrix factorization. The lecturer revisits the concept of embedding matrix 'z', where each column represents an embedding for a node in the graph. The goal is to maximize the dot product between similar nodes' embeddings. Similarity is defined by co-appearance in random walks. The lecturer then explores a simpler definition of similarity: nodes are similar if they are directly connected by an edge, aiming for a dot product of 1 if connected and 0 if not. This leads to the matrix factorization problem where the adjacency matrix 'A' is approximated by the product of 'Z' transposed and 'Z'. The limitations of this approach are discussed, including the inability to capture structural similarity and the inability to include node or edge features.
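
A small illustrative sketch of the edge-connectivity objective (the toy graph and names are hypothetical, assuming NumPy): learn Z by gradient descent so that Z transposed times Z approximates A under the squared Frobenius norm.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy adjacency matrix: two triangles (nodes 0-2 and 3-5).
    A = np.zeros((6, 6))
    A[:3, :3] = 1 - np.eye(3)
    A[3:, 3:] = 1 - np.eye(3)

    d, n = 2, A.shape[0]
    Z = 0.1 * rng.normal(size=(d, n))
    lr = 0.05
    for _ in range(2000):
        R = Z.T @ Z - A              # residual
        Z -= lr * 2 * Z @ (R + R.T)  # gradient step on ||A - Z^T Z||_F^2

    # Z^T Z only approximates A: with d much smaller than n an exact match is impossible,
    # but entries for pairs in the same triangle come out larger than for pairs across triangles.
    print(np.round(Z.T @ Z, 2))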

05:02

🔍 DeepWalk and Node2vec as Matrix Factorization

The lecturer explains that the more complex definition of similarity used in algorithms like DeepWalk and node2vec can also be viewed as a form of matrix factorization, but of a more complex transformed adjacency matrix. The transformation involves considering the lengths of random walks and the number of negative samples used in optimization. The lecturer highlights that the process of simulating random walks or directly factorizing the transformed adjacency matrix yields the same result. The limitations of these methods are also discussed, including the inability to obtain embeddings for new nodes not present in the training set and the inability to capture structural similarity.

10:07

🚀 Graph Neural Networks as a Solution

The final part of the lecture addresses the limitations of node embeddings via matrix factorization and random walks, such as the inability to capture structural similarity and the inability to utilize node or edge features. The lecturer introduces graph neural networks as a solution to these limitations, which will be discussed in the following lectures. The lecture summarizes by highlighting the importance of viewing graphs as matrices and the various ways this can be done, including matrix factorization, eigenvector computation, and connectivity extraction using linear algebra.

Keywords

💡Embeddings

Embeddings in the context of the video refer to the representation of nodes in a graph as points in a continuous vector space. This is crucial for tasks such as machine learning on graphs. The video discusses how embeddings can be learned such that similar nodes (as defined by co-occurrence in random walks) have high dot products of their embeddings. An example from the script is the 'embedding matrix z' where each column represents an embedding for a node.

💡Matrix Factorization

Matrix factorization is a method in linear algebra where a matrix is expressed as a product of other matrices. In the video, it is used to approximate the adjacency matrix of a graph using the embeddings of the nodes. The goal is to factorize the adjacency matrix A as Z^T * Z, where Z is the matrix of node embeddings. This is a key concept in network embedding algorithms like DeepWalk and node2vec.
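
As an illustrative alternative to gradient descent (an exposition choice, not a procedure from the video), a low-rank factorization of a symmetric matrix can be read off from its eigendecomposition, assuming NumPy:

    import numpy as np

    def factorize_symmetric(A, d):
        """Return a d x n matrix Z such that Z^T Z approximates the symmetric matrix A.

        Keeps the d largest eigenvalues, clamped at zero (Z^T Z is always positive
        semidefinite, so negative eigenvalues of A cannot be reproduced)."""
        w, V = np.linalg.eigh(A)         # ascending eigenvalues, orthonormal eigenvectors
        idx = np.argsort(w)[::-1][:d]    # indices of the top-d eigenvalues
        w_top = np.clip(w[idx], 0.0, None)
        Z = np.sqrt(w_top)[:, None] * V[:, idx].T
        return Z

    # Usage: Z = factorize_symmetric(A, d=16); then Z[:, u] is the embedding of node u.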

💡Adjacency Matrix

The adjacency matrix A is a square matrix used to represent a graph. An entry in the matrix indicates whether there is an edge between two nodes. The video discusses how the adjacency matrix can be factorized to find embeddings that capture node connectivity. For instance, if nodes u and v are connected, the dot product of their embeddings should approximate the corresponding entry in A.

💡Random Walk

A random walk on a graph is a process of traversing the graph by moving from node to node along the edges. The video explains how random walks can define the similarity between nodes, with nodes being similar if they co-appear in the same random walk. This concept is fundamental to algorithms like DeepWalk and node2vec, which use random walks to learn node embeddings.
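
A minimal unbiased random-walk simulator in the spirit of DeepWalk (node2vec would bias the step choice; the adjacency-list format and names here are assumptions):

    import random

    def random_walk(adj, start, length):
        """Unbiased random walk of a given length starting from `start`.

        adj: dict mapping each node to a list of its neighbors."""
        walk = [start]
        for _ in range(length - 1):
            neighbors = adj[walk[-1]]
            if not neighbors:          # dead end: stop the walk early
                break
            walk.append(random.choice(neighbors))
        return walk

    # Walks started from every node form the "corpus" used to define co-occurrence similarity.
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    walks = [random_walk(adj, v, length=5) for v in adj for _ in range(10)]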

💡Frobenius Norm

The Frobenius norm is a matrix norm that is used to measure the 'size' of a matrix. In the video, it is used to quantify the difference between the original adjacency matrix and its factorization in terms of embeddings: the squared norm of A minus Z^T Z is the sum of the squared differences between corresponding entries of the two matrices.

💡DeepWalk

DeepWalk is an algorithm for learning node embeddings by simulating random walks on a graph and then using the walks to create a corpus of node sequences, which can be input into a word2vec model. The video mentions DeepWalk in the context of random walk-based similarity and how it can be viewed as a form of matrix factorization.
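
A sketch of the DeepWalk pipeline, feeding random walks into a skip-gram model (assuming the gensim library; the toy walks and parameter values are assumptions, not from the video):

    from gensim.models import Word2Vec

    # Toy corpus of random walks over node IDs (word2vec expects string tokens).
    walks = [["0", "1", "2", "3", "2"],
             ["2", "1", "0", "2", "3"],
             ["3", "2", "0", "1", "2"]]

    model = Word2Vec(walks,
                     vector_size=16,   # embedding dimension d
                     window=5,         # context window T
                     sg=1,             # skip-gram
                     negative=5,       # number of negative samples b
                     min_count=0,
                     workers=1)

    embedding_of_node_2 = model.wv["2"]   # learned vector for node "2"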

💡Node2vec

Node2vec is another algorithm for learning node embeddings that improves upon DeepWalk by allowing for more nuanced exploration of the graph through biased random walks. The video discusses how the node2vec process is more complex than DeepWalk and results in a more complex matrix transformation for matrix factorization.

💡Graph Neural Networks

Graph Neural Networks (GNNs) are a class of neural networks designed to work with graph-structured data. The video mentions that GNNs can address some of the limitations of traditional node embedding techniques, such as the inability to capture structural similarity or utilize node/edge features.

💡Eigenvector

An eigenvector of a matrix is a non-zero vector that only changes by a scalar factor when that matrix is multiplied by it. The video discusses PageRank in terms of eigenvectors, specifically the eigenvector corresponding to the largest eigenvalue of a particular transformed adjacency matrix.

💡PageRank

PageRank is an algorithm used by Google Search to rank web pages in their search engine results. The video explains PageRank from multiple perspectives, including as a linear algebra problem involving the leading eigenvector of a transformed graph adjacency matrix, which helps in measuring the importance of nodes in a graph.
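
A compact power-iteration sketch of PageRank (assuming NumPy; the damping value and the handling of dangling nodes below are simplifying assumptions):

    import numpy as np

    def pagerank(A, beta=0.85, iters=100):
        """Power iteration on M = beta * (column-stochastic A) + teleportation."""
        n = A.shape[0]
        col_sums = A.sum(axis=0)
        M = A / np.where(col_sums == 0, 1.0, col_sums)   # column-stochastic; dangling columns stay zero
        r = np.ones(n) / n
        for _ in range(iters):
            r = beta * (M @ r) + (1 - beta) / n          # flow step plus uniform teleport
            r = r / r.sum()                              # renormalize (rank leaked at dangling nodes)
        return r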

Highlights

The lecture discusses the connection between node embeddings and matrix factorization.

An embedding matrix 'z' is introduced where rows represent the embedding dimension and columns represent nodes in the graph.

The goal in the encoder-decoder framework is to maximize the dot product between similar node pairs.

Similarity is defined through co-appearance in the same random walk.

A simpler notion of similarity is proposed where nodes are similar if they are directly connected by an edge.

Matrix factorization is used to approximate the adjacency matrix 'A' as the product of two matrices, Z transposed and Z.

The embedding dimension 'D' is typically much smaller than the number of nodes, leading to an approximation rather than an exact match.

Frobenius norm is used to measure the similarity between the original adjacency matrix and its factorization.

The inner product decoder with node similarity defined by edge connectivity is equivalent to the factorization of the adjacency matrix.

Random walk-based similarity can also be viewed as a form of matrix factorization, even with a more complex adjacency matrix transformation.

The limitations of node embeddings via matrix factorization or random walks include the inability to obtain embeddings for nodes not in the training set.

Embeddings cannot capture structural similarity, as they focus more on the identities of neighbors rather than local network structure.

These approaches cannot utilize node, edge, or graph-level features, as they create node embeddings separately from any features the nodes might have.

Graph neural networks are introduced as a solution to the limitations of node embeddings.

Graph neural networks allow for the fusion of feature information with structured information.

The lecture summarizes the discussion on PageRank, random walk extensions, and node embeddings based on random walks.

Viewing graphs as matrices plays a key role in various algorithms, including matrix factorization, eigenvector computation, and connectivity extraction.

Transcripts

play00:03

So this now concludes,

play00:06

uh, the first two parts of the lecture.

play00:08

And what I wanna do in the, uh,

play00:10

remaining, uh, few minutes,

play00:12

10 minutes or so, I wanna talk about

play00:14

connection between node embeddings and matrix factorization.

play00:18

So, uh, let me refresh you what we talked,

play00:22

uh, what we talked, uh, on Tuesday.

play00:24

So we talked about embeddings and we talked

play00:27

about how we have this embedding matrix z that, you know,

play00:30

the number of rows is the e- embedding dimension,

play00:34

and the number of columns is the number of nodes,

play00:37

uh, in the graph.

play00:38

And this means that every column of this matrix z

play00:41

will store an embedding for that given node.

play00:45

And our objective was in this encoder-decoder framework is that we wanna maximize,

play00:50

um, the- the- the dot product between pairs of nodes,

play00:54

uh, that are similar.

play00:55

So if two nodes are similar, uh,

play00:57

then their dot product of their embeddings of

play01:00

their columns in this matrix z, uh, has to be high.

play01:03

And that was the- that was the idea.

play01:06

Then of course, how did you define this notion of similarity?

play01:09

We said two nodes are similar if they co-appear in the same random walk,

play01:14

uh, starting at the given node.

play01:16

So now, you know,

play01:18

how- how can we,

play01:20

uh, think about this more broadly, right?

play01:22

You could say we define the notion of

play01:25

similarity through random walks in the previous lecture.

play01:27

What if we define an even simpler notion of, uh, similarity?

play01:31

What if we say, two nodes are similar if they are connected by an edge.

play01:35

Then what this means is we say, "Oh,

play01:37

I want to approximate this matrix,

play01:39

say, like, the entry u, v in the matrix,

play01:42

say this is either 0 or 1 by this dot product?"

play01:46

Like I'm saying, if two nodes are connect- u and v are connected,

play01:49

then I want their dot product to be 1.

play01:51

And if they are not connected,

play01:53

I want their dot product to be 0, right?

play01:56

Um, of course, like if I write it this way at the level of the single entry,

play02:01

if I write it in the- in the matrix form,

play02:04

this- I'm writing basically saying,

play02:06

this is Z trans- Z transposed times Z equals A.

play02:10

So this is now called matrix factorization because I take my matrix A and I factorize it,

play02:17

I represent it as a product of,

play02:20

um- of, uh, two matrices.

play02:22

Um, Z- Z transposed and Z.

play02:25

So essentially I'm saying I take my adjacency matrix, uh, A.

play02:28

Here is, for example, one entry of it.

play02:30

I want- because there is an edge, I want, you know,

play02:33

the- the- the dot product of this row and that column to be value 1.

play02:37

For example, for, uh, for this entry,

play02:40

I would want the product of this,

play02:43

uh, row and this particular column to be 0.

play02:46

So this means that we take matrix A and factorize it as a product of Z transpose times Z.

play02:53

Of course, because the embedding dimension D, the number of rows in Z,

play02:59

is much, much smaller than the number of, uh- uh, nodes, right?

play03:03

So the matrix is very wide,

play03:05

but not too- too deep.

play03:06

This means that this exact approximation saying

play03:10

A equals Z transpose times Z is generally not possible, right?

play03:15

We don't have enough representation power

play03:17

to really capture each edge perfectly.

play03:19

So we can learn this matrix Z approximately.

play03:23

So what we can say is,

play03:25

let's find the matrix Z such that, you know, the, uh,

play03:28

Z transpose times Z, uh,

play03:31

the values of it are as similar to the values of A as possible.

play03:35

How do I- how do I measure similarity now between

play03:38

values is to use what is called Frobenius norm,

play03:41

which is simply take an entry in A,

play03:44

subtract an entry in this,

play03:46

uh- in this matrix Z transpose times Z,

play03:49

um, and take the square value of it and sum it up.

play03:52

So Frobenius norm is simply, uh,

play03:55

a sum of the square differences, uh,

play03:57

between corresponding entries of the, uh- the two matrices.

play04:01

Right? So this is now very similar to what we were

play04:05

also doing in the previous, uh, lecture, right?

play04:09

In the node embeddings lecture.

play04:10

In the node embeddings lecture,

play04:11

we were not using the L2 norm to- to define

play04:15

the discrepancy between A and its factorization,

play04:20

we used the softmax,

play04:23

uh, uh, function instead of the L2 norm.

play04:26

But the goal of approximating A with Z transpose times Z was the same.

play04:31

So the conclusion is that the inner product decoder with node similarity defined by

play04:37

edge- edge connectivity is equivalent to

play04:40

the factorization of adjacency matrix, uh, A, right?

play04:44

So we say we wanna directly approximate A by

play04:47

the embeddings of the nodes such that if two nodes are linked,

play04:50

then I want their dot product to be equal to 1.

play04:53

And if they are not linked,

play04:55

I want their dot product to be equal to 0.

play04:57

So that's the, uh- the simplest way to define node similarity.

play05:02

Now, in random walk-based similarity,

play05:05

it turns out that when we have this more,

play05:08

um, nuanced, more complex definition of, uh,

play05:11

similarity, then, um, it is still- the entire process is still equivalent to

play05:17

matrix factorization of just- of a more complex or transformed, uh, adjacency matrix.

play05:24

So, um, here's the equation,

play05:26

let me explain it, uh, on the next slide, what does it mean.

play05:29

So it means that what we are really trying to do is

play05:32

we are trying to factorize this particular,

play05:35

um, transformed graph adjacency matrix.

play05:39

So how is the the- uh,

play05:41

what- what is the transformation?

play05:43

The transformation is- here is the, um, adjacency matrix.

play05:48

Here is the, um, one over the- the diagonal matrix,

play05:51

these are the node degrees.

play05:53

Um, this is now, uh,

play05:55

raised to the power r,

play05:57

where this- r goes between 1 and T where, uh,

play06:00

capital T is the context window length, which is actually the length of the random walks that,

play06:07

uh, we are simulating in,

play06:08

uh- in DeepWalk or, uh, node2vec.

play06:11

Um, volume of G is simply the sum of the entries of the adjacency matrix,

play06:16

we see- which is twice the number of edges.

play06:18

And the log b is a factor that corresponds to

play06:22

the number of negative samples we are using in the optimization problem.

play06:26

So basically, what this means is that you can compute,

play06:29

um- in this case,

play06:31

DeepWalk either by what we talked about last time, by simulating random walks

play06:36

and defining the gradients and doing gradient descent or taking the adjacency matrix,

play06:42

A, of your graph,

play06:44

transforming it according to this equation where

play06:47

we both take into account the lengths of the random walks,

play06:50

as well as the number of negative samples we use.

play06:53

And if we factorize this, uh,

play06:55

transformed matrix, what I mean by this is if we go and, uh,

play06:59

replace this A here with the transformed matrix and then we try to solve,

play07:04

uh, this- this equation,

play07:06

uh, then the, uh, the solution to it,

play07:07

this matrix Z, will be exactly the same as what,

play07:11

uh- what we- what, uh,

play07:13

our approach that we discussed in the previous lecture arrived to.

play07:17

So, um, basically that is a very- very nice paper.

play07:21

Um, if you wanna know more about this, it's called Network Embedding as

play07:24

Matrix Factorization, that basically unifies DeepWalk, LINE,

play07:28

um, and a lot of other algorithms including node2vec in this,

play07:32

uh, one mathematical, uh, framework.

play07:35

So, um, as I said,

play07:38

this random walk-based similarity can also be thought of as, um, matrix factorization.

play07:44

The equation I show here is for DeepWalk.

play07:46

Um, you can derive similar type of matrix transformation for, uh, node2vec.

play07:53

Just the matrix is even more complex because the random walk process of node2vec is,

play07:58

uh- is more complex,

play07:59

is more, uh- more nuanced.

play08:02

So to conclude the lecture today,

play08:05

I wanna talk a bit about the limitations and kind of

play08:07

motivate what are we going to talk about next week.

play08:11

So the limitations of node embeddings via this matrix factorization or this random walks,

play08:16

uh, like- like I discussed in terms of, um,

play08:21

node2vec, DeepWalk for multidimensional embeddings,

play08:24

you can even think of PageRank as a single-dimensional embedding is that,

play08:29

um, we cannot obtain embeddings for nodes not in the training set.

play08:33

So this means if our graph is evolving,

play08:35

if new nodes are appearing over time, then, uh,

play08:38

the nodes that are not in the graph, uh,

play08:40

at the time when we are computing embeddings won't have an embedding.

play08:44

So if a newly added node 5,

play08:46

for example, arrives, let's say, at the test time,

play08:48

let's say- or is a new user in the social network,

play08:51

we can- we cannot compute its embedding,

play08:53

we have to redo everything from scratch.

play08:56

We have to recompute all embeddings of all nodes in the network.

play08:59

So this is very limiting.

play09:02

The second important thing is that this, uh,

play09:04

embeddings cannot capture structural similarity.

play09:09

Uh, the reason being is, for example,

play09:11

if I have the graph like I show here and I consider nodes,

play09:14

uh, 1 and 11.

play09:16

Even though they are in very different parts of the graph,

play09:19

their local network structure, uh,

play09:21

looks- looks quite similar.

play09:24

Um, and DeepWalk and node2vec will come up with

play09:27

very different embeddings for 11 and- and 1 because,

play09:31

you know, 11 is neighbor with 12 and 13,

play09:34

while 1 is neighbor with 2 and 3.

play09:37

So this means that they- these types of embeddings

play09:41

won't be able to capture this notion of local structural similarity,

play09:45

but it will- more be able to capture who- who- what are the identities of the neighbors,

play09:50

uh, next, uh- next to a given starting node.

play09:53

Of course, if you were to define these over anonymous walks,

play09:57

then, um, that would cap- allow us to capture the structure because, uh,

play10:01

2 and 3, um, would- would- basically the identities of the nodes,

play10:06

um, uh, are forgotten, and then we would be able to,

play10:09

uh, solve this, uh, problem.

play10:11

And then the last limitation I wanna talk about is- is that

play10:16

these approaches cannot utilize node, edge, and graph-level features,

play10:20

meaning, you know, feature vectors attached to nodes,

play10:24

uh, edges and graphs cannot naturally be incorporated in this framework.

play10:28

Right? We basically create node embedding separately

play10:31

from the features that these nodes might have.

play10:34

So for example, you know, a user in

play10:35

a social network may have a set of properties, features,

play10:38

attributes, or a protein has a set of properties,

play10:41

features, that you would want to use in creating, uh, embeddings.

play10:45

And really, what are we going to talk next is,

play10:48

um, you know, what are the solutions to these limitations?

play10:51

The solution for- to these limitations is deep representation learning and,

play10:56

uh, graph neural networks.

play10:57

And in the next week,

play10:58

we are going to move, uh,

play11:00

to the topic of, uh,

play11:01

graph neural networks that will allow us to

play11:04

resolve these limitations that I have just, uh,

play11:07

discussed, and will allow us to fuse the feature

play11:10

information together with the structured information.

play11:13

So to summarize today's lecture,

play11:16

we talked about PageRank that measures importance of nodes in a graph.

play11:20

Uh, we talked about it in three different ways,

play11:23

we talked about it in terms of flow formulation,

play11:27

in terms of links, and nodes.

play11:29

We talked about it in terms of

play11:31

random walk and stationary distribution of a random walk process.

play11:35

And we also talked about it from the linear algebra point of view, uh,

play11:38

by basically computing the eigenvector, uh,

play11:41

to the particularly transformed,

play11:43

uh, graph adjacency matrix.

play11:45

Then we talked about, um,

play11:48

extensions of PageRank particular random walk

play11:51

with restarts and personalized PageRank where

play11:53

basically the difference is in terms of changing the teleportation set.

play11:59

Um, and then the last part of the lecture,

play12:02

we talked about node embeddings based on random walks and

play12:06

how they can be rep- expressed as a form of matrix factorization.

play12:10

So this means that viewing graphs as matrices is a-plays

play12:15

a key role in all of the above algorithms

play12:18

where we can think of this in many different ways.

play12:21

But mathematically at the end,

play12:23

it's all about mat- matrix representation of the graph, factorizing this matrix,

play12:27

computing eigenvectors, eigenvalues, and extracting connectivity information,

play12:32

uh, out of it with the tools of, uh, linear algebra.

play12:36

So, um, thank you very much,

play12:39

everyone, for the- for the lecture.

Related Tags
Graph Theory, Machine Learning, DeepWalk, Node2vec, Matrix Factorization, Embeddings, Random Walks, Adjacency Matrix, Graph Neural Networks, Linear Algebra