Stanford CS224W: Machine Learning with Graphs | 2021 | Lecture 3.1 - Node Embeddings
Summary
TLDR: This lecture introduces node embeddings for graph representation learning, aiming to automate feature engineering. It discusses traditional machine learning approaches on graphs and the shift towards learning features automatically without manual intervention. The lecture explains the concept of mapping nodes into a low-dimensional space where node similarities in the network are preserved in the embedding space, useful for various prediction tasks. It also touches on methods like DeepWalk and node2vec and their unsupervised nature, focusing on network structure rather than node labels or features.
Takeaways
- 📚 The lecture introduces node embeddings, a technique to represent nodes in a graph as vectors in a continuous space.
- 🔍 Traditional machine learning on graphs involves extracting features that describe the topological structure and attributes of the network.
- 🤖 Graph representation learning aims to automate the feature engineering process by learning features directly from the graph structure.
- 🧭 The goal of node embeddings is to map nodes into a space where the structure of the network is captured, allowing for similarity measurements between nodes.
- 📈 Node similarity in the embedding space is often measured by the dot product, which for normalized vectors equals the cosine of the angle between them.
- 🌐 The embeddings can be used for various tasks such as node classification, link prediction, graph classification, anomaly detection, and clustering.
- 📊 DeepWalk, introduced in 2014, is an early method for learning node embeddings by treating random walks as sentences in a language model.
- 🔑 The adjacency matrix is used to represent the graph without assuming any features or attributes on the nodes.
- 🔄 The encoder-decoder framework is used to formulate the task of learning node embeddings, where the encoder maps nodes to embeddings and the decoder maps back to a similarity score.
- 🔢 The embedding matrix Z is a large matrix where each column corresponds to an embedding vector for a node, and its size grows with the number of nodes.
- 🚀 Methods like DeepWalk and node2vec are unsupervised, learning embeddings based on network structure without relying on node labels or features.
Q & A
What is the main focus of Lecture 3?
-The main focus of Lecture 3 is on node embeddings, which is a method in graph representation learning that aims to automatically learn features of a network without manual feature engineering.
What is traditional machine learning in graphs?
-Traditional machine learning in graphs involves extracting topological features from a given input graph and combining them with attribute-based information to train a classical machine learning model for predictions.
What is the goal of graph representation learning?
-The goal of graph representation learning is to alleviate the need for manual feature engineering by automatically learning the features of the network structure that can be used for various prediction tasks.
What is a node embedding?
-A node embedding is a vector representation of a node in a graph, where the vector captures the structure of the underlying network and is used to indicate the similarity between nodes in the network.
Why is creating node embeddings useful?
-Creating node embeddings is useful because it allows for the automatic encoding of network structure information, which can be used for various downstream tasks such as node classification, link prediction, graph classification, anomaly detection, and clustering.
What is DeepWalk and how does it relate to node embeddings?
-DeepWalk is a method introduced in 2014 that learns node embeddings by simulating random walks on the graph. It is significant because it was one of the pioneering approaches to learning node embeddings.
How are node embeddings represented mathematically in the lecture?
-In the lecture, node embeddings are represented as a set of coordinates in a d-dimensional space, denoted by the letter Z, where the similarity of two nodes in the network is approximated by the dot product of their coordinates in the embedding space.
What is the role of the adjacency matrix in graph representation learning?
-The adjacency matrix plays a crucial role in graph representation learning as it represents the graph structure without assuming any features or attributes on the nodes, allowing the learning algorithms to focus solely on the network's topology.
What is the difference between a shallow encoder and a deep encoder in the context of node embeddings?
-A shallow encoder is a simple embedding lookup, where the parameters to optimize are the embedding matrix Z. In contrast, a deep encoder, such as graph neural networks, uses a more complex approach to compute node embeddings.
How are node similarities defined in the context of node embeddings?
-Node similarities are defined based on random walks in the network. The embeddings are optimized so that nodes that are similar according to the random-walk similarity measure are close together in the embedding space.
What are some of the practical methods mentioned for learning node embeddings?
-The lecture mentions DeepWalk and node2vec as practical methods for learning node embeddings. These methods aim to capture the network structure in a low-dimensional vector space.
Outlines
🌐 Introduction to Node Embeddings
The lecture introduces the concept of node embeddings in the context of graph representation learning. It discusses the traditional approach to machine learning on graphs, where topological features are extracted and combined with attribute-based information to train models for predictions. The lecturer highlights the manual effort required in feature engineering and the desire to automate this process. The idea is to learn the structure of the network automatically, eliminating the need for manual feature engineering. The goal is to create an embedding space where the similarity of node embeddings reflects their similarity in the network, which can be used for various prediction tasks. The lecture also introduces the concept of using adjacency matrices to represent graphs without assuming any node features.
🔍 Defining Node Similarity and Embedding Space
This paragraph delves into how to represent a graph using an adjacency matrix and the concept of encoding nodes into an embedding space. The lecturer explains that the goal is to map nodes in such a way that their similarity in the embedding space approximates their similarity in the graph. The use of dot product as a similarity measure in the embedding space is introduced, which correlates to the angle between vectors. The paragraph also discusses the encoder-decoder framework, where the encoder maps nodes to embeddings, and the decoder maps from embeddings to a similarity score. The lecturer emphasizes the need to define a node similarity function and an objective function that connects the similarity with the embeddings. The approach to learning the embeddings is described as task-independent, meaning it's not trained on a specific prediction task or node labels.
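The encoder-decoder framing described above can be sketched in a few lines of NumPy; the sizes and the random initialization here are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes, d = 5, 8                   # illustrative sizes
Z = rng.normal(size=(d, num_nodes))   # embedding matrix: one column per node

def encode(v):
    """Shallow encoder: look up column v of the embedding matrix Z."""
    return Z[:, v]

def decode(z_u, z_v):
    """Decoder: dot-product similarity between two embeddings."""
    return float(np.dot(z_u, z_v))

score = decode(encode(0), encode(1))  # similarity score for nodes 0 and 1
```

Learning then means adjusting the entries of Z so that decoded scores match the chosen notion of similarity in the graph.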
📈 Scalability and Methods for Learning Node Embeddings
The lecturer addresses the scalability issue of learning node embeddings, noting that the parameter count grows with the number of nodes in the network. While these methods can be scaled to millions of nodes, they may become slow due to the need to estimate parameters for each node. The paragraph introduces two methods for learning node embeddings: DeepWalk and node2vec. The lecturer summarizes the encoder-decoder framework, mentioning that a shallow encoder is used, which is simply an embedding lookup, and the decoder is based on node similarity through dot product. The objective is to maximize the dot product for node pairs that are similar according to a node similarity function. The paragraph concludes by discussing the unsupervised nature of these methods, where node embeddings are learned without utilizing node labels or features, focusing solely on capturing network structure.
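The parameter count behind the scalability concern is just the product of the node count and the embedding dimension; a tiny sketch (the billion-node figure echoes the lecture's example, the helper name is ours):

```python
# A shallow (lookup) encoder stores one d-dimensional vector per node,
# so the number of parameters is |V| * d.
def shallow_param_count(num_nodes, d):
    return num_nodes * d

# The lecture's billion-node example with an embedding dimension of 1,000:
params = shallow_param_count(10**9, 1000)  # 10^12 parameters
```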
Keywords
💡Node Embeddings
💡Graph Representation Learning
💡Feature Engineering
💡Adjacency Matrix
💡DeepWalk
💡Node2Vec
💡Random Walks
💡Encoder-Decoder Framework
💡Dot Product
💡Unsupervised Learning
Highlights
Introduction to node embeddings in graph representation learning.
Traditional machine learning on graphs involves extracting topological features for prediction tasks.
Graph representation learning aims to automate the feature engineering process.
The concept of learning node embeddings to represent network structure without manual feature engineering.
Node embeddings map nodes into a d-dimensional space to capture network structure.
Similarity between node embeddings indicates their similarity in the network.
Applications of node embeddings include node classification, link prediction, and anomaly detection.
DeepWalk, a method introduced in 2014, is illustrated by visualizing learned node embeddings in a two-dimensional space.
Node embeddings can reveal structural patterns in small toy networks.
The adjacency matrix represents a graph without assuming any node features.
The goal is to encode nodes so that embedding similarity approximates graph similarity.
Encoder maps nodes to embeddings; decoder maps embeddings to similarity scores.
Dot product is commonly used as a decoder to measure similarity in the embedding space.
The simplicity of the encoder as an embedding-lookup matrix.
Challenges of estimating a large number of parameters for node embeddings in large graphs.
Once estimated, retrieving node embeddings is as simple as a matrix lookup.
DeepWalk and node2vec are methods for learning node embeddings from random walks.
Node embeddings are task-independent and do not require node labels or features.
The importance of defining node similarity for optimizing embeddings.
Transcripts
This is Lecture 3 of our class and we are going to talk today about node embeddings.
So the way we think of this is the following.
Um, the inter- what we talked
about last week was traditional machine learning in graphs,
where the idea was that given an input graph,
we are going to extract some node link or graph level
features that basically describe the topological structure,
uh, of the network,
either around the node,
around a particular link or the entire graph.
And then we can take that topological information,
compare, um, er, uh, um,
combine it with the attribute-based information to
then train a classical machine learning model
like a support vector machine or a logistic regression,
uh, to be able to make predictions.
So, um, in this sense, right,
the way we are thinking of this is that we are given an input graph here.
We are then, uh,
creating structure- structured features or structural features,
uh, of this graph so that then we can apply
our learning algorithm and make, uh, prediction.
And generally most of the effort goes here into the feature engineering,
uh, where, you know, we are as,
uh, engineers, humans, scientists,
we are trying to figure out how to best describe,
uh, this particular, um,
network so that, uh,
it would be most useful, uh,
for, uh, downstream prediction task.
Um, and, uh, the question then becomes,
uh, can we do this automatically?
Can we kind of get away from, uh, feature engineering?
So the idea behind graph representation learning is that we wanna
alleviate this need to do manual feature engineering every single time, every time for,
uh, every different task,
and we wanna kind of automatically learn the features,
the structure of the network,
um, in- that we are interested in.
And this is what is called, uh,
representation learning so that no manual, uh,
feature engineering is, uh, necessary, uh, anymore.
So the idea will be to do
efficient task-independent feature learning for machine learning with the graphs.
Um, the idea is that for example,
if we are doing this at the level of individual nodes,
that for every node,
we wanna learn how to map this node in a d-dimensional
space ha- and represent it as a vector of d numbers.
And we will call this vector of d numbers as feature representation,
or we will call it, um, an embedding.
And the goal will be that this, uh, mapping, um,
happens automatically and that this vector
captures the structure of the underlying network that,
uh, we are, uh, interested in,
uh, analyzing or making predictions over.
So why would you wanna do this?
Why create these embeddings?
Right. The task is to map nodes into an- into an embedding space.
Um, and the idea is that similarity, uh,
of the embeddings between nodes indicates their similarity in the network.
Uh, for example, you know,
if bo- nodes that are close to each other in the network,
perhaps they should be embedded close together in the embedding space.
Um, and the goal of this is that kind of en-
automatically encodes the network, uh, structure information.
Um, and then, you know,
it can be used for many kinds of different downstream prediction tasks.
For example, you can do any kind of node classification, link prediction,
graph classification, you can do anomaly detection,
you can do clustering,
a lot of different things.
So to give you an example, uh,
here is- here is a plot from a- a paper that came up with
this idea back in 2014.
The method is called DeepWalk.
Um, and they take this, uh, small, uh,
small network that you see here,
and then they show how the embedding of nodes would look like in two-dimensions.
And- and here the nodes are,
uh, colored by different colors.
Uh, they have different numbers.
And here in the, um, in this example, uh,
you can also see how, um, uh,
how different nodes get mapped into different parts of the embedding space.
For example, all these light blue nodes end up here,
the violet nodes, uh,
from this part of the network end up here,
you know, the green nodes are here,
the bottom two nodes here,
get kind of set, uh,
uh on a different pa- uh,
in a different place.
And basically what you see is that in some sense,
this visualization of the network and
the underlying embedding correspond to each other quite well in two-dimensional space.
And of course, this is a small network.
It's a small kind of toy- toy network,
but you can get an idea about, uh,
how this would look like in,
uh, uh- in, uh,
more interesting, uh, larger,
uh- in larger dimensions.
So that's basically the,
uh- that's basically the, uh, idea.
So what I wanna now do is to tell you about how do we formulate this as a task, uh,
how do we view it in this, uh,
encoder, decoder, uh, view or a definition?
And then what kind of practical methods, um,
exist there, uh for us to be able, uh, to do this.
So the way we are going to do this, um,
is that we are going to represent, uh,
a graph, as a- as a- with an adjacency matrix.
Um, and we are going to think of this,
um, in terms of its adjacency matrix,
and we are not going to assume any feature, uh, uh,
represe- features or attributes,
uh, on the nodes, uh, of the network.
So we are just going to- to think of this as a- as a- as a set of,
um, as a- as an adjacency matrix that we wanna- that we wanna analyze.
Um, we are going to have a graph, as I showed here,
and the corresponding adjacency matrix A.
And for simplicity, we are going to think of these as undirected graphs.
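As a minimal illustration of this setup, here is an adjacency matrix for a toy undirected graph; the edge list is made up, not the lecture's example:

```python
import numpy as np

# Toy undirected graph given as an edge list (made-up example).
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
n = 4

# Build the adjacency matrix A; undirected edges make it symmetric.
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u, v] = 1
    A[v, u] = 1
```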
So the goal is to encode nodes so that similarity in the embedding space- uh,
similarity in the embedding space,
you can think of it as distance or as a dot product,
as an inner product of the coordinates of two nodes
approximates the similarity in the graph space, right?
So the idea will be that in- or in the original network,
I wanna to take the nodes,
I wanna map them into the embedding space.
I'm going to use the letter Z to denote the coordinates,
uh, of that- of that embedding,
uh, of a given node.
Um, and the idea is that, you know,
some notion of similarity here
corresponds to some notion of similarity in the embedding space.
And the goal is to learn this encoder that encodes
the original network as a set of, uh, node embeddings.
So the goal is to- to define the similarity in the original network, um,
and to map nodes into the coordinates in the embedding space such that, uh,
similarity of their embeddings corresponds to the similarity in the network.
Uh, and as a similarity metric in the embedding space, uh,
people usually, uh, select, um, dot product.
And the dot product is closely related to the angle,
uh, between the two vectors, right?
So when you do the dot product of normalized vectors,
it's the cosine of the- of the angle.
So if the two points are close together or in the same direction from the origin,
they have, um, um,
high, uh, uh, dot product.
And if they are orthogonal,
so there is kind of a 90-degree angle, uh,
then- then they are as- as dissimilar as
possible because the dot product will be, uh, zero.
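A quick sketch of this point about angles: for unit-length vectors the dot product is exactly the cosine of the angle, so parallel vectors score 1 and orthogonal vectors score 0. The vectors below are illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (dot product of unit vectors)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])

same = cosine(u, u)        # 1.0: same direction, maximally similar
orthogonal = cosine(u, w)  # 0.0: 90-degree angle, dot product is zero
```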
So that's the idea. So now what do we need to
define is we need to define this notion of, uh,
ori- similarity in the original network and we need to define then
an objective function that will connect the similarity with the, uh, embeddings.
And this is really what we are going to do ,uh, in this lecture.
So, uh, to summarize a bit, right?
Encoder maps nodes, uh, to embeddings.
We need to define a node similarity function,
a measure of similarity in the original network.
And then the decoder, right,
maps from the embeddings to the similarity score.
Uh, and then we can optimize the parameters such that
the decoded similarity corresponds as closely as
possible to the underlying definition of the network similarity.
Where here we're using a very simple decoder,
as I said, just the dot-product.
So, uh, encoder will map nodes into low-dimensional vectors.
So encoder of a given node will simply be the coordinates or the embedding of that node.
Um, we talked about how we are going to define the similarity
in the embedding space in terms of the decoder,
in terms of the dot product.
Um, and as I said, uh,
the embeddings will be in some d-dimensional space.
You can think of d, you know, between,
let's say 64 up to about 1,000,
this is usually how- how,
uh, how many dimensions people, uh, choose,
but of course, it depends a bit on the size of the network,
uh, and other factors as well.
Um, and then as I said,
the similarity function specifies how the relationship in
the- in the vector space map to the relationship in the,
uh, original ,uh, in the original network.
And this is what I'm trying to ,uh,
show an example of, uh, here.
So the simplest encoding approach is that an encoder is just an embedding-lookup.
So what- what I mean by this is that an
encoding of a given node is simply a vector of numbers.
And this is just a lookup in some big matrix.
So what I mean by this is that our goal will be to learn this matrix Z,
whose dimensionality is d,
the embedding dimension times the number of nodes,
uh, in the network.
So this means that for every node we will have a column
that is reserved to store the embedding for that node.
And this is what we are going to learn,
this is what we are going to estimate.
And then in this kind of notation,
you can think of v simply as an indicator vector that has all zeros,
except the value of one in the column
indicating the- the ID of that node v. And- and what this
will do pictorially is that basically you can think of
Z as this matrix that has one column per node,
um, and the column store- a given column stores the embedding of that given node.
So the size of this matrix will be number of nodes times the embedding dimension.
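The indicator-vector view of the lookup can be checked in a few lines; the matrix here is a small made-up example:

```python
import numpy as np

d, num_nodes = 3, 4
Z = np.arange(d * num_nodes, dtype=float).reshape(d, num_nodes)

# Indicator (one-hot) vector for node 2: all zeros except a one
# in the position of that node's ID.
v = np.zeros(num_nodes)
v[2] = 1.0

# Multiplying Z by the indicator vector selects column 2 of Z --
# exactly the same result as the direct lookup Z[:, 2].
z_v = Z @ v
```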
And people now who are, for example,
thinking about large graphs may already have a question.
You know, won't this be a lot of parameters to estimate?
Because the number of parameters in this model is basically the number of entries, uh,
of this matrix, and this matrix gets very large because
it dep- the size of the matrix depends on the number of nodes in the network.
So if you want to do a network of one billion nodes,
then the dimensionality of this matrix would be one billion times,
let's say, a thousand, uh,
the embedding dimension, and that's- that's,
uh, that's a lot of parameters.
So these methods won't necessarily be the most scalable, you can scale them,
let's say up to millions or a million nodes or, uh,
something like that if you- if you really try,
but they will be slow because for every node
we essentially have to estimate the parameters.
Basically for every node we have to estimate its embedding- embedding vector,
which is described by the d numbers, d parameters,
d coordinates that we have to estimate.
So, um, but this means that once we have estimated these embeddings,
getting them is very easy.
It's just, uh, a lookup in this matrix where everything is stored.
So this means, as I said,
each node is assigned a unique embedding vector.
And the goal of our methods will be to directly optimize or ,uh,
learn the embedding of each node separately in some sense.
Um, and this means that, uh,
there are many methods that will allow us to do this.
In particular, we are going to look at two methods.
One is called, uh,
DeepWalk and the other one is called node2vec.
So let me- let me summarize.
In this view we talked about an encoder ,uh, decoder, uh,
framework where we have what we call a shallow encoder because it's
just an embedding-lookup. The parameters to optimize, um, are- are,
uh, very simple, it is just this embedding matrix Z. Um,
and for every node we want to identify the embedding z_u.
And what we are going to cover in the future lectures is
deep encoders like graph neural networks that- that,
uh, are a very different approach to computing, uh, node embeddings.
In terms of a decoder,
the decoder for us would be something very simple.
It'd be simple- simply based on the node similarity, based on the dot product.
And our objective function that we are going to try to
learn is to maximize the dot product
of node pairs that are similar according to our node similarity function.
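One way to picture this objective is a single gradient-ascent step that nudges the embeddings of a "similar" pair toward a larger dot product; this is only an illustrative sketch of the idea, not DeepWalk's or node2vec's actual loss:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(4, 6))  # d=4 embeddings for 6 nodes

def ascent_step(Z, u, v, lr=0.1):
    """One gradient-ascent step on the dot product z_u . z_v.

    The gradient of z_u . z_v with respect to z_u is z_v (and vice
    versa), so each embedding is nudged toward the other.
    """
    zu, zv = Z[:, u].copy(), Z[:, v].copy()
    Z[:, u] += lr * zv
    Z[:, v] += lr * zu
    return Z

before = float(Z[:, 0] @ Z[:, 1])
Z = ascent_step(Z, 0, 1)
after = float(Z[:, 0] @ Z[:, 1])  # larger than before: the pair moved closer
```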
So then the question is,
how do we define the similarity, right?
I've been talking about it,
but I've never really defined it.
And really this is how these methods are going to differ between each other,
is how do they define the node similarity notion?
Um, and you could ask a lot of different ways how- how to do this, right?
You could say,
"Should two nodes have similar embedding if they are perhaps linked by an edge?"
Perhaps they share many neighbors in common,
perhaps they have something else in common or they are in a similar part of
the network, or the structure of the network around them, uh, looks similar.
And the idea that allow- that- that started all this area of
learning node embeddings was that we are going to def- define a similarity,
um, of nodes based on random walks.
And we are going to ,uh,
optimize node embedding for this random-walk similarity measure.
So, uh, let me explain what- what I mean by that.
So, uh, it is important to know that this method
is what is called unsupervised or self-supervised,
in a way that when we are learning the node embeddings,
we are not utilizing any node labels.
Um, we will only be basically trying to
learn embeddings so that they capture some notion of network similarity,
but they don't need to capture the- the notion of labels of the nodes.
Uh, and we are also not- not utilizing any node features
or node attributes in a sense that if nodes are humans,
perhaps, you know, their interest,
location, gender, age would be attached to the node.
So we are not using any data,
any information attached to every node or attached to every link.
And the goal here is to directly estimate a set of coordinates of each node so
that some aspect of the network structure is preserved.
And in- in this sense,
these embeddings will be task-independent because they are
not trained on a given prediction task, um,
or a given specific, you know,
labelings of the nodes or a given specific subset of links,
it is trained just given the network itself.
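The random walks underlying this similarity notion are easy to simulate; here is a minimal uniform random walk on a toy graph (the adjacency list is made up, and DeepWalk and node2vec refine the walk strategy beyond this):

```python
import random

# Toy undirected graph as an adjacency list (made-up example).
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

def random_walk(graph, start, length, seed=0):
    """Simulate a fixed-length random walk that moves to a uniformly
    random neighbor of the current node at every step."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

walk = random_walk(graph, start=0, length=5)
```

Nodes that co-occur frequently on such walks are treated as similar, and their embeddings are pushed together.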