Stanford CS224W: ML with Graphs | 2021 | Lecture 2.3 - Traditional Feature-based Methods: Graph
Summary
TLDRThis lecture introduces graph-level features and kernels for predicting entire graphs' structures. It explains the concept of kernels that measure similarity between graphs without needing an explicit feature vector. The lecture discusses graphlet and Weisfeiler-Lehman kernels, which represent graphs as 'bags' of graphlets or colors, respectively. The Weisfeiler-Lehman kernel, in particular, is highlighted for its efficiency and effectiveness in graph similarity tasks, using a color refinement algorithm to enrich node features based on neighborhood information.
Takeaways
- 📊 **Graph-Level Features**: The lecture focuses on features that characterize the entire structure of a graph, not just individual nodes or edges.
- 🔍 **Kernel Methods**: Kernel methods are introduced as a technique to make predictions for entire graphs by designing a kernel rather than a feature vector.
- 🤖 **Graph Kernels**: A graph kernel measures the similarity between two graphs and is used in machine learning for graph-level prediction.
- 📏 **Positive Semi-Definite**: For a kernel to be valid, its kernel matrix must be positive semi-definite, meaning it has positive eigenvalues and is symmetric.
- 🔑 **Feature Representation (Phi)**: Kernels imply the existence of a feature representation such that the kernel between two graphs is the dot product of their feature vectors.
- 🛍️ **Bag-of-Words Model**: Graphs can be represented as a 'bag-of-words' model where nodes are considered as words and their frequencies are counted.
- 🔄 **Graphlet Kernel**: The graphlet kernel represents a graph as a count of different graphlets (small subgraphs) within the graph.
- 🚀 **Efficiency of Weisfeiler-Lehman (WL) Kernel**: The WL kernel is highlighted for its efficiency and effectiveness in measuring graph similarity through iterative color refinement.
- 🚩 **Normalization**: Graphlet kernel features are normalized to account for graph size and density, making them more representative.
- 🚨 **Limitations**: Counting graphlets is computationally expensive, leading to limitations in scalability for larger graphs.
- 🌐 **Graph Representation**: The lecture concludes that representing graphs as bags of graphlets or colors is crucial for traditional machine learning on graphs.
Q & A
What are graph-level features?
-Graph-level features are characteristics that describe the overall structure of an entire graph, as opposed to individual nodes or edges.
What is the purpose of graph kernels?
-Graph kernels are designed to measure the similarity between two graphs or the similarity between different data points in general, allowing for predictions for entire graphs.
What is the significance of a kernel matrix in graph analysis?
-A kernel matrix measures the similarity between all pairs of data points or graphs and must be positive semi-definite, meaning it has positive eigenvalues and is symmetric.
How does the graphlet kernel represent a graph?
-The graphlet kernel represents a graph as a count of the number of different graphlets within the graph, which are small subgraphs of a certain size.
What is the difference between graphlets in node-level features and graphlet kernels?
-In node-level features, graphlets do not need to be connected and are not rooted, whereas in graphlet kernels, they are considered as connected subgraphs of a certain size.
Why is counting graphlets computationally expensive?
-Counting graphlets is computationally expensive because it takes time exponential in the size of the graphlet and polynomial in the number of nodes in the graph.
What is the Weisfeiler-Lehman graph kernel and how does it work?
-The Weisfeiler-Lehman graph kernel is an efficient graph feature descriptor that uses a neighborhood structure to iteratively enrich node vocabulary through a color refinement process.
How does the Weisfeiler-Lehman kernel handle the similarity between graphs?
-The Weisfeiler-Lehman kernel counts the number of nodes with a given color after several iterations of color refinement and measures similarity by taking the dot product of these color count vectors for two graphs.
What is the computational complexity of the Weisfeiler-Lehman graph kernel?
-The computational complexity of the Weisfeiler-Lehman graph kernel is linear in the number of edges in the two graphs, making it extremely fast and efficient.
How does the bag-of-words representation relate to graph kernels?
-The bag-of-words representation is analogous to representing a graph as a bag of graphlets or colors, which is used in graph kernels to define a feature vector for a given graph.
What are some other graph kernels mentioned in the script?
-Other graph kernels mentioned include the random-walk kernel and the shortest-path kernel, though they are not discussed in detail within the scope of the lecture.
Outlines
🌐 Graph-Level Features and Kernels
The paragraph introduces the concept of graph-level features and graph kernels for predicting the properties of entire graphs, as opposed to individual nodes or edges. It explains the need for features that can characterize the structure of a whole graph. The lecturer uses an example graph with two loosely connected parts to illustrate the idea. The paragraph then delves into kernel methods, which are widely used in machine learning for graph-level prediction. Kernels are described as functions that measure the similarity between two graphs, or data points in general, and the importance of the kernel matrix being positive semi-definite and symmetric is emphasized. The concept of a feature representation Phi is introduced, which allows the kernel to be computed as the dot product of the feature vectors of two graphs. The paragraph concludes by mentioning that kernel methods can be used with existing machine learning models like kernel support vector machines for predictions.
🔑 Kernels and Graph Representation
This paragraph discusses the concept of kernels in more detail, focusing on their use in graph-level tasks. It explains that kernels can be thought of as a 'bag-of-words' type representation for graphs, drawing an analogy with text documents. The paragraph points out the limitations of using simple node counts as features, as it does not capture the structural differences between graphs. It then introduces the idea of using more complex features like node degrees to represent graphs. The concept of the graphlet kernel is introduced, which represents a graph as a count of different graphlets within the graph. The paragraph also explains the normalization of graphlet count vectors to account for graph size and density. The computational expense of counting graphlets is acknowledged as a limitation of the graphlet kernel.
🔍 Weisfeiler-Lehman Graph Kernel
The paragraph introduces the Weisfeiler-Lehman graph kernel as an efficient alternative to the graphlet kernel. It describes the process of iteratively enriching node vocabulary using neighborhood structures through a color refinement algorithm. The algorithm starts by assigning an initial color to each node and then iteratively aggregates and hashes colors from neighbors to create new colors. This process is likened to a multi-hop generalization of node degrees. The paragraph explains how this method can distinguish between graphs with similar structures but subtle differences. It also highlights the computational efficiency of the Weisfeiler-Lehman kernel, which has a linear time complexity relative to the size of the graph, making it practical for use in machine learning tasks.
📊 Summary of Traditional Graph Machine Learning
The final paragraph summarizes the lecture's discussion on traditional machine learning approaches to graph data. It reviews the different types of features discussed: node-level features like degree, centrality, and graphlets; edge-level features based on distance and neighborhood overlap; and graph-level features using graph kernels. The paragraph emphasizes the computational efficiency and effectiveness of the Weisfeiler-Lehman kernel for measuring graph similarity. It concludes by thanking the audience and noting that the lecture has covered various scalable and insightful ways to create feature vectors from nodes, links, and entire graphs.
Mindmap
Keywords
💡Graph-level features
💡Kernel methods
💡Kernel matrix
💡Graph kernels
💡Graphlet kernel
💡Weisfeiler-Lehman kernel
💡Bag-of-words
💡Feature vector
💡Color refinement
💡Graph isomorphism
💡Machine learning models
Highlights
Introduction to graph-level features and graph kernels for predicting entire graphs.
Goal is to create features that characterize the structure of an entire graph.
Kernel methods are widely used for graph-level prediction in traditional machine learning.
A kernel measures similarity between graphs or data points.
Kernel matrix is a matrix that measures similarity between all pairs of graphs.
For a kernel to be valid, its kernel matrix must be positive semi-definite.
Kernels have a feature representation such that the kernel value is the dot product of the feature vectors.
Kernels can be computed without explicitly creating the feature representation.
Graph kernels allow the use of off-the-shelf machine learning models for predictions.
Different graph kernels discussed: graphlet kernel and Weisfeiler-Lehman kernel.
Graph kernels provide competitive performance in graph-level tasks.
Key idea behind kernels is to define a feature vector for a given graph.
The graphlet kernel represents a graph as a count of different graphlets in the graph.
Graphlets in the graphlet kernel are not rooted and don't have to be connected.
Graphlet kernel defines the graphlet count vector as the number of instances of a given graphlet.
Graphlet kernel is defined as the dot product between the graphlet count vectors of two graphs.
Normalization of the feature vector representation is common due to different graph sizes.
Graphlet kernel has an important limitation due to the high computational cost of counting graphlets.
Weisfeiler-Lehman graph kernel is introduced as a more efficient alternative.
Weisfeiler-Lehman kernel uses a neighborhood structure to iteratively enrich node vocabulary.
The algorithm generalizes node degrees to multi-hop neighborhood information.
Weisfeiler-Lehman kernel is based on the color refinement algorithm.
The kernel counts the number of nodes with a given color after color refinement.
Weisfeiler-Lehman kernel is computationally efficient with linear time complexity.
Weisfeiler-Lehman kernel is closely related to graph neural networks.
Conclusion of the lecture on traditional machine learning approaches to graphs.
Transcripts
So far we discussed node and edge level features,
uh, for a prediction in graphs.
Now, we are going to discuss graph-level features and
graph kernels that are going to- to allow us to make predictions for entire graphs.
So the goal is that we want one
features that characterize the structure of an entire graph.
So for example, if you have a graph like I have here,
we can think about just in words,
how would we describe the structure of this graph,
that it seems it has, kind of,
two loosely connected parts that there are quite a few edges, uh,
ins- between the nodes in each part and that there is
only one edge between these two different parts.
So the question is,
how do we create the feature description- descriptor that will allow us to characterize,
uh, the structure like I just, uh, explained?
And the way we are going to do this,
is we are going to use kernel methods.
And kernel methods are widely used for
traditional machine learning in, uh, graph-level prediction.
And the idea is to design a kernel rather than a feature vector.
So let me tell you what is a kernel and give you a brief introduction.
So a kernel between graphs G,
and G', uh, returns a real value,
and measures a similarity between these two graphs,
or in general, measure similarity between different data points.
Uh, kernel matrix is then a matrix where
simply measures the similarity between all pairs of data points,
or all pairs of graphs.
And for a kernel to be a valid kern- kernel this ma- eh,
kernel matrix, uh, has to be positive semi-definite.
Which means it has to have positive eigenvalues,
for exam- and- and- as a consequence,
it has to be, symet- it is a symmetric, uh, ma- matrix.
And then what is also an important property of kernels,
is that there exist a feature representation, Phi,
such that the kernel between two graphs is simply a feature representation,
uh, wa-, uh, of the first graph dot product
with the feature representation of the second graph, right?
So Phi of G is a vector,
and Phi of G is a- is another vector,
and the value of the kernel is simply a dot product of this vector representation,
uh, of the two, uh, graphs.
Um, and what is sometimes nice in kernels,
is that this feature representation, Phi,
doesn't even need to- to be explicitly created for us to be able to compute the value,
uh, of the kernel.
And once the kernel is defined,
then off-the-shelf machine learning models,
such as kernel support vector machines,
uh, can be used to make, uh, predictions.
So in this le- in this part of the lecture,
we are going to discuss different, uh,
graph kernels, which will allow us to measure similarity between two graphs.
In particular, we are going to discuss
the graphlet kernel as well as Weisfeiler-Lehman kernel.
There are oth- also other kernels that are proposed in the literature,
uh, but this is beyond the scope of the lecture.
For example, random-walk kernel,
shortest-path kernel, uh, and many, uh, others.
And generally, these kernels provide a very competitive performance in graph-level tasks.
So what is the key idea behind kernels?
The key idea in the goal of kernels is to define a feature vector,
Phi of a given graph,
G. And the- the idea is that,
we are going to think of this feature vector, Phi,
as a bag-of-words type representation of a graph.
So what is bag of words?
When we have text documents,
one way how we can represent that text document,
is to simply to represent it as a bag of words.
Where basically, we would say,
for every word we keep a count of how often that word appears in the document.
So we can think of, let's say,
words sorted alphabetically, and then,
you know, at position, i,
of this bag-of-words representation,
we will have the frequency,
the number of occurrences of word,
i, in the document.
So in our- in the same way,
and naive extension of this idea to graphs would be to regard nodes as words.
However, the problem is that since both- since graphs can have very different structure,
but the same number of nodes,
we would get the same feature vector,
or the same representation for two very different graphs, right?
So if we re- regard nodes as words,
then this graph has four nodes,
this graphs has four nodes,
so their representation would be the same.
So we need a different candidate for- for the-
for the word in this kind of bag-of-words representation.
To be, for example, a bit more expressive,
we could have what we could call,
uh, degree kernel, where we could say,
w- how are we going to represent a graph?
We are going to represent it as a bag-of-node degrees, right?
So we say, "Aha,
we have one node of degree 1,
we have three nodes of degree 2,
and we have 0 nodes of degree, uh, 3."
In the same way,
for example, uh, here,
we could be asking,
how many nodes, uh,
of different degrees do we have here?
We have 0 nodes of degree, um, 1,
we have two nodes, uh, of degree 2,
and two nodes, uh, of degree, um, 3.
So, um, and this means that now we would, er,
obtain different feature representations for these,
uh, different, uh, graphs,
and that would allow us to distinguish these different, uh, graphs.
And now, both the graphlets kernel as well as the Weisfeiler-Lehman kernel,
use this idea of bag-of-something representation of a graph where- where the star,
this something is more sophisticated than node degree.
So let's, uh, first talk about the graphlets kernel.
The idea is that writing 1,
I represented the graph as a count of the number of different graphlets in the graph.
Here, I wanna make,
uh, um important point;
the definition of graphlets for a graphlet kernel,
is a bit different than the definition of a graphlet in the node-level features.
And there are two important differences that graphlets in
the node-level features do not need to be connected,
um, and, um also that they are not, uh, uh, rooted.
So graphlets, uh, in this- in the, eh,
graphlets kernel are not rooted,
and don't have to be connected.
And to give you an example,
let me, uh, show you, uh, the next slide.
So for example, if you have a list of graphlets that we are interested
in little g_1 up to the little g_n_k,
let's say these are graphlets of size k,
then let say for k equals 3,
there are four different graphlets, right?
There are four different con- graphs on three nodes,
and directed, fully connected two edges,
one edge, and no edges.
So this is the definition of graphlets in the graph kernel.
And for example, for k equals 4,
that are 11 different graphlets,
fully connected graph all the way to the graph on four nodes without any connections.
And now, uh, given a graph,
we can simply represent it as count of the number of structures,
um, er, different graphlets that appear, uh, in the graph.
So for example, given a graph,
and the graphr- graphlet list,
we define the graphlet count vector f,
simply as the number of instances of a given graphlet that appears,
uh, in our graph of interest.
For example, if these G is our graph of interest,
then in this graph,
there resides one triangle,
there resides three different parts of land 2,
there reside six different edges with an isolated nodes, and there, uh,
exist no, uh, triplet, uh, of nodes,
uh, that are, uh, that are not connected,
uh, in this graph.
So the graphlet feature vector in this case, uh,
would be here, would have ready 1,
3, 6, uh, and 0.
Now, given two graphs,
we can define the graphlet kernel simply as the dot product between the graphlet, uh,
count vector of the first graph times,
uh, the graphlet count vector of the second graph.
Um, this is a good idea,
but actually, there is a slight problem.
The problem is that graphs G1 and G2 may have different sizes,
so the row counts will be very,
uh, different of, uh,
graphlets in different graphs.
So a common solution people apply is to normalize this feature vector representation,
uh, for the graph.
So this means that, um,
the- the graphlet, uh,
vector representation for a given graph is simply the can- the count of
individual graphlets divided by the total number of graphlets that appear in the graph.
So if the- this essentially
normalizes for the size and the density of the underlying graph,
and then the graphlet kernel is defined as the dot product between these,
um, uh, feature vector representations of graphs,
uh, uh, h that capture,
uh, the frequency or the proportion of- of our given graphlet,
um, in a- in a graph.
There is an important limitation of the graphlet graph kernel.
And the limitation is that counting graphlets is very expens- expensive.
Counting a k-size graphlet in a graph with n nodes, uh,
by enumeration takes time or the n raised to the power k. So,
um, this means that counting graphlets of
size k is polynomial in the number of nodes in the graph,
but it is exponential in the graphlet size.
Um, and this is unavoidable in the worse-case
since sub-graph isomorisic- isomorphism judging whether,
uh, a sub-graph is a- is a, uh,
isomorphic to another, uh,
graph, is, uh, NP-hard.
Um, and, uh, there are faster algorithms if,
uh, graphs node, uh,
node degree is bounded by d,
then there exist a fa- faster algorithm to count the graphlets of size k. However,
the issue still remains that counting
these discrete structures in a graph is very time consuming, um, very expensive.
So we can only count graphlets up to,
uh, you know, uh,
a handful of nodes.
And then the- and then the exponential complexity takes over and we cannot count,
uh, graphlets, uh, that are,
uh, larger than that.
Um, so the question is,
how do we design a more efficient graph kernel?
Um, and Weisfeiler-Lehman graph kernel, uh, achieves this goal.
The goal here is to design an efficient graph feature descriptor Phi of G, uh,
where the idea is that we wanna use a neighborhood structure to iteratively enrich,
uh, node, uh, vocabulary.
And, um, we generalize a version of node degrees since node degrees are
one hot- one-hop neighborhood information to multi-hop neighborhood information.
And the algorithm that achieves this is, uh, uh,
called the Weisfeiler-Lehman graph isomorphism test,
or also known as color refinement.
So, uh, let me explain, uh, this next.
So the idea is that we are given a graph G with a set of nodes V,
and we're going to assign an initial color,
um, c^0, so this is an initial color to each node.
And then we are going to iteratively, er,
aggregate or hash colors from the neighbors to invent new colors.
So the way to think of this, uh,
the new color for a given node v will be a hashed value of its own color, um,
from the previous time step and a concatenation
of colors coming from the neighbors u of the node v of interest,
where hash is basically a hash functions that
maps different inputs into different, uh, colors.
And after k steps of this color refinement,
um, um, c, uh,
capital v of, uh,
capital K of v summarizes the structure, uh,
of the graph, uh, at the level of,
uh, K-hop, uh, neighborhood.
So let me, um, give you an example, uh, and explain.
So for example, here I have two, uh,
graphs that have very similar structure but are just slightly, uh, different.
The difference is, uh, this, uh,
edge and here, um, the- the, uh,
the diagonal edge, the triangle closing edge,
um, um, is, uh, different.
So first we are going to assign initial colors to nodes.
So every node gets the same color,
every node gets a color of one.
Now we are going to aggregate neighboring colors.
For example, this particular node aggregate colors 1,1,
1, um, and, uh,
adds it to it- to itself,
while this particular node up here aggregates colors from its neighbors,
one and one, uh, and it is here.
And the same process, uh,
happens in this second,
uh, graphs- graph as well.
Now that, um, we have collected the colors,
uh, we go, uh,
and hash them, right?
So we apply a hash- hash function that takes
nodes' own color plus the colors from neighbors and produces new colors.
And let's say here the hash function for the first combination returns one,
then two, then three, uh, four, and five.
So now we color the graphs,
uh, based on these new refined colors, right?
So this is the coloring of the first graph,
and this is the coloring of the second graph based on the hash values
of the- of the aggregated colors from the first step.
Now we take these two graphs and,
again, apply the same color aggregation scheme, right?
So for example, this node, uh,
with color 4 aggregates colors from its neighbors,
so aggregates the 3, 4, and 5.
So we have 3, 4, and 5 here, while, for example,
this node here of color 2 aggregates from its neighbor,
uh, that is colored 5,
so it gets 2, 5.
And again, for this graph,
the same process happens.
Now, again, we take, um, uh, this,
uh, aggregated colors, um, and we hash them.
And let's say our hash function, uh,
assigns different, uh, new colors, uh,
to, uh, to this,
uh, colors that are,
uh, combined aggregated from the previous timesteps.
So now we can take this, uh, original, uh,
aggregated colored graph and,
uh, relabel the colors based on the hash value.
So 4,345, uh, 4, um,
er, where is, uh, er, uh,
34- uh, 345- um, is, um,
layer hashes into color 10s,
so we replace a color 10, uh, here.
And we could keep iterating this and we would come up, uh, with, uh,
more and more, uh, refinement,
uh, of the, uh,
uh, colors of the graph.
So now that we have run this color refinement for a,
uh, a fixed number of steps,
let say k iterations,
the Weisfeiler-Lehman, uh, kernel counts number of nodes with a given color.
So in our case,
we run- we run this three times,
so we have 13 different colors.. And now
the feature description for a given graph is simply the count,
the number of nodes of a given color, right?
In the first iteration,
uh, all the nodes were colored,
um- all six nodes were colored the same one- uh, the same way.
Um, so there is six instances of color 1.
Then, um, after we iter- um,
agg- aggregated the colors and refined them, you know,
there were two nodes of color 2,
one node of color 3,
two nodes of color 4, um, and so on.
So here is now the feature description in terms of color counts, uh, for, uh,
for, uh, different colors for the first graph and different colors,
uh, for the second graph.
So now that we have the feature descriptions,
the Weisfeiler-Lehman graph kernel would simply take the dot product between these two,
uh, uh, feature descriptors and return a value.
So for example, in our case,
the si- the, uh, Weisfeiler-Lehman, uh,
kernel similarity between the two graphs is the dot product between the,
uh, feature descriptors, uh, here.
These are the two, uh,
feature descriptors and we compute the dot product,
we would get a value of, uh, 49.
So WL kernel is very popular and very strong,
uh, gives strong performance,
and it is also computationally efficient because the time complexity of
this color refinement at each step is linear in the size of the graph.
It is linear in the number of edges because all that
every node has to do is collect the colors, uh, from its, uh,
neighboring nodes and- and produce- and apply
a simple hash function to- to come up with a new,
um, uh, with a new, uh, color.
When computing the kernel value,
many colors, uh, appeared in two graphs need to be tracked.
So the number of colors will be at most the number of nodes,
uh, in the network.
So this again won't be too- too large.
And counting the colors again takes linear time because it's just a sweep over the nodes.
So the- the total complexity, uh,
of computing the Weisfeiler-Lehman graph kernel between a pair of, uh,
graphs is simply linear in the number of edges in the two graphs.
So this means this is extremely, uh,
fast and actually works,
uh, really well in practice.
So to summarize the graph level features that we have discussed,
first we talked about, uh,
the notion of graph kernels,
where basically graph is represented as a bag of graphlets or a bag of, uh, colors.
Um, and, uh, when we represent the graph as a graph- uh,
as a bag of graphlets,
this is extremely- this is very expensive representation because counting the graphlets,
uh, takes time exponential in the size of the graph.
At the same time, Weisfeiler-Lehman, uh,
kernel is based on this case step color refinement algorithm that
enriches and produces new node colors that are aggregated from the,
um, colors of the immediate neighbors of the node.
And as multiple rounds of this color refinement are run,
the node kind of gathers color information from farther away,
uh, parts of the network.
So here we represent the graph as a bag of colors.
This is computationally efficient.
The time is linear in the size of the graph, um,
and it is also closely related to graph neural networks that we are going to study,
uh, later in this course.
So, um, Weisfeiler-Lehman is a really, uh,
good way to measure similarity, um,
between graphs, and in many cases,
it is, uh, very hard to beat.
So this concludes the today lecture where we talked about, um,
three different, uh, approaches to traditional,
uh, graph, uh, level, um, machine learning.
We talked about, um, handcrafted features for node-level prediction,
uh, in terms of node degree,
centrality, clustering, coefficient, and graphlets.
We talked about link-level or edge-level features,
distance-based, as well as local and global neighborhood overlap.
And then last we'd talk about how do we characterize the structure of the entire graph.
We talked about graph kernels, uh,
and in particular about graphlet kernel and the WL,
meaning Weisfeiler-Lehman graph kernel.
So this concludes our discussion of traditional machine learning approaches, uh,
to graphs and how do we create feature vectors from nodes, links, um,
and graphs, um, in a- in a scalable and interesting way. Uh, thank you very much.
Ver Más Videos Relacionados
Stanford CS224W: ML with Graphs | 2021 | Lecture 2.1 - Traditional Feature-based Methods: Node
Stanford CS224W: Machine Learning with Graphs | 2021 | Lecture 3.1 - Node Embeddings
Stanford CS224W: ML with Graphs | 2021 | Lecture 6.1 - Introduction to Graph Neural Networks
Stanford CS224W: Machine Learning with Graphs | 2021 | Lecture 3.3 - Embedding Entire Graphs
Stanford CS224W: Machine Learning with Graphs | 2021 | Lecture 1.3 - Choice of Graph Representation
Stanford CS224W: ML with Graphs | 2021 | Lecture 4.4 - Matrix Factorization and Node Embeddings
5.0 / 5 (0 votes)