Stanford CS224W: Machine Learning with Graphs | 2021 | Lecture 1.3 - Choice of Graph Representation
Summary
TLDRThis lecture from Stanford's CS 224W course on machine learning with graphs covers essential concepts in graph theory. It introduces the basic components of graphs: nodes (vertices) and edges (links), and discusses different graph representations, such as directed, undirected, and bipartite graphs. The speaker explains the importance of choosing the right graph representation for different types of data, such as social networks, protein interactions, and citation networks. Additionally, the lecture explores topics like adjacency matrices, node degrees, graph sparsity, and graph connectivity, including strong and weak connections in directed graphs.
Takeaways
- 📊 A network or graph consists of nodes (entities) and edges (interactions) that define the relationships between nodes.
- 🔄 Directed graphs have edges with specific directions, while undirected graphs represent reciprocal relationships without direction.
- 📈 Node degree refers to the number of edges connected to a node, with undirected graphs counting each edge twice.
- ⚖️ Bipartite graphs are composed of two distinct sets of nodes where edges only exist between nodes from different sets, such as authors linked to papers or customers linked to products.
- 🧠 Choosing the correct graph representation (nodes and edges) is crucial for accurate machine learning predictions in different contexts, from social networks to biological systems.
- 🔗 Strongly connected components in directed graphs mean that there is a path from every node to every other node in the component.
- 📋 Graphs can be represented by adjacency matrices, edge lists, or adjacency lists, each having different strengths depending on the size and sparsity of the network.
- 🧩 Sparse networks, common in real-world applications, have few edges compared to the number of possible connections, impacting matrix representations and graph analysis.
- 📚 Graphs can have attributes associated with nodes and edges, such as weights, rankings, or types, which can add depth to the data represented by the network.
- 🎯 The connectivity of a graph (whether it's connected or disconnected) is crucial in understanding how nodes interact and in identifying isolated or clustered components.
Q & A
What are the two main components of a graph or network?
-A graph or network is composed of nodes (or vertices), which represent objects or entities, and edges (or links), which represent interactions or relationships between these nodes.
Why is the choice of graph representation important?
-The choice of graph representation is crucial because it determines how well the graph captures the underlying relationships of the data. A proper representation allows for more accurate predictions and insights, while an improper representation might lead to misleading or less meaningful conclusions.
What are directed and undirected graphs, and when are they used?
-Directed graphs have edges with a specific direction, indicating a one-way relationship between nodes (e.g., phone calls, financial transactions). Undirected graphs have bidirectional edges, representing reciprocal relationships (e.g., friendships, collaborations).
What is the significance of node degree in a graph?
-Node degree refers to the number of edges connected to a node. In undirected graphs, it is the total number of connections. In directed graphs, it is split into in-degree (edges pointing to the node) and out-degree (edges originating from the node). The node degree helps to understand the node's connectivity and importance in the network.
What is a bipartite graph, and can you give an example?
-A bipartite graph consists of two types of nodes, where edges only connect nodes of different types. An example is an author-paper graph, where authors are linked to the papers they have written, but authors are not directly connected to each other.
How does projecting a bipartite graph work?
-Projecting a bipartite graph involves creating a new graph using only one type of node. Nodes are connected if they share at least one neighbor in the original bipartite graph. For example, projecting an author-paper graph can result in a co-authorship network where authors are connected if they have co-authored a paper.
What are the three common ways to represent a graph computationally?
-The three common ways are: 1) Adjacency matrix: a square matrix indicating connections between nodes; 2) Edge list: a list of node pairs representing edges; 3) Adjacency list: a list of each node's neighbors. Adjacency matrices are good for dense networks, while adjacency lists are better for large, sparse networks.
What is the concept of graph sparsity, and why is it important?
-Graph sparsity refers to the fact that most real-world networks have a low average degree compared to the total number of nodes, meaning they have few connections relative to the number of possible connections. This sparsity is important because it affects the efficiency of graph storage and computation.
What are strongly connected components in directed graphs?
-Strongly connected components are subsets of nodes in a directed graph where there is a directed path between any two nodes within the subset. This means that every node can be reached from every other node in the component.
What role do attributes play in graphs, and how can they be represented?
-Attributes provide additional information about nodes, edges, or the entire graph. For example, an edge can have a weight indicating the strength of a relationship. These attributes can be represented in the adjacency matrix, where entries can hold values other than binary to indicate connection strengths.
Outlines
🧠 Introduction to Graph Representations and Network Components
The instructor discusses the components of a graph, which consist of nodes (also called vertices) and edges (or links). These can represent various entities such as people, actors, or molecules, and the relationships between them. The same mathematical representation of a graph can be applied across different domains, whether analyzing social networks, protein interactions, or collaboration networks. The choice of graph representation greatly influences the quality of analysis, making it crucial to decide on how to define nodes and edges based on the dataset and domain.
🔗 Understanding Node Degree and Bipartite Graphs
This paragraph explains node degree in both undirected and directed graphs. In undirected graphs, node degree is the number of edges connected to a node, while in directed graphs, in-degree and out-degree indicate edges pointing to and away from the node, respectively. The concept of bipartite graphs is introduced, where nodes are split into two groups, and edges only connect nodes between these groups. Examples include scientific papers and authors, or customers and products.
📊 Graph Projections and Adjacency Matrices
The instructor discusses how bipartite graphs can be projected onto one of the partitions to create collaboration networks or co-rating networks. This projection creates connections between nodes in one partition based on shared neighbors. The paragraph also introduces adjacency matrices, a method to represent graphs using a matrix format. The matrix is binary, with 1s indicating connections between nodes and 0s for no connection. In undirected graphs, this matrix is symmetric, while in directed graphs it is not.
🔢 Graph Sparsity and Representation
The concept of sparse networks is highlighted, where most nodes have very few connections, resulting in sparse adjacency matrices. This characteristic is common in real-world networks such as social networks or scientific collaboration networks. Sparse matrices require special representation methods, such as edge lists or adjacency lists, to efficiently handle large, sparse graphs. Adjacency lists are particularly useful for analyzing large, sparse networks because they allow quick retrieval of a node's neighbors.
⚖️ Weighted Graphs, Self-loops, and Connectivity
This paragraph introduces weighted graphs, where edges can have weights representing the strength of a relationship. It also discusses self-loops, where nodes connect to themselves, and multi-graphs, which allow multiple edges between nodes. Connectivity is another important concept, where a graph is connected if all nodes can be reached via paths. In directed graphs, weak and strong connectivity are differentiated, with strong connectivity requiring a path between every pair of nodes.
📐 Summary of Graph Representations and Theories
The lecture concludes by summarizing various graph representation choices, including directed and undirected graphs, bipartite graphs, weighted graphs, and adjacency matrices. Important graph theory concepts such as connectivity, weak and strong connectivity, and node degree are also revisited. The importance of selecting the right graph representation based on the problem and dataset is emphasized, influencing the types of machine learning tasks and predictions that can be performed using graph structures.
Mindmap
Keywords
💡Graph
💡Nodes
💡Edges
💡Directed vs. Undirected Graphs
💡Adjacency Matrix
💡Node Degree
💡Bipartite Graph
💡Sparse Matrix
💡Weighted Graph
💡Strongly Connected Components
Highlights
Introduction to graph representation and its importance in machine learning with graphs.
A graph is composed of two main components: nodes (or vertices) and edges (or links).
Graph representations can vary depending on the problem; nodes and edges represent different objects and relationships in various datasets.
The choice of a proper graph representation is crucial for successful analysis and predictions.
Graphs can be directed or undirected. Directed graphs are useful for modeling relationships with direction, like phone calls or financial transactions.
Node degree refers to the number of edges adjacent to a given node, and it can be calculated differently for directed and undirected graphs.
Bipartite graphs consist of two different types of nodes, where connections only occur between different node types, e.g., authors linked to papers or customers linked to products.
Projected or folded networks are created from bipartite graphs, such as collaboration networks formed by co-authorship between authors.
Adjacency matrices represent graphs as square matrices where the presence or absence of edges is indicated by binary values (0 or 1).
Real-world networks tend to be sparse, meaning large portions of the adjacency matrix will be empty (no connections).
Sparse graphs are efficiently represented using edge lists or adjacency lists, making it easier to manipulate and analyze.
Nodes and edges can have attributes, such as weights, directions, or other properties, which can be directly represented in the adjacency matrix.
Self-loops and multi-graphs are special graph types where a node may have edges connected to itself, or multiple edges between node pairs.
Connectivity in graphs determines whether nodes can be reached from one another, with concepts of strong and weak connectivity in directed graphs.
Machine learning applications of graphs span across node, edge, and graph-level prediction tasks, all depending on the chosen representation.
Transcripts
In this part of, uh, Stanford, CS 224 W, um,
machine-learning with graphs course,
I wanna talk about the choice of graph representation.
[NOISE] So what are components of a graph or a network?
So network is composed of two types of objects.
We- first, we have objects or entities themselves called,
uh, referred to as nodes, uh,
and vertices, and then we have interactions or edges between them,
uh, called links or, uh, edges.
And then the entire system, the entire, um,
domain we then call a- a network,
uh, or a graph.
Usually, for nodes, we will use, uh, uh,
the word- the letter capital N or capital V,
um, and then for edges,
we- we are usually using the-
the letter capital E so that the graph G is then composed of a set of nodes,
uh, N and a set of edges, uh,
E. What is important about graphs is that graphs are a common language,
meaning that I can take, for example, uh,
actors and connect them based on which movies they appeared in,
or I can take people based on the relationships they have with each other,
or I can take molecules, like proteins,
and build a network based on which proteins interact with each other.
If I look at what is the structure of this network,
what is the underlying mathematical representation,
in all these cases,
we have the same underlying mathematical representation,
which means that the same machine learning algorithm will be
able to make predictions be it that these nodes,
um, uh, correspond to actors, correspond to, uh,
people, or they correspond to molecules like proteins.
[NOISE] Of course, choosing a proper graph representation is very important.
So for example, if you have a set of people,
we can connect individuals that work with each
other and we will have a professional network.
However, we can also take the same set of individuals
and connect them based on sexual relationships, but then,
we'll ab- creating a sexual network, or for example,
if we have a set of scientific papers,
we can connect them based on citations,
which paper cites which other paper.
But for example, if we were to connect them based
on whether they use the same word in the title,
the- the quality of underlying network and the underlying,
uh, representations might be, uh, much worse.
So the choice of what the- the nodes are and what the links are is very important.
So whenever we are given a data set,
then we need to decide how are we going to design the underlying graph,
what will be the objects of interest nodes,
and what will be the relationships between them,
what will be the edges.
The choice of this proper network representation of a given domain or a given problem
deter- will determine our ability to use networks, uh, successfully.
In some cases, there will be a unique,
unambiguous way to represent this, um, problem,
this domain as a graph,
while in other cases,
this representation may, by no means, be unique.
Um, and the way we will assign links between the objects will determine, uh,
the nature of the questions we will be able to study and the nature of the,
um, predictions we will be able to make.
So to show you some examples of design choices we are faced with when co-creating graphs,
I will now go through some concepts and different types of graphs,
uh, that we can- that we can create from data.
First, I wi- I will distinguish between directed and undirected graphs, right?
Undirected graphs have links,
um, that- that are undirected, meaning,
that they are useful for modeling symmetric or reciprocal relationships,
like collaboration, friendship, um,
and interaction between proteins,
and so on, while directed, um,
relationships are captured by directed links,
where every link has a direction, has a source,
and has a destination denoted by a- by an arrow.
And examples of these types of, um,
links occurring in real-world would be phone calls,
financial transactions, uh, following on Twitter,
where there is a source and there is a destination.
The second type of, um- um, uh,
graphs that we are going to then talk about is that as we have, um,
created undirected graphs, then,
um, we can talk about the notion of a node degree.
And node degree is simply the number of edges,
um, adjacent to a given, uh, node.
So for example, the node a in this example has degree 4.
The average node degree is simply the- is
simply the average over the degrees of all the nodes in the network.
And if- if you work this out,
it turns out to be twice number of edges divided by the number of nodes,
uh, in the network.
The reason there is this number 2 is
because when we are computing the degrees of the nodes,
each edge gets counted twice, right?
Each endpoint of the n- of the edge gets
counted once because the edge has two end points,
every edge gets counted twice.
This also means that having a self edge or self-loop, um,
adds a degree of two to the node,
not a degree of one to the node because both end points attach to the same, uh, node.
This is for undirected networks.
In directed networks, we distinguish between, uh,
in-degree and out-degree, meaning
in-degree is the number of edges pointing towards the node.
For example, node C has in-degree 2 and the out-degree, um, 1,
which is the number of edges pointing outside- outward
from the- from the node, uh, c. Um,
another, uh, very popular type of graph structure
that is- that is used a lot and it's very natural in different domains,
it's called a bipartite graph.
And bipartite graph is a graph generally of nodes of two different types,
where nodes only interact with the other type of node,
but not with each other.
So for example, a bipartite graph is a graph where nodes can be split
into two partitions and the- the edges only go from left,
uh, to the right partition and not inside the same partition.
Examples of, uh, bipartite graphs that naturally occur are,
for example, uh, scientific authors linked to the papers they authored,
actors linked to the movies they appeared in,
users linked to the movies they rated or watched,
um, and so on.
So- or for example,
customers buying products, uh,
is also a bipartite graph where we have a set of customers,
a set of products,
and we link, uh, customer to the product, uh, she purchased.
Now that we have defined a bipartite network,
we can also define the notion of a folded or projected network, where we can create,
for example, author collaboration networks,
or the movie co-rating network.
And the idea is as follows: if I have a bipartite graph,
then I can project this bipartite graph to either to the left side or to the right side.
And when- and when I project it, basically,
I only use the nodes from one side in my projection graph,
and the way I connect the nodes is to say,
I will create a connection between a pair of nodes
if they have at least one neighbor in common.
So if these are authors and these are scientific papers,
then basically, it says,
I will create a co- collaboration or a co-authorship graph where I will
connect a pair of authors if they co-authored at least one paper in common.
So for example, 1, 2,
and 3 co-authored this paper,
so they are all connected with each other.
For example, 3 and 4 did not co-author a paper,
so there is no link between them.
But for example, 5 and 2 co-authored a paper,
so there is a link between them because they co-authored this,
uh, this paper here.
And in analogous way,
you can also create a projection of
this bipartite network to the- to the right-hand side,
and then you will- you would obtain a graph like this.
And as I said, bipartite graphs or multipartite graphs,
if you have multiple types of edges,
are very popular, especially,
if you have two different types of nodes,
like users and products, um,
uh, users and movies, uh,
authors and papers, um,
and so on and so forth.
[NOISE] Another interesting point about graphs is how do we represent them,
um, and representing graphs,
uh, is an interesting question.
One way to represent a graph is to represent it with an adjacency matrix.
So essentially, if for a given,
uh, undirected, for example, graph,
in this case on end nodes, in our case,
4, we will create a square matrix,
where this matrix will be binary.
It will o- only take entries of 0 and 1.
And essentially, an entry of matrix ij will be set to 1 if nodes i and j are connected,
and it will be set to 0 if they are not connected.
So for example, 1 and 2 are connected,
so at entry 1, row 1,
column 2, there is a 1.
And also, because 2 is connected to 1 at row 2,
column 1, we also have a 1.
So this means that adjacency matrices of,
uh, undirected graphs are naturally symmetric.
If the graph is directed,
then the matrix won't be symmetric because 2 links to 1.
We have a 1 here,
but 1 does not link back to 2,
so there is a 0.
Um, and in similar way,
we can then think of node degrees, um, uh,
simply as a summation across a given row or
across a given one column of the graph, uh, adjacency matrix.
So rather than kind of thinking here how many edges are adjacent,
we can just go and sum the- basically,
count the number of ones,
number of other nodes that this given node is connected to.
Um, this is for, um, undirected graphs.
For directed graphs, uh,
in and out degrees will be sums over columns and sums over rows, uh,
of the graph adjacency matrix,
as- as I illustrate here, uh,
with this, um, illustration.
One important consequence of a real-world network is that they are extremely sparse.
So this means if you would look at the adjacency matrix,
series on adjacency matrix of a real-world network where basically for every, um, row I,
column J, if there is an edge,
we put a dot and otherwise the cell is empty, uh,
you get these types of super sparse matrices where,
where there are large parts of the matrix that are empty, that are white.
Um, and this has important consequences for properties
of these matrices because they are extremely, uh, sparse.
To show you an example, right?
Uh, if you have a network on n nodes,
nodes, then the maximum degree of a node,
the number of connections a node has is n minus one
because you can connect to every oth- in principle,
connect to every other node in the network.
So for example, if you are a human and you think about human social network, uh,
the maximum degree that you could have,
the maximum number of friends you could have is every other human in the world.
However, nobody has seven billion friends, right?
Our number of friendships is much, much smaller.
So this means that, let's say the human social network is extremely sparse,
and it turns out that a lot of other,
uh, different types of networks,
you know, power-grids, uh, Internet connection,
science collaborations, email graphs,
uh, and so on and so forth are extremely sparse.
They have average degree that these, you know,
around 10 maybe up to, up to 100.
So, uh, what is the consequence?
The consequence is that the underlying adjacency matrices,
um, are extremely sparse.
So we would never represent the matrix as a dense matrix,
but we've always represent it as a sparse matrix.
There are two other ways to represent graphs.
One is simply to represent it as a edge list,
simply as a list of edges.
Uh, this is a representation that is quite popular, um,
in deep learning frameworks because we can simply
represent it as a two-dimensional matrix.
The problem of this representation is that it is very
hard to do any kind of graph manipulation or
any kind of analysis of the graph because even
computing a degree of a given node is non-trivial,
uh, in this case.
A much, uh, better, uh,
representation for a graph analysis and manipulation is the notion of adjacency list.
Um, and adjacency lists are good because they are easier to
work with if for large and sparse networks.
And adjacency list simply allows us to
quickly retrieve al- all the neighbors of a given node.
So you can think of it, that for every node,
you simply store a list of its neighbors.
So a list of nodes that the,
that the- a given node is connected to.
If the graph is undirected,
you could store, uh, neighbors.
If the graph is connected,
you could store both the outgoing neighbors,
as well as, uh, incoming neighbors based on the direction of the edge.
And the last important thing I want to mention here is that of course,
these graph can- can have attached attributes to them.
So nodes address, as well as
entire graphs can have attributes or properties attached to them.
So for example, an edge can have a weight.
How strong is the relationship?
Perhaps it can have my ranking.
It can have a type.
It can have a sign whether this is a friend-based relationship or whether it's animosity,
a full distrust, let say based relationships.
Um, and edges can have di- many different types of properties,
like if it's a phone call, it's,
it's duration, for example.
Nodes can have properties in- if these are people,
it could be age, gender,
interests, location, and so on.
If a node is a, is a chemical,
perhaps it is chemical mass,
chemical formula and other properties of the- of
the chemical could be represented as attributes of the node.
And of course, also entire graphs can have features or, uh,
attributes based on, uh,
the properties of the underlying object that the graphical structure is modeling.
So what this means is that the graphs you will be considering
are not just the topology nodes and edges,
but it is also the attributes,
uh, attached to them.
Um, as I mentioned,
some of these properties can actually be
represented directly in the adjacency matrix as well.
So for example, properties of edges like
weights can simply be represented in the adjacency matrix, right?
Rather than having adjacency matrix to be binary,
we can now have adjacency matrix to have real values where
the strength of the connection corresponds simply to the value,
uh, in that entry.
So two and four are more strongly linked,
so the value is four,
while for example, one and three are linked with
a weak connection that has weight only 0.5.
Um, as a- um,
another important thing is that when we create the graphs is that we also
can think about nodes having self-loops.
Um, for example, here,
node four has a self-loop, uh,
and now the degree of node four equals to three.
Um, self-loops are simply correspond to
the entries on the diagonal of the adjacency matrix.
And in some cases,
we may actually create a multi-graph where we
allow multiple edges between a pair of nodes.
Sometimes we can, we can think of a multi-graph as
a weighted graph where the entry on the matrix counts the number of edges,
but sometimes you want to represent every edge individually,
separately because these edges might have different properties,
um, and different, um, attributes.
Both, um, the self-loops,
as well as multi-graphs occur quite frequently in nature.
Uh, for example, if you think about phonecalls transactions,
there can be multiple transactions between a pair of nodes
and we can accurately represent this as a multi-graph.
Um, as we have these graphs, I,
I also want to talk about the notion of connectivity,
in a sense, whether the graph is connected or disconnected.
And graph is connected if any pair of nodes in, uh, in this, uh,
graph can be, can be connected via a path along the edges of the graph.
So for example, this particular graph is
connected while this other graph is not connected,
it has three connected components.
This is one connected component, second connected component,
then a third connected component,
the node h, which is an isolated node.
This is the notion of connectivity for undirected graphs, uh,
and what is interesting in this notion is,
that when we, um,
have graphs that are,
for example, disconnect it and look at what is
the structure of the underlying adjacency matrix,
we will have these block diagonal structure, where, basically,
if this is a graph that is composed of two components, then we will have,
um, um, block diagonal structure where the edges only go between the,
um, nodes inside the same, um, connected component,
and there is no edges in the off-diagonal part,
which would mean that there is no edge between,
uh, red and blue,
uh, part of the graph.
The notion of connectivity also generalizes to directed graphs.
Here, we are talking about two types of connectivity,
strong and weak connectivity.
A weakly connected directed graph is simply a graph that is connected,
uh, in- if we ignore the directions of the edges.
A strongly connected graph, um,
or a graph is strongly connected if for every pair of
nodes there exists a directed path between them.
So, um, this means that there has to exist a directed path from, for example,
from node A to node B,
as well as from node B back to, uh,
node A if the graph is strongly connected.
What this also means is that we can talk about notion of
strongly connected components where strongly connected components are,
uh, sets of nodes in the graph, uh,
such that every node, uh,
in that set can visit each other via the- via a directed path.
So for example, in this case here,
nodes, uh, A, B,
and C form a strongly connected component because they are on a cycle.
So we ca- any- from any node we can visit, uh, any other node.
Uh, the example here shows, uh,
directed graph with two strongly connected component,
again, two cycles on, um three nodes.
So this concludes the discussion of the- er- the graph representations,
um, that- and ways how we can create graphs from real data.
Um, in this lecture,
we first talked about machine-learning with graphs and various applications in use cases.
We talked about node level, edge level,
and graph level machine-learning prediction tasks.
And then we discussed the choice of a graph representation in terms of directed,
undirected graphs, bipartite graphs,
weighted, uh, unweighted graphs,
adjacency matrices, as well as some definitions from graph theory,
like the connectivity, um, of graphs,
weak connectivity, strong connectivity,
as well as the notion of node degree.
Um, thank you very much.
Посмотреть больше похожих видео
AQA A’Level Graphs
Projeto e Análise de Algoritmos - Aula 11 - Conceitos básicos e representação de grafos
Stanford CS224W: Machine Learning with Graphs | 2021 | Lecture 1.1 - Why Graphs
Graph Neural Networks: A gentle introduction
Stanford CS224W: Machine Learning with Graphs | 2021 | Lecture 3.3 - Embedding Entire Graphs
6.10 Topological Sorting (with Examples) | How to find all Topological Orderings of a Graph
5.0 / 5 (0 votes)