Stanford CS224W: ML with Graphs | 2021 | Lecture 2.1 - Traditional Feature-based Methods: Node

Stanford Online
15 Apr 2021 | 27:30

Summary

TL;DR: This lecture delves into traditional machine learning methods for graph analysis, focusing on node-level, link-level, and graph-level prediction tasks. It emphasizes feature design, particularly structural features that capture the network's topology alongside node attributes. The instructor discusses node importance measures such as degree and centrality, and introduces graphlets to characterize local network structure. The goal is to use these handcrafted features to make effective machine learning predictions on graphs.

Takeaways

  • 📚 The lecture focuses on traditional machine learning methods in graph analysis, emphasizing the importance of feature design for predictive performance.
  • 🔬 Two main types of features are discussed: attributes of nodes and structural features that describe the topology of the network.
  • 🔑 The machine learning pipeline involves representing data points with feature vectors and training models like random forests or support vector machines to make predictions.
  • 🔍 The lecture is divided into three parts: features for node-level prediction, link-level prediction, and graph-level prediction.
  • 🌐 The importance of considering the network's relational structure is highlighted for accurate predictions, especially through handcrafted features.
  • 📈 Node-level tasks involve semi-supervised learning, where the model predicts the labels of unlabeled nodes based on the structure and attributes of the network.
  • 🔑 Centrality measures such as degree, eigenvector, betweenness, and closeness centrality are introduced to capture the importance or position of nodes within the network.
  • 🤝 The clustering coefficient measures how connected a node's neighbors are, indicating the presence of triangles in the network.
  • 🌀 Graphlets are defined as rooted, connected, induced, non-isomorphic subgraphs, and the Graphlet Degree Vector is a feature that counts the instances of these subgraphs in a node's local neighborhood.
  • 📊 The Graphlet Degree Vector provides a detailed signature of a node's local network topology, allowing fine-grained comparison of node neighborhoods.
  • 🛠 The discussed features are useful for tasks such as predicting node roles in networks or identifying influential nodes in social networks.

Q & A

  • What are the three main levels of tasks in graph-based machine learning discussed in the script?

    -The three main levels of tasks are node-level prediction tasks, link-level or edge-level prediction tasks, and graph-level prediction tasks.

  • What is the importance of designing proper features in a traditional machine learning pipeline for graphs?

    -Proper feature design is crucial as it allows for more accurate predictions by capturing the attributes and properties of the nodes, as well as the topology of the network.

  • What are the two types of features considered for nodes in a graph?

    -The two types of features are attributes associated with the nodes themselves, such as chemical properties in a protein interaction network, and additional features that describe the node's position and local network structure.

  • How does the traditional machine learning pipeline represent data points for model training?

    -In the traditional machine learning pipeline, data points like nodes, links, and entire graphs are represented with vectors of features, which are then used to train a classifier or model.

  • What is the role of a random forest classifier in the context of the machine learning pipeline for graphs?

    -A random forest classifier is one of the classical machine learning models that can be trained using the feature vectors of nodes, links, or graphs to make future predictions.
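To make the two steps concrete, here is a minimal sketch on a hypothetical toy graph. The degree feature and a plain 1-nearest-neighbor rule stand in for the real pipeline; in practice you would hand the same feature vectors to a random forest or support vector machine from a library such as scikit-learn.

```python
# Hypothetical toy graph as an adjacency dict (undirected).
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

# Step 1: represent each node with a feature vector (here just its degree).
features = {v: [len(adj[v])] for v in adj}

# Step 2: train a model on the labeled nodes. As a stand-in for a random
# forest, a 1-nearest-neighbor rule over the feature vectors:
labeled = {"A": "green", "D": "red"}  # hypothetical training labels

def predict(v):
    # Pick the label of the labeled node with the closest feature vector.
    nearest = min(labeled, key=lambda u: abs(features[u][0] - features[v][0]))
    return labeled[nearest]

pred = {v: predict(v) for v in adj if v not in labeled}
```

The point is the shape of the pipeline, not the classifier: once every node is a vector, any off-the-shelf model can be trained and later applied to newly arriving nodes.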

  • What is a node's degree and how is it used as a feature in graph-based machine learning?

    -A node's degree is the number of edges connected to the node. It is used as a feature to capture the structure around the node, indicating the number of neighbors the node has.
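Computing this feature is a one-liner once the graph is in adjacency form; a small sketch on a hypothetical edge list:

```python
from collections import defaultdict

# Hypothetical edge list for a small undirected graph.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

# Build an adjacency structure, then read the degree off each neighbor set.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

degree = {node: len(neighbors) for node, neighbors in adj.items()}
```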

  • Can you explain the concept of Eigenvector Centrality and its significance in graph analysis?

    -Eigenvector Centrality measures a node's importance based on the importance of its neighboring nodes: a node is important if it is connected to other important nodes. It is computed as the eigenvector of the graph's adjacency matrix associated with the largest eigenvalue.
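A minimal way to compute this leading eigenvector without a linear-algebra library is power iteration on the adjacency structure. The sketch below (on a hypothetical star graph) iterates with A + I rather than A, which leaves the eigenvectors unchanged but prevents oscillation on bipartite graphs:

```python
def eigenvector_centrality(adj, iters=100):
    # Power iteration on (A + I): repeatedly set c <- (A + I) c, renormalize.
    # The iterate converges to the eigenvector associated with the largest
    # eigenvalue of A (positive and unique for a connected undirected graph,
    # by the Perron-Frobenius theorem).
    c = {v: 1.0 for v in adj}
    for _ in range(iters):
        new = {v: c[v] + sum(c[u] for u in adj[v]) for v in adj}
        norm = sum(x * x for x in new.values()) ** 0.5
        c = {v: x / norm for v, x in new.items()}
    return c

# Hypothetical star graph: hub "a" linked to three leaves.
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
cent = eigenvector_centrality(star)  # the hub ends up most central
```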

  • What is the purpose of Betweenness Centrality in graph analysis?

    -Betweenness Centrality measures how many shortest paths in the network pass through a particular node, indicating the node's role as a connector or bridge within the network.
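A brute-force sketch of this computation for a small hypothetical graph (production code would use Brandes' algorithm instead of enumerating every shortest path):

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s, t):
    # BFS from s, recording all shortest-path predecessors, then backtrack.
    dist, preds = {s: 0}, {s: []}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], preds[w] = dist[u] + 1, [u]
                q.append(w)
            elif dist[w] == dist[u] + 1:
                preds[w].append(u)

    def back(node):
        if node == s:
            return [[s]]
        return [p + [node] for u in preds[node] for p in back(u)]

    return back(t) if t in dist else []

def betweenness(adj, v):
    # Sum, over unordered pairs s, t (both != v), of the fraction of
    # shortest s-t paths that pass through v.
    total = 0.0
    for s, t in combinations([u for u in adj if u != v], 2):
        paths = shortest_paths(adj, s, t)
        if paths:
            total += sum(v in p for p in paths) / len(paths)
    return total

# Hypothetical path graph a-b-c-d: each interior node lies on two shortest paths.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
```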

  • How is Closeness Centrality different from other centrality measures?

    -Closeness Centrality scores a node by the reciprocal of the sum of its shortest-path distances to all other nodes in the network. Unlike the other centrality measures, it rewards proximity to everyone: the shorter a node's paths to the rest of the network, the more central and important it is.
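This definition translates directly into a single BFS per node; a minimal sketch on a hypothetical three-node path graph:

```python
from collections import deque

def closeness(adj, v):
    # 1 / (sum of BFS shortest-path lengths from v to every other node).
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return 1.0 / sum(dist.values())

# Hypothetical path graph a-b-c: the middle node is closest to everyone.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```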

  • What is the clustering coefficient and how does it relate to the local network structure?

    -The clustering coefficient measures how connected a node's neighbors are, indicating the likelihood that friends of a node are also friends with each other, which is a measure of the local network's transitivity.
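Concretely, the coefficient is the fraction of a node's neighbor pairs that are themselves connected (each connected pair closing a triangle); a small sketch on a hypothetical triangle-plus-pendant graph:

```python
def clustering_coefficient(adj, v):
    # Fraction of pairs of v's neighbors that are themselves connected.
    neigh = list(adj[v])
    k = len(neigh)
    if k < 2:
        return 0.0  # convention: zero for nodes with fewer than 2 neighbors
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neigh[j] in adj[neigh[i]]
    )
    return 2.0 * links / (k * (k - 1))

# Hypothetical graph: triangle a-b-c plus a pendant node d attached to c.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```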

  • Can you describe the concept of Graphlets and their role in characterizing the local structure around a node?

    -Graphlets are rooted, connected, induced, non-isomorphic subgraphs that occur in a node's local neighborhood. They generalize the idea of counting one specific structure, such as the triangles behind the clustering coefficient, to a whole family of subgraph shapes, providing a detailed characterization of the local network topology.

  • What is a Graphlet Degree Vector and how does it enhance the understanding of a node's local neighborhood?

    -A Graphlet Degree Vector is a count vector that represents the number of times a node participates in different graphlets rooted at that node. It provides a fine-grained measure of local topological similarity, offering a more detailed comparison of the structure of neighborhoods between different nodes than simple measures like node degree or clustering coefficient.
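The full GDV in the lecture counts orbits of graphlets on up to 5 nodes, giving a 73-dimensional signature per node. A minimal sketch restricted to graphlets on up to 3 nodes (4 orbits) shows the counting idea on hypothetical toy graphs:

```python
def gdv3(adj, v):
    # Orbit counts for induced graphlets on up to 3 nodes:
    #   orbit 0: endpoint of an edge (= degree)
    #   orbit 1: end node of an induced 3-node path
    #   orbit 2: middle node of an induced 3-node path
    #   orbit 3: node of a triangle
    nv = adj[v]
    o0 = len(nv)
    o3 = sum(1 for u in nv for w in nv if u < w and w in adj[u])
    o2 = o0 * (o0 - 1) // 2 - o3  # neighbor pairs that are NOT linked
    o1 = sum(1 for u in nv for w in adj[u] if w != v and w not in nv)
    return [o0, o1, o2, o3]

# Hypothetical examples: a triangle and a 3-node path.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

Note that the path's end node and the triangle's node both have degree-based features that look alike in coarser measures, while their orbit counts differ, which is exactly the fine-grained distinction the GDV is designed to make.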

Outlines

00:00

📚 Introduction to Machine Learning on Graphs

This paragraph introduces the topic of traditional machine learning methods applied to graph structures. The lecturer outlines the different levels of tasks in graph analysis, including node-level, link-level, and graph-level predictions. The importance of feature design is emphasized, with a distinction made between inherent node attributes and additional features that describe the node's position and local network structure. The goal is to create accurate predictive models by incorporating both structural and attribute features of the nodes. The traditional machine learning pipeline involves two main steps: representing data points with feature vectors and training a model such as a random forest or a support vector machine. The lecture will focus on handcrafted features, especially structural features, and is divided into three parts: features for individual nodes, for pairs of nodes, and for entire graphs, with a focus on undirected graphs.

05:01

🔍 Node-Level Prediction and Feature Design

The second paragraph delves into node-level prediction tasks within a semi-supervised learning context, where the objective is to predict the labels of uncolored nodes given a few labeled ones. The importance of structural features such as node degree is highlighted, which captures the connectivity of a node but does not distinguish between different types of neighbors. The concept of node centrality measures is introduced to capture the importance of a node within the graph, including eigenvector centrality, which considers the importance of a node's neighbors. The paragraph also touches on betweenness centrality and closeness centrality as additional centrality measures that provide different perspectives on a node's position and importance in the network.

10:02

🌐 Understanding Node Centrality and Local Structure

This paragraph continues the discussion on node centrality, explaining the computation of betweenness centrality and its significance in identifying important connectors or transit hubs within a network. Closeness centrality is also explored, emphasizing the importance of a node's proximity to all other nodes in the network. The focus then shifts to local network structure, introducing the concept of the clustering coefficient, which measures the interconnectedness of a node's neighbors. The paragraph provides examples to illustrate the computation and implications of these centrality measures and the clustering coefficient, setting the stage for a deeper exploration of local network structure in subsequent paragraphs.

15:03

🔬 Advanced Local Structure Analysis with Graphlets

The fourth paragraph introduces a more nuanced approach to analyzing local network structure through the concept of graphlets. It explains how the clustering coefficient essentially counts the triangles in a node's ego-network, and then broadens this idea to counting various pre-specified subgraph structures, or graphlets, around a node. The paragraph defines graphlets as rooted, connected, induced, non-isomorphic subgraphs and illustrates the variety of graphlets that can exist with different numbers of nodes. It also discusses the importance of triangles in social networks, where friends of friends tend to become friends as the network naturally evolves.

20:05

📊 Graphlet Degree Vector for Comprehensive Node Characterization

This paragraph builds on the concept of graphlets by introducing the Graphlet Degree Vector (GDV), a feature vector that counts the number of times a node participates in various graphlets. The GDV provides a detailed characterization of a node's local network topology, offering a more granular comparison of node neighborhoods than traditional measures like node degree or clustering coefficient. The paragraph explains how the GDV is constructed by counting instances of graphlets rooted at a given node, using examples to illustrate the process. It also discusses the potential dimensionality of the GDV, which can be quite large when considering graphlets of varying sizes, and how it captures the node's interconnections up to a certain distance.

25:06

🔑 Conclusion on Node Feature Importance and Structure

The final paragraph summarizes the various node features discussed throughout the lecture, categorizing them into importance-based features, such as node degree and centrality measures, and structure-based features, including node degree, clustering coefficient, and graphlet degree vector. It highlights the applications of these features in predicting a node's importance or role within a network, such as identifying influential users in social networks or predicting protein functions in biological networks. The paragraph concludes by emphasizing the utility of these features in understanding and predicting the behavior and structure of nodes within complex graph-based systems.

Keywords

💡 Machine Learning in Graphs

Machine Learning in Graphs refers to the application of machine learning techniques to analyze and make predictions based on graph-structured data. In the video, this concept is central as it sets the stage for discussing how traditional methods can be used for various tasks at different levels within a graph, such as node-level, link-level, and graph-level predictions.

💡 Node-Level Prediction

Node-level prediction is a task focused on making predictions for individual nodes within a graph. The script discusses this in the context of semi-supervised learning, where the model predicts the attributes or labels of nodes based on the known labels of some nodes, exemplified by the task of coloring uncolored nodes in a network based on the pattern of connections.

💡 Link-Level or Edge-Level Prediction

Link-level or edge-level prediction involves predicting whether a pair of nodes in a graph is connected or not. This is a critical task in graph analysis, and the script mentions it as one of the levels at which machine learning can be applied to understand relationships between nodes.

💡 Graph-Level Prediction

Graph-level prediction is the task of making predictions for an entire graph, such as classifying a molecule or a piece of code. The script discusses this in the context of using features derived from the entire graph to train machine learning models for such predictions.

💡 Features

In the context of the video, features refer to characteristics or attributes that are used to represent nodes, links, or entire graphs in a machine learning model. The script emphasizes the importance of designing proper features, including both attributes of nodes and structural features that describe the topology of the network.

💡 Node Attributes

Node attributes are inherent properties associated with the nodes in a graph, such as the chemical properties of proteins in a protein interaction network. The script mentions these as a type of feature that can be used to represent nodes in a machine learning model.

💡 Structural Features

Structural features describe the topological properties of a node's position within the network, such as its degree or the presence of triangles in its local neighborhood. The video script discusses how these features, including node degree, clustering coefficient, and graphlets, are crucial for making accurate predictions in graph-based machine learning.

💡 Node Degree

Node degree is a simple structural feature that represents the number of edges connected to a node. The script explains that while it is a basic measure, it does not differentiate between the importance of different neighbors, which is why more complex measures like centrality are also discussed.

💡 Node Centrality Measures

Node centrality measures are used to determine the importance or influence of a node within a graph. The script introduces several types of centrality, such as eigenvector centrality, betweenness centrality, and closeness centrality, each capturing different aspects of a node's position and connectivity within the graph.

💡 Eigenvector Centrality

Eigenvector centrality is a measure that captures the importance of a node based on the importance of its neighbors. The script explains this concept by describing it as a normalized sum of the importances of its neighbors, emphasizing that a node is important if it is connected to other important nodes.

💡 Graphlets

Graphlets are small connected subgraphs that are used to characterize the local structure around a node. The script discusses how graphlets generalize the concept of counting triangles (as in the clustering coefficient) to counting various types of small subgraph structures that a node participates in, providing a detailed signature of the node's local topology.

Highlights

Introduction to traditional machine learning methods for graph data, focusing on node-level, link-level, and graph-level prediction tasks.

Discussion on the importance of designing proper features for machine learning on graphs, emphasizing the role of node attributes and network topology.

Explanation of structural features that describe the local network structure around a node and their significance in making accurate predictions.

Introduction of the traditional machine learning pipeline involving feature representation and model training for graph data.

Emphasis on the role of handcrafted features in capturing the relational structure of networks for effective machine learning.

Division of the lecture into three parts: node-level features, link-level features, and graph-level topology structure features.

Focus on undirected graphs for simplicity and the goal of making predictions for sets of nodes, edges, or entire graphs.

Introduction of node-level tasks and features, specifically in a semi-supervised context with labeled and unlabeled nodes.

Use of node degree as a basic but important structural feature for characterizing the network around a node.

Limitations of node degree in distinguishing nodes with the same degree but different network positions.

Introduction of node centrality measures to capture the importance of a node within the graph.

Explanation of eigenvector centrality and its computation as a measure of a node's importance based on its neighbors' importance.

Introduction of betweenness centrality as a measure of a node's role as a connector or bridge within the network.

Explanation of closeness centrality as a measure of how central a node's position is within the network.

Introduction of clustering coefficient as a measure of how connected a node's neighbors are, indicating local network density.

Concept of graphlets as a generalization of clustering coefficient to count various small subgraph structures around a node.

Definition and significance of Graphlet Degree Vector as a feature that captures the local network topology around a node.

Comparison of Graphlet Degree Vector to simpler measures like node degree and clustering coefficient for detailed topological analysis.

Practical applications of node features in predicting node importance, influence, and role within networks, such as identifying celebrity users in social networks or protein function prediction.

Transcripts

play00:04

Welcome to the class. Today we are going to talk about traditional methods for,

play00:09

uh, machine learning in graphs.

play00:11

Um, and in particular,

play00:13

what we are going to investigate is, uh,

play00:15

different levels of tasks,

play00:17

uh, that we can have in the graph.

play00:19

In particular, we can think about the node-level prediction tasks.

play00:22

We can think about the link-level or edge-level prediction tasks

play00:27

that consider pairs of nodes and tries to predict whether the pair is connected or not.

play00:32

And we can think about graph-level prediction,

play00:34

where we wanna make a prediction for an entire graph.

play00:38

For example, for an entire uh,

play00:40

molecule or for a- for an entire uh, piece of code.

play00:44

The traditional machine learning pipeline um, eh,

play00:48

is all about designing proper uh, features.

play00:51

And here we are going to think of two types of features.

play00:54

We are going to assume that nodes already

play00:57

have some types of attributes associated with them.

play01:00

So this would mean, for example, if this is a protein,

play01:02

protein interaction network uh,

play01:04

proteins have different um,

play01:06

chemical structure, have different chemical properties and we can think of these

play01:10

as attributes attached to the nodes uh, of the network.

play01:14

At the same time, what we also wanna do is we wanna be able to

play01:18

create additional features that will describe how um,

play01:23

this particular node is uh,

play01:25

positioned in the rest of the network,

play01:27

and what is its local network structure.

play01:30

And these additional features that describe the topology of the network of the graph uh,

play01:35

will allow us to make more accurate predictions.

play01:38

So this means that we will always be thinking about two types of uh, features,

play01:43

structural features, as well as features

play01:45

describing the attributes and properties uh, of the nodes.

play01:49

So the goal in uh,

play01:51

in what we wanna do today is especially focused

play01:54

on structural features that will describe um,

play01:57

the structure of a link in the broader surrounding of the network,

play02:01

that will describe the structure of the um,

play02:04

network neighborhood around a given node of interest,

play02:07

as well as features that are going to describe the structure of the entire uh,

play02:12

graph so that we can then feed these features

play02:14

into machine learning models uh, to make predictions.

play02:18

Traditionally um, in traditional machine learning pipelines, we have two steps.

play02:22

In the first step, we are going to take our data points, nodes, links,

play02:26

entire graphs um, represent them um,

play02:29

with vectors of features.

play02:31

And then on top of that, we are going then to train a classical machine learning uh,

play02:36

a classifier or a model, for example,

play02:38

a random forest, perhaps a support vector machine uh,

play02:42

a feed-forward neural network um,

play02:44

something of that sort.

play02:45

So that then in the future,

play02:47

we are able to apply this model where a new node,

play02:50

link, or graph uh, appears.

play02:53

Uh, we can obtain its features um,

play02:55

and make a prediction.

play02:56

So this is the setting uh,

play02:58

generally in which we are going to um, operate today.

play03:02

So in this lecture,

play03:03

we are going to focus as I said,

play03:05

on feature design, where we are going to use effective features uh,

play03:09

over the graphs, which will be the key to obtain good predictive performance,

play03:13

because you wanna capture the structure,

play03:15

the relational structure of the network.

play03:17

Um, and traditional machine learning pipelines use hand-designed, handcrafted features.

play03:22

And today's lecture will be all about these handcrafted features.

play03:26

And we are going to split the lecture into three parts.

play03:29

First, we are going to talk about features that describe

play03:32

individual nodes and we can use them for node-level prediction,

play03:35

then we are going to move and talk about features that can des- describe a pair of nodes.

play03:39

And you can think of these as features for link-level prediction,

play03:43

and then we're also going to talk about features and

play03:45

approaches that describe topology structure of

play03:48

entire graphs so that different graphs can be compared um, and uh, classified.

play03:53

And for simplicity, we are going to- to- today focus on uh, undirected graphs.

play03:59

So the goal will be,

play04:01

how do we make predictions for a set of objects of interest where

play04:06

the design choice will be that our feature vector will be a d-dimensional vector?

play04:10

Uh, objects that we are interested in will be nodes,

play04:13

edges, sets of nodes uh,

play04:15

meaning entire graphs um,

play04:17

and the objective function we'll be thinking about is,

play04:20

what are the labels uh,

play04:22

we are trying to predict?

play04:24

So the way we can think of this is that we are given a graph as a set of vertices,

play04:29

as a set of edges,

play04:30

and we wanna learn a function that for example,

play04:32

for every node will give- will give us a real valued prediction um, which for example,

play04:38

would be useful if we're trying to predict uh,

play04:40

edge of uh, every node in our, uh social network.

play04:44

And the question is, how can we learn this function

play04:47

f that is going to make uh, these uh, predictions?

play04:50

So first, we are going to talk about

play04:53

node-level tasks and features that describe individual nodes.

play04:58

The way we are thinking of this is that we are thinking of

play05:00

this in what is known as semi-supervised uh,

play05:04

case, where we are given a network,

play05:07

we are given a couple of nodes that are labeled with different colors, for example,

play05:11

in this case, and the goal is to predict the colors of un- uh, un-uncolored uh, nodes.

play05:17

And if you look at this example here,

play05:19

so given the red and green nodes,

play05:22

we wanna color uh, the gray nodes.

play05:24

And you know, if you stir a bit at this,

play05:27

the rule here is that um,

play05:29

green nodes should have at least two a- edges adjacent to them,

play05:33

while red nodes have at least one edge uh,

play05:36

have e- exactly one edge at um, uh, connected to them.

play05:40

So if we are now able to describe um,

play05:43

the node degree of every node um,

play05:46

as a- as a structural feature uh, in this graph,

play05:49

then we will be able to learn the model that in this uh,

play05:51

simple case, correctly colors uh,

play05:54

the nodes uh, of the graph.

play05:56

So we need features that will describe this particular topological uh, pattern.

play06:02

So the goal is to characterize the structure o- um,

play06:06

of the network around a given node,

play06:08

as well as in some sense,

play06:09

the position, the location of the node in the broader network context.

play06:14

And we are going to talk about four different uh,

play06:17

approaches that allow us to do this.

play06:18

First, we can always use the degree of the node as a characterization of um,

play06:23

uh, structure of the network around the node,

play06:25

then we can think about the importance,

play06:27

the position of the node through the notion of uh,

play06:30

node centrality measures, and we are going to review a few.

play06:34

Then we will talk about um,

play06:36

characterizing the local network structure.

play06:38

Not only how many uh, uh,

play06:40

edges a given node has,

play06:42

but also what is the structure of the node around it?

play06:44

First, we are going to talk about clustering coefficient,

play06:47

and then we are going to generalize this to the concept known uh, graphlets.

play06:52

So first, let's talk about the node degree.

play06:54

We have already introduced it.

play06:56

There is nothing special,

play06:57

but it is a very useful feature,

play06:59

and many times it is um,

play07:01

quite important where basically we will say the-

play07:04

we capture the- s- the structure of the node uh,

play07:07

um, v in the network with the number of edges that this node has.

play07:12

Um, and uh, you know,

play07:13

the drawback of this is that it treats uh,

play07:15

all the neighbors equally.

play07:17

But in- and in the sense, for example,

play07:19

nodes with the same degree are indistinguishable even though if they may be in uh,

play07:23

different parts of the network.

play07:25

So for example, you know,

play07:26

the node C and the node E have the same degree,

play07:29

so our classifier uh,

play07:31

won't be able to distinguish them or perhaps node A, uh,

play07:34

H, E, uh, uh, F and G also have all degree one.

play07:38

So they will all have the same feature value,

play07:40

so the- our simple uh, machine learning model would uh,

play07:43

that would only use the node degree as a feature would we only-

play07:46

we'd- would be able to predict the same value uh,

play07:48

or would be forced to predict the same value for all these uh,

play07:52

different nodes because they have the same degree.

play07:55

So um, to generalize bi- a bit this very simple notion,

play08:00

we can then start thinking about uh, you know,

play08:03

node degree only counts neighbors of the node and uh,

play08:06

without capturing, let's say their importance or who they really are.

play08:09

So the node centrality measures try to uh,

play08:14

capture or characterize this notion of how important is the node in- in the graph?

play08:18

And there are many different ways how we can capture um,

play08:22

or model this notion of importance.

play08:25

I'm quickly going to introduce um, Eigenvector Centrality um,

play08:29

which we are going to further uh,

play08:31

work on and extend to the uh,

play08:33

seminal PageRank algorithm uh, later uh.

play08:36

In the- in the course,

play08:38

I'm going to talk about between the centrality that will tell us how- uh,

play08:43

how important connector a given node is,

play08:45

as well as closeness centrality that we'll try to capture

play08:48

how close to the center of the network a given node is.

play08:52

Um, and of course, there are many other, um,

play08:55

measures of, uh, centrality or importance.

play08:58

So first, let's define what is an eigenvector centrality.

play09:02

We say that node v is as important if it

play09:06

is su- surrounded by important neighboring nodes u.

play09:10

So the idea then is that we say that the importance of a given node v is simply, um,

play09:16

normalized divided by 1 over Lambda,

play09:18

sum over the importances of its neighbors,

play09:22

uh, uh, in the network.

play09:23

So the idea is, the more important my friends are,

play09:26

the higher my own importance, uh, is.

play09:29

And if you-if you look at this,

play09:31

um, and you've write it down,

play09:32

you can write this in terms of, uh,

play09:35

a simple, uh, uh,

play09:36

matrix equation where basically, Lambda is this,

play09:39

uh, uh, positive constant like a normalizing factor,

play09:42

c is the vector of our centrality measures,

play09:45

A is now, uh, graph adjacency matrix,

play09:47

and c is again that vector of centrality measures.

play09:50

And if you write this in this type of, uh, forum,

play09:53

then you see that this is a simple, um,

play09:55

eigenvector, eigenvalue, uh, equation.

play09:58

So what this means is that, um,

play10:00

the solution to this, uh, uh,

play10:02

problem, uh, here- to this equation here, um,

play10:05

is, um, um, this is solved by the, uh,

play10:08

la- uh, by the given eigenvalue and the associated eigenvector.

play10:12

And what people take as, uh, uh,

play10:16

measure of nodes centrality is- is

play10:18

the eigenvector that is associated with the largest, um, eigenvalue.

play10:23

So in this case, if eigen- largest eigenvalue is Lambda max, um, be- uh,

play10:27

by Perron–Frobenius theorem, uh,

play10:30

because we are thinking of the graph is undirected,

play10:32

it is always positive, uh, and unique.

play10:34

Then, uh, the associated leading eigenvector c_max

play10:39

is usually used as a centrality score for the nodes in the network.

play10:43

And again, um, uh,

play10:44

the intuition is that the importance of a node is the sum,

play10:48

the normalized sum of the importances of the nodes that it links to.

play10:52

So it is not about how many connections you have but it is, uh,

play10:55

how, um, who these connections point to and how important,

play10:59

uh, are those people.

play11:00

So this is the notion of, uh,

play11:02

nodes centrality captured by the eigenvector, uh, centrality.

play11:06

A different type of centrality, with a very different intuition, capturing a different aspect of a node's position in the network, is called betweenness centrality. Betweenness centrality says that a node is important if it lies on many shortest paths between other pairs of nodes. The idea is that if a node is an important connector, an important bridge, an important kind of transit hub, then it has high importance. The way we compute the betweenness centrality of a given node v is as a summation over pairs of nodes s and t: we count how many shortest paths between s and t pass through the node v, and we normalize that by the total number of shortest paths of the same length between s and t. So essentially, the more shortest paths a given node appears on, the more important it is. This measures how good a connector, how good a transit point, a given node is.

If we look at the example here, the betweenness centrality of the nodes on the edge of the network, like A, B, and E, is zero, but the betweenness centrality of node C equals three, because the shortest path from A to B passes through C, the shortest path from A to D passes through C, and the shortest path from A to E again passes through C. These are the three shortest paths that pass through node C, so its betweenness centrality equals three. By a similar argument, the betweenness centrality of node D will be the same, equal to three.
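A brute-force way to compute this is to run a BFS from every node to get shortest-path distances and counts, then sum the fraction of s-t shortest paths on which v is an interior node. This is my own sketch of that idea (production implementations use Brandes' algorithm instead), and the example graph is my reconstruction of the one on the slide, chosen to be consistent with the stated values:

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, s):
    """Return (dist, sigma): BFS distances from s and the number
    of distinct shortest paths from s to every reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj, v):
    """Sum over pairs (s, t) of the fraction of s-t shortest
    paths that pass through v as an interior node."""
    info = {s: bfs_counts(adj, s) for s in adj}
    dist_v, sigma_v = info[v]
    total = 0.0
    for s, t in combinations([u for u in adj if u != v], 2):
        dist_s, sigma_s = info[s]
        # v lies on an s-t shortest path iff distances add up.
        if dist_s[v] + dist_v[t] == dist_s[t]:
            total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total

# Assumed reconstruction of the lecture's example graph.
G = {"A": {"C"}, "B": {"C", "D"}, "C": {"A", "B", "D"},
     "D": {"B", "C", "E"}, "E": {"D"}}
```

On this graph the endpoint nodes get betweenness zero, while C and D each lie on three shortest paths, matching the example.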

Here on the slide are the corresponding shortest paths between different pairs of nodes that pass through node D. Now that we have talked about how good a transit hub a given node is, as captured by betweenness centrality, the third type of centrality, which again captures a different aspect of the position of the node, is called closeness centrality.

This notion of centrality says that a node is important if it has small shortest-path lengths to all other nodes in the network. Essentially, the more central you are, the shorter your paths to everyone else, and the more important you are. The way we operationalize this is to say that the closeness centrality of a given node v is one over the sum of the shortest-path lengths between the node of interest v and every other node u in the network.

To give an example: the more central you are, the smaller this summation will be, so one over a smaller number will be a bigger number. And if somebody sits very far out on the edge of the network and needs long paths to reach the other nodes, then their closeness centrality will be low because the sum of the shortest-path lengths will be high. In this case, for example, the closeness centrality of node A equals 1/8, because it has a shortest path of length two to node B, of length one to node C, of length two to D, and of length three to E. Node D, which is a bit more in the center of the network, has a shortest path of length two to node A and of length one to all other nodes in the network, so its closeness centrality is 1/5, which is higher.
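In code, this definition is just a BFS from v followed by a reciprocal. A minimal sketch (my own, again using the assumed reconstruction of the example graph, which matches the stated 1/8 and 1/5 values):

```python
from collections import deque

def closeness(adj, v):
    """1 / (sum of shortest-path lengths from v to all other nodes)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return 1.0 / sum(dist.values())

# Assumed reconstruction of the lecture's example graph.
G = {"A": {"C"}, "B": {"C", "D"}, "C": {"A", "B", "D"},
     "D": {"B", "C", "E"}, "E": {"D"}}
```

For node A the distances sum to 2 + 1 + 2 + 3 = 8, and for node D to 2 + 1 + 1 + 1 = 5, reproducing the lecture's numbers.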

And now I'm going to shift gears a bit. I was talking about centralities in terms of how important the position of the node in the network is. Now we are going back to thinking about the node degree and the local structure around the node. When I say local structure, it really means that for a given node, we only look at its immediate vicinity and characterize the properties of the network around it.

A classical measure of this is called the clustering coefficient. The clustering coefficient measures how connected one's neighbors are, that is, how connected the friends of node v are. The way we define this is to say that the clustering coefficient of node v is the number of edges among the neighboring nodes of v, divided by the degree of v choose 2. Here, k choose 2 measures how many pairs you can select out of k different objects, so the denominator says how many node pairs, and thus how many potential edges, exist in your neighborhood, while the numerator says how many edges actually occur. This metric is naturally between zero and one, where zero means that none of your connections know each other, and one means that all your friends are also friends with each other.

Here is an example. Imagine this simple graph on five nodes, with the red node v as the node of interest. In the first case, node v has a clustering coefficient of one, because all of its four friends are also connected with each other. In the second case, the clustering coefficient is 0.5, the reason being that out of the six possible connections between the four neighboring nodes of v, only three are present. And in the last example, the clustering coefficient is zero, because out of the four neighbors of node v, none of them are connected with each other.
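The definition translates directly into counting edges among a node's neighbors. A small sketch (my own illustration; the three example graphs are assumed stand-ins for the ones on the slide, a node v with four neighbors sharing six, three, and zero edges):

```python
def clustering_coefficient(adj, v):
    """Edges among v's neighbors, divided by deg(v) choose 2."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count how many of the k-choose-2 neighbor pairs are connected.
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return links / (k * (k - 1) / 2)

def undirected(edges):
    """Build an adjacency dict (node -> set of neighbors)."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj

# v with 4 neighbors: all 6 neighbor pairs connected, 3 of 6, or 0 of 6.
spokes = [("v", n) for n in "abcd"]
full = undirected(spokes + [("a", "b"), ("a", "c"), ("a", "d"),
                            ("b", "c"), ("b", "d"), ("c", "d")])
half = undirected(spokes + [("a", "b"), ("b", "c"), ("c", "d")])
none = undirected(spokes)
```

These three graphs give clustering coefficients of 1.0, 0.5, and 0.0 for v, mirroring the three slide examples.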

The interesting observation, which then leads us to generalize the notion of the clustering coefficient to the notion of graphlets, is that the clustering coefficient basically counts the number of triangles in the ego-network of the node. Let me explain what I mean by that. First, the ego-network of a given node is simply the network induced by the node itself and its neighbors; it's basically the degree-1 neighborhood network around a given node. And what do I mean by counting triangles? I mean that if I take this ego-network of a node, I can count how many triples of nodes are connected. In the particular case where the clustering coefficient of the node is 0.5, it means I find three triangles in the ego-network of my node of interest.

So the clustering coefficient is really counting these triangles, which in social networks are very important, because in social networks a friend of my friend is also a friend of mine. Social networks naturally evolve by triangle closing: the intuition is that if a node has two friends, then sooner or later these two friends will be introduced by this node v, and a link will form between them. So social networks tend to have a lot of triangles, and the clustering coefficient is a very important metric.

So now the question is: could we generalize this notion of triangle counting to more interesting structures, and count the number of pre-specified subgraphs in the neighborhood of a given node? This is exactly what the concept of a graphlet captures. The last way to characterize the structure of the network around a given node will be through this concept of graphlets, which, rather than just counting triangles, also counts other types of structures around the node. So let me define that.

A graphlet is a rooted, connected, non-isomorphic subgraph. What do I mean by this? Here, for example, are all possible graphlets on different numbers of nodes, starting with the graphlet on two nodes, which is simply two nodes connected by an edge. There are three possible graphlets on three nodes, then the graphlets on four nodes, and then the graphlets on five nodes. Let me explain what we are looking at. If you look at the graphlets on three nodes, the graph is either a chain on three nodes or a triangle, where all three nodes are connected. These are all possible connected graphs on three nodes.

Now, why do I say there are three graphlets, not two? The reason is that the position of the node of interest also matters. Here, for example, you can be at the end of the chain, and the question is how many graphs like this you participate in from that position. Or you can be at the other position, position number two in the middle, which is basically asking how many pairs of friends you have. In the case of the triangle, all positions are isomorphic, they're equivalent, so there is only one position in which you can be.

Similarly, if you look at the four-node graphlets, there are many more of them. You again have a chain on four nodes, with two distinct positions on it: you are either at the end, or one away from the end; going from the other end is just symmetric. In the second, star-shaped graphlet, you can either be a satellite on the edge of the star or the center of the star. In the square, all positions are isomorphic, so you can just be part of the square. Here is another interesting example, where you have three different positions. And in the square with a diagonal, you again have two different positions, since the two sides are isomorphic to each other. Finally, in the fully connected graph on four nodes, all nodes are equivalent, so all positions are equivalent and there is only one.

What this shows is that if you ask how many graphlets there are on two to five nodes, there are 73 of them, labeled from 1 all the way to 73, because we count different graphs as well as different positions within those graphs.

So now that we know what graphlets are, we can define what is called the Graphlet Degree Vector, which gives graphlet-based features for nodes. The graphlet degree counts the number of times a given graphlet appears rooted at a given node. The way you can think of this is: the degree counts the number of edges that the node touches, the clustering coefficient counts the number of triangles that the node touches or participates in, and the graphlet degree vector counts the number of graphlets that a node participates in. So the Graphlet Degree Vector is simply a count vector of graphlets rooted at a given node.

To give an example, consider this particular graph, where we are interested in node v, and here is the set of graphlets on two and three nodes. This is our universe of graphlets: we are only going to look at graphlets up to a size of three nodes, with node v as the root. Now, what are the graphlet instances? For graphlets of type a, node v participates in two of them, one for each of its two edges. For the graphlet of type b, node v participates in one instance, shown here. Then you ask how many graphlets of type c node v participates in, and it doesn't participate in any, because the two nodes in question are also connected to each other: since graphlets are induced subgraphs, that extra edge appears here as well, so for c we get zero. How about for d? If you look at the graph, there are two instances of d, as illustrated on the slide. So the graphlet degree vector for node v would be two, one, zero, two. This now characterizes the local neighborhood structure around the given node of interest, based on the frequencies of the graphlets that node v participates in.
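As a toy sketch of this counting (my own, not the lecture's code), here is a graphlet-degree count restricted to the 2- and 3-node graphlets, assuming the orbit order edge, end-of-path, middle-of-path, triangle. The example graph and its resulting vector are hypothetical, not the slide's:

```python
from itertools import combinations

def gdv_upto3(adj, v):
    """Count graphlet orbits on 2-3 nodes rooted at v, in the assumed
    order [edge, end of 2-path, middle of 2-path, triangle].
    Graphlets are induced, so a path only counts if the edge that
    would close it into a triangle is absent."""
    edge = len(adj[v])  # v as an endpoint of an edge
    # v at the end of an induced path v-u-w (v-w must not be an edge).
    end = sum(1 for u in adj[v] for w in adj[u]
              if w != v and w not in adj[v])
    mid = tri = 0
    for u, w in combinations(adj[v], 2):
        if w in adj[u]:
            tri += 1   # triangle v-u-w
        else:
            mid += 1   # induced path u-v-w with v in the middle
    return [edge, end, mid, tri]

# A hypothetical graph: v-u1, v-u2, u1-u2, u1-x.
G = {"v": {"u1", "u2"}, "u1": {"v", "u2", "x"},
     "u2": {"v", "u1"}, "x": {"u1"}}
```

Here v has degree 2, one induced 2-path ending at it (v-u1-x), no induced path through its middle (because u1-u2 is an edge), and one triangle, giving the vector [2, 1, 0, 1] for this particular graph.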

So, if we consider graphlets on two to five nodes, we can now describe every node in the network with a vector that has 73 coordinates. This is essentially a signature of a node that describes the topology of the node's neighborhood, and it captures its interconnections all the way up to a distance of four hops. Why four? Because a chain of four edges has five nodes, so if you are at the end of the chain, you count how many paths of length four lead out of that node; the vector therefore characterizes the network up to distance four. The Graphlet Degree Vector provides a measure of a node's local network topology, and comparing the vectors of two nodes provides a more detailed measure of local topological similarity than, for example, just looking at node degree or the clustering coefficient. So this gives a really fine-grained way to compare the neighborhood structure of two different nodes, perhaps even across different networks.

To conclude, so far we have introduced different ways to obtain node features. They can be categorized into importance-based features, like node degree and the different centrality measures, and structure-based features, where again node degree is the simplest one, counting edges; the clustering coefficient counts triangles; and the Graphlet Degree Vector is a generalization that counts other types of structures that a given node of interest participates in.

To summarize, the importance-based features capture the importance of a node in a graph. We talked about degree centrality, and about different notions of centrality: closeness, betweenness, as well as eigenvector centrality. These types of features are used in predicting, for example, how important or influential nodes in the graph are; identifying celebrity users in social networks would be one such example. The other type of node-level features that we talked about were structure-based features, which capture the topological properties of the local neighborhood around the node. We talked about node degree, the clustering coefficient, and the graphlet degree vector, and these types of features are very useful for predicting a particular node's role in the network. For example, if you think about predicting protein function, these graphlet features are very useful, because they characterize the structure of the network around a given node.
