M4ML - Linear Algebra - 5.7 Introduction to PageRank

Digital Learning Hub - Imperial College London
15 Nov 2019 · 08:44

Summary

TL;DR: The video script explains the PageRank algorithm, developed by Google's Larry Page, which determines website importance based on link relationships. It introduces a model using a mini Internet of web pages and links, represented as a matrix. The script describes how to calculate page ranks using the link matrix and an iterative process, leading to an eigenvector with an eigenvalue of 1. The concept of a damping factor is also mentioned, which accounts for random web address entry. The summary highlights the algorithm's evolution and its core concept's resilience in modern search ranking.

Takeaways

  • 📜 The PageRank algorithm was developed by Larry Page and colleagues in 1998 to rank websites based on links to and from other sites.
  • 🌐 The importance of a webpage in PageRank is related to its connections with other pages, calculated using Eigen theory.
  • 🔗 A webpage's link structure is modeled using vectors that describe the probability of transitioning between pages based on the available links.
  • 📊 The link matrix (L) is built using link vectors from each webpage, representing the probabilities of moving from one page to another.
  • 🧮 The rank of a page is determined iteratively by multiplying a rank vector (r) by the link matrix, updating the rank values until convergence.
  • 🔄 The process of finding the ranks is iterative, repeatedly adjusting the rank values until they stop changing, representing equilibrium.
  • 🧠 The final rank vector (r) is an eigenvector of the link matrix L, with an eigenvalue of 1, indicating the page's importance.
  • 💡 The power method is used to calculate PageRank, which effectively finds the desired eigenvector with an eigenvalue of 1.
  • 🎛️ A damping factor (d) is introduced in the algorithm to simulate the chance of a user randomly navigating to a new page, improving the stability of the ranking process.
  • 💻 Although the internet has grown to billions of websites, the core PageRank concept remains the same, with algorithms evolving for efficiency in handling large, sparse matrices.

Q & A

  • What is the PageRank algorithm?

    -The PageRank algorithm is a way of measuring the importance of website pages. It was developed by Larry Page and colleagues and is used by Google to rank web pages in their search results. It works on the assumption that more important pages are likely to be linked to by more pages.

  • How does the importance of a website relate to its links according to PageRank?

    -In the PageRank algorithm, the importance of a website is related to the number and quality of links to and from other websites. A page that is linked to by many other pages is considered more important and thus is likely to rank higher in search results.

  • What is the concept of 'Procrastinating Pat' in the context of PageRank?

    -Procrastinating Pat is an imaginary user who randomly clicks on links to avoid work. The concept is used to model how a user might navigate the web, and it helps in estimating the amount of time a user is expected to spend on each webpage, which in turn influences the page's ranking.

  • How are link vectors represented in the PageRank model?

    -In the PageRank model, link vectors are represented as vectors where each element corresponds to a webpage. An element is '1' if there is a link to the corresponding page and '0' if there is no link. These vectors are then normalized to represent probabilities.

  • What is the significance of normalizing link vectors?

    -Normalizing link vectors ensures that the sum of the probabilities equals one, which is necessary for them to be used as probabilities in the PageRank calculations. This means that the total probability of moving from one page to any other page sums up to 100%.

  • How is the link matrix L constructed in the PageRank algorithm?

    -The link matrix L is constructed by using the normalized link vectors as its columns, giving a square matrix. Each column holds one page's outgoing-link probabilities, and each row collects the inward links to a page, each normalized by the number of outgoing links on its page of origin.

  • What does the self-referential nature of the PageRank problem imply?

    -The self-referential nature of the PageRank problem implies that the rank of each page depends on the ranks of all other pages. This creates a system of interdependent equations that need to be solved simultaneously.

  • How is the rank of a page calculated in the PageRank algorithm?

    -The rank of a page is calculated by summing the ranks of all pages that link to it, each weighted by the corresponding link probability from the link matrix L. This reflects the idea that a page's importance is influenced by the importance of the pages that link to it (see the short worked example after this Q&A list).

  • What is the iterative approach used in the PageRank algorithm?

    -The iterative approach in the PageRank algorithm involves repeatedly multiplying the current rank vector by the link matrix L until the rank vector stabilizes. This process is known as power iteration and converges to the principal eigenvector of the matrix, which represents the stable page ranks.

  • What is the damping factor in the PageRank algorithm and why is it used?

    -The damping factor, denoted 'd', is a value between 0 and 1 that represents the probability that a user will follow a link from the current page rather than typing a new URL into the address bar. Adding this random-jump term to the iterative formula gives a compromise between the speed and the stability of the convergence process.

  • How has the PageRank algorithm evolved since its initial publication?

    -Since its initial publication, the PageRank algorithm has evolved to handle the vast increase in the number of websites on the Internet. While the core concept of using link analysis to determine page importance remains the same, the methods for efficiently calculating page ranks have had to become more sophisticated to manage the scale and complexity of the modern web.
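
As a short worked example of the rank sum described above (using link values consistent with the video's four-page mini Internet, in which page B is the only page linking to A and has two outgoing links): once the ranks have converged, r_A = L_(A,B) × r_B = 1/2 × 0.24 = 0.12, which matches the roughly 12% of browsing time the video attributes to page A.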

Outlines

00:00

🌐 Introduction to PageRank Algorithm

The video script introduces the PageRank algorithm, developed by Google's Larry Page and colleagues in 1998. It's a method used by Google to determine the order in which search results are displayed. The algorithm is based on the concept that a website's importance is related to its links with other websites. A model mini Internet is described, where webpages are bubbles and links are arrows. The importance of a webpage is determined by the probability of a random surfer landing on it, represented by a vector of links normalized to describe probabilities. The script explains how to create a link matrix from these vectors and how this matrix represents the probability of landing on each page. The concept of iteratively updating the rank vector until it converges to an eigenvector of the link matrix with an eigenvalue of 1 is discussed.
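
To make this concrete, here is a minimal Python sketch (not from the video) that builds the link matrix for a four-page network like the one described. The columns for pages A and D follow the vectors quoted in the video; the columns for B and C (B linking to A and D, C linking only to D) are assumptions chosen to be consistent with the final ranks the video reports.

```python
import numpy as np

# Columns are the normalised outgoing-link vectors of pages A, B, C, D.
# A and D are taken from the video; B and C are assumed (see note above).
L = np.array([[0,   1/2, 0, 0  ],   # links INTO page A
              [1/3, 0,   0, 1/2],   # links INTO page B
              [1/3, 0,   0, 1/2],   # links INTO page C
              [1/3, 1/2, 1, 0  ]])  # links INTO page D

# Every column sums to 1, so each column is a valid probability vector.
assert np.allclose(L.sum(axis=0), 1.0)

# Initial guess: Procrastinating Pat is equally likely to be on any page.
r = np.full(4, 1 / 4)

# One update of the rank vector: r_{i+1} = L r_i.
r = L @ r
print(r)
```

Because every column sums to one, multiplying by L simply redistributes the total rank among the pages rather than growing or shrinking it.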

05:02

🔄 Iterative Process and Damping Factor in PageRank

The script continues by explaining the iterative process of updating the rank vector until it stabilizes, which signifies that the vector has become an eigenvector of the link matrix with an eigenvalue of 1. It mentions that while diagonalization could be used, it's impractical since it requires knowledge of all eigenvectors beforehand. The script then discusses the power method as an effective approach for the PageRank problem due to the structure of the link matrix and the sparsity of the real-world web. The concept of the damping factor 'd' is introduced, which accounts for the probability of a random jump in the browsing process. The script concludes by noting the evolution of search and ranking methods to handle the growth of the Internet while maintaining the core concept of PageRank.
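
The iteration itself fits in a few lines. The sketch below assumes the same hypothetical four-page link matrix as above; with d = 1 it reproduces the undamped ranks of roughly 12%, 24%, 24% and 40%, and d ≈ 0.85 is a commonly used damping value in practice.

```python
import numpy as np

def pagerank(L, d=1.0, tol=1e-8, max_iter=1000):
    """Iterate r_{i+1} = d * L r_i + (1 - d) / n until r stops changing."""
    n = L.shape[0]
    r = np.full(n, 1 / n)                      # start with equal ranks
    for _ in range(max_iter):
        r_next = d * (L @ r) + (1 - d) / n
        if np.linalg.norm(r_next - r) < tol:   # converged: r is an eigenvector
            return r_next
        r = r_next
    return r

# Same assumed link matrix as in the earlier sketch.
L = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2], [1/3, 1/2, 1, 0]])

print(pagerank(L))          # ~ [0.12, 0.24, 0.24, 0.40] -> D first, B and C, then A
print(pagerank(L, d=0.85))  # damped variant: same ordering, values pulled towards uniform
```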

Keywords

💡PageRank

PageRank is an algorithm used by Google to rank websites in their search engine results. It was named after Larry Page, one of the founders of Google, and was first published in 1998. The algorithm is based on the concept that a website's importance can be determined by the number and quality of links to and from other websites. In the video, PageRank is explained as a model for estimating the relevance of web pages to a search query, using the example of an imaginary person, Procrastinating Pat, who randomly clicks links to navigate the web.

💡Eigenproblems

Eigenproblems refer to mathematical problems involving eigenvalues and eigenvectors. In the context of the video, the PageRank algorithm is related to eigentheory because it involves finding an eigenvector of a matrix that represents the link structure of web pages. The video explains how the PageRank algorithm can be expressed as an eigenvector problem, where the eigenvector corresponds to the ranking of web pages.

💡Procrastinating Pat

Procrastinating Pat is a hypothetical character introduced in the video to illustrate how the PageRank algorithm models web browsing behavior. Pat is imagined to randomly click on links to avoid work, and the algorithm estimates the amount of time Pat would spend on each webpage based on the network of links. This character serves to simplify the concept of how link probabilities are used to calculate the importance of web pages.

💡Link Matrix

A link matrix is a square matrix used in the PageRank algorithm to represent the structure of links between web pages. Each column of the matrix corresponds to a web page and shows the links to other pages. The video describes how these matrices are constructed and used to calculate the probability of landing on each page, which is essential for determining the PageRank of each web page.

💡Eigenvector

An eigenvector is a vector that, when multiplied by a matrix, results in a scaled version of the same vector. In the context of the video, the PageRank algorithm seeks to find the eigenvector of the link matrix that corresponds to an eigenvalue of 1. This eigenvector represents the steady-state distribution of page ranks, indicating the relative importance of web pages.
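
As a quick numerical check (again assuming the hypothetical four-page link matrix used in the sketches above), numpy's eigensolver confirms that the converged rank vector is an eigenvector of L with eigenvalue 1:

```python
import numpy as np

L = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2], [1/3, 1/2, 1, 0]])

eigenvalues, eigenvectors = np.linalg.eig(L)

# Pick the eigenvector whose eigenvalue is (numerically) 1.
k = np.argmin(np.abs(eigenvalues - 1))
r = np.real(eigenvectors[:, k])
r = r / r.sum()                 # rescale so the ranks sum to 1, like probabilities

print(r)                        # ~ [0.12, 0.24, 0.24, 0.40]
print(np.allclose(L @ r, r))    # True: L r = 1 * r
```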

💡Normalization

Normalization in the video refers to the process of adjusting the link vectors so that they represent probabilities. This is done by dividing each link vector by the total number of links on a page, ensuring that the sum of probabilities equals one. Normalization is crucial for the PageRank algorithm to work correctly, as it allows the algorithm to interpret link vectors as probabilities.

💡Iterative Process

The iterative process mentioned in the video is the method used to approximate the PageRank vector. It involves repeatedly multiplying the current estimate of the rank vector by the link matrix until the vector converges to a stable state. This process is essential for calculating the PageRank because it allows the algorithm to update and refine the rankings of web pages until they reach a consistent state.

💡Damping Factor

The damping factor, denoted as 'd' in the video, is a parameter in the PageRank algorithm that accounts for the possibility that a user may not always follow a link but instead enter a new URL directly. It is a value between 0 and 1, and it modifies the iterative formula used to calculate the PageRank. The damping factor helps to balance the convergence speed and stability of the algorithm.

💡Power Method

The power method is an iterative numerical technique for finding the largest eigenvalue and corresponding eigenvector of a matrix. In the video, it is mentioned as an effective method for solving the PageRank problem. The method involves multiplying the initial guess vector by the link matrix repeatedly until convergence. The power method is particularly well-suited for the PageRank algorithm due to the sparse nature of the link matrix.
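
For reference, here is a generic power-method sketch (illustrative only, not tied to the video's example): repeated multiplication followed by renormalisation drives a random starting vector towards the eigenvector with the largest-magnitude eigenvalue.

```python
import numpy as np

def power_method(A, iterations=200, seed=0):
    """Estimate the dominant eigenvalue/eigenvector of A by repeated multiplication."""
    rng = np.random.default_rng(seed)
    v = rng.random(A.shape[0])           # random initial guess vector
    for _ in range(iterations):
        w = A @ v
        v = w / np.linalg.norm(w)        # renormalise to avoid overflow/underflow
    eigenvalue = v @ A @ v               # Rayleigh-quotient estimate of the eigenvalue
    return eigenvalue, v
```

For a column-stochastic link matrix the dominant eigenvalue is exactly 1 and the total rank is preserved at every step, which is why the PageRank iteration can skip the renormalisation.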

💡Sparse Matrix

A sparse matrix is a matrix in which most of the elements are zero. In the context of the video, the link matrix representing the web is described as sparse because most web pages do not link to most other pages. The sparsity of the matrix is important because it allows for more efficient computation of the PageRank, as algorithms can be optimized to handle the large number of zeros.
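
A rough sketch (assuming SciPy is available) of why sparsity matters: storing only the non-zero entries makes each multiplication in the power iteration cost time proportional to the number of links rather than to n². The numbers below are placeholders purely to illustrate the storage pattern, not a properly normalised link matrix.

```python
import numpy as np
from scipy import sparse

n = 1_000_000                            # imagine a million pages...
rng = np.random.default_rng(0)
rows = rng.integers(0, n, size=5 * n)    # ...with about five links per page
cols = rng.integers(0, n, size=5 * n)
vals = np.full(5 * n, 0.2)               # placeholder link probabilities

# Only the ~5n non-zero entries are stored, not the full n x n matrix.
L = sparse.csr_matrix((vals, (rows, cols)), shape=(n, n))

r = np.full(n, 1 / n)
r = L @ r                                # one iteration step: ~5n operations, not n^2
```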

Highlights

Introduction to PageRank algorithm, developed by Google founder Larry Page.

PageRank determines website importance based on links to and from other websites.

Eigen theory is fundamental to the PageRank algorithm.

Bubble diagram represents a mini Internet model for understanding link structures.

Procrastinating Pat model simulates random web browsing to estimate time spent on each webpage.

Link vectors are used to describe the structure of webpage links.

Normalization of link vectors to represent probabilities.

Construction of the link matrix L from individual link vectors.

The self-referential nature of PageRank where page ranks depend on each other.

Matrix L represents the probability of landing on each webpage.

Vector r is introduced to store the rank of all webpages.

Expression for calculating the rank of a page based on links and rank of other pages.

Matrix multiplication r = Lr to calculate ranks for all pages simultaneously.

Iterative approach to solving PageRank by updating rank vector r.

Convergence of rank vector r to an eigenvector of matrix L with eigenvalue 1.

The power method as an effective approach for the PageRank problem.

Efficiency of the power method due to the sparse nature of the web's link matrix.

Introduction of the damping factor d to the PageRank algorithm.

Damping factor d accounts for the chance that a user types a new web address rather than following a link.

Evolution of search and ranking methods to handle over one billion websites.

Enduring core concept of PageRank despite changes in the scale of the Internet.

Transcripts

00:00

[MUSIC]

00:02

The final topic of this module on Eigenproblems, as well as the final topic of this course as a whole, will focus on an algorithm called PageRank. This algorithm was famously published by, and named after, Google founder Larry Page and colleagues in 1998, and was used by Google to help them decide which order to display their websites when they returned from search.

00:24

The central assumption underpinning PageRank is that the importance of a website is related to its links to and from other websites, and somehow Eigen theory comes up. This bubble diagram represents a model mini Internet, where each bubble is a webpage and each arrow from A, B, C, and D represents a link on that webpage which takes you to one of the others. We're trying to build an expression that tells us, based on this network structure, which of these webpages is most relevant to the person who made the search.

00:57

As such, we're going to use the concept of Procrastinating Pat, who is an imaginary person who goes on the Internet and just randomly clicks links to avoid doing their work. By mapping all the possible links, we can build a model to estimate the amount of time we would expect Pat to spend on each webpage.

01:13

We can describe the links present on page A as a vector, where each row is either a one or a zero based on whether there is a link to the corresponding page, and then normalise the vector by the total number of links, such that it can be used to describe a probability for that page. For example, the vector of links from page A will be 0 1 1 1, because page A has links to sites B, C and D, but it doesn't have a link to itself. Also, because there are three links on this page in total, we would normalise by a factor of a third, so the total click probability sums to one. So we can write L_A = (0, 1/3, 1/3, 1/3).

02:02

Following the same logic, the link vectors for the next two sites are shown here. And finally, for page D, which is connected to B and C but not A, two sites in total, we can write L_D = (0, 1/2, 1/2, 0).

02:22

We can now build our link matrix L by using each of our link vectors as a column, which you can see will form a square matrix. What we're trying to represent here with our matrix L is the probability of ending up on each of the pages. For example, the only way to get to A is by being at B, so you then need to know the probability of being at B, which you could have got to from either A or D. As you can see, this problem is self-referential, as the ranks of all the pages depend on all the others. Although we built our matrix from columns of outward links, we can see that the rows describe inward links normalised with respect to their page of origin.

03:08

We can now write an expression which summarises the approach. We're going to use the vector r to store the rank of all webpages. To calculate the rank of page A, you need to know three things about all other pages on the Internet: what's your rank, do you link to page A, and how many outgoing links do you have in total? The following expression combines these three pieces of information for webpage A only. So r_A is going to equal the sum, from j = 1 to n (where n is the number of webpages), of the link matrix entry relevant to webpage A at location j, multiplied by the rank at location j. So this is going to scroll through each of our webpages, which means that the rank of A is the sum of the ranks of all the pages which link to it, weighted by their specific link probability taken from matrix L.

04:03

Now we want to be able to write this expression for all pages and solve them simultaneously. Thinking back to our linear algebra, we can rewrite the above expression applied to all webpages as a simple matrix multiplication: r = Lr. Clearly, we start off not knowing r, so we simply assume that all the ranks are equal and normalise them by the total number of webpages in our analysis, which in this case is 4. So r = (1/4, 1/4, 1/4, 1/4). Then, each time we multiply r by our matrix L, this gives us an updated value for r, so we can say that r_(i+1) is going to be L times r_i.

04:58

Applying this expression repeatedly means that we are solving this problem iteratively. Each time we do this, we update the values in r until, eventually, r stops changing, so now r really does equal Lr. Thinking back to the previous videos in this module, this implies that r is now an eigenvector of matrix L, with an eigenvalue of 1.

05:21

At this point, you might well be thinking that if we want to multiply r by L many times, perhaps this would be best tackled by applying the diagonalisation method that we saw in the last video. But don't forget, this would require us to already know all of the eigenvectors, which is what we're trying to find in the first place. So now that we have an equation, and hopefully some idea of where it came from, we can ask our computer to iteratively apply it until it converges to find our rank vector.

05:51

You can see that although it takes about ten iterations for the numbers to settle down, the order is already established after the first iteration. However, this is just an artifact of our system being so tiny. So now we have our result, which says that as Procrastinating Pat randomly clicks around our network, we'd expect them to spend about 40% of their time on page D, but only about 12% of their time on page A, with 24% on each of pages B and C. We now have our ranking, with D at the top and A at the bottom, and B and C equal in the middle.

06:26

As it turns out, although there are many approaches for efficiently calculating eigenvectors that have been developed over the years, repeatedly multiplying a randomly selected initial guess vector by a matrix, which is called the power method, is still very effective for the PageRank problem for two reasons. Firstly, although the power method will clearly only give you one eigenvector, when we know that there will be n for an n-webpage system, it turns out that because of the way we've structured our link matrix, the vector it gives you will always be the one that you're looking for, with an eigenvalue of 1. Secondly, although this is not true for our four-webpage mini Internet, when looking at the real Internet you can imagine that almost every entry in the link matrix will be zero, i.e., most pages don't connect to most other pages. This is referred to as a sparse matrix, and algorithms exist such that the multiplications can be performed very efficiently.

07:21

One key aspect of the PageRank algorithm that we haven't discussed so far is called the damping factor, d. This adds an additional term to our iterative formula, so r_(i+1) is now going to equal d times L r_i, plus (1 - d) over n. d is something between 0 and 1, and you can think of it as 1 minus the probability with which Procrastinating Pat suddenly, randomly types in a web address, rather than clicking on a link on their current page. The effect that this has on the actual calculation is about finding a compromise between the speed and stability of the iterative convergence process.

08:06

There are over one billion websites on the Internet today, compared with just a few million when the PageRank algorithm was first published in 1998, and so the methods for search and ranking have had to evolve to maximise efficiency, although the core concept has remained unchanged for many years. This brings us to the end of our introduction to the PageRank algorithm. There are, of course, many details which we didn't cover in this video, but I hope this has allowed you to come away with some insight and understanding into how PageRank works, and hopefully the confidence to apply this to some larger networks yourself.

08:40

[MUSIC]


Related Tags

PageRank, Google, Algorithm, Web Ranking, Larry Page, Search Engine, Link Analysis, Internet Model, Eigenvector, Damping Factor