How does Google Search work?

Google Search Central
23 Apr 2012 | 07:45

Summary

TLDR: In this video, Matt Cutts of Google explains how Google Search works, covering crawling, indexing, and ranking. He describes how Google crawls the web comprehensively, using PageRank as the primary determinant of which pages are crawled first. Indexing inverts the data so that each word maps to the list of documents containing it, which makes query lookup efficient. Cutts also covers the shift from the monthly 'Google dance' to daily incremental crawls for freshness, and the use of PageRank together with over 200 other factors in ranking, balancing authority against relevance. He closes by describing how a query is sent to hundreds of machines in parallel so that results come back in under half a second.

Takeaways

  • 🌐 Google's ranking and website evaluation process is comprehensive and involves crawling, indexing, and ranking.
  • 🕸️ Crawling the web is complex and involves determining the order of pages to crawl based on PageRank and reputation.
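  • 🔄 The old 'Google dance' came from a batch cycle: Google crawled for several weeks, indexed for about a week, then pushed the data out over about another week, so some data centers served old data while others served new.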
  • 📅 Google transitioned to daily crawling in 2003 with Update Fritz to keep the index more up-to-date.
  • 🔄 Incremental updates to the index mean Google can quickly find and incorporate new updates.
  • 📚 Indexing inverts the documents: for each word, Google records the list of documents in which that word appears.
  • 🔍 Document selection finds the documents that contain the query terms; ranking then uses PageRank plus over 200 other factors, such as word proximity.
  • 🏆 The goal of ranking is to find reputable documents that are also relevant to the search query.
  • 💻 Google's search process involves parallel processing across hundreds of machines to find the best match for a query.
  • 🕒 Google aims to return search results, including a useful snippet, in under half a second.
  • 📈 For those interested in search engine workings, Google offers resources and job opportunities to learn more.

Q & A

  • What are the three main aspects Matt Cutts mentions as crucial for being the world's best search engine?

    -Matt Cutts mentions that to be the world's best search engine, one must crawl the web comprehensively and deeply, index those pages, and then rank or serve those pages by returning the most relevant ones first.

  • How does Google determine the order in which it crawls web pages?

    -Google uses PageRank as the primary determinant for crawling order. Pages with more PageRank, meaning more reputable links from other sites, are more likely to be discovered and crawled earlier in the process.

  • What was the 'Google dance' and why was it a problem?

    -The 'Google dance' referred to the period when Google would crawl for several weeks, then index for about a week, and finally push the data out, which could take another week. This meant that the search results could be outdated, as it took a long time to refresh the entire index.

  • What significant update changed Google's crawling strategy?

    -In 2003, Google implemented an update called Update Fritz, which allowed them to crawl a significant chunk of the web every day, leading to a more incremental and up-to-date index.

  • How does Google ensure that its index remains fresh?

    -Google breaks the web into segments and refreshes one segment each night, then loops back around, so the main base index is only ever out of date by a bounded amount. Later improvements let Google usually find and index updates very quickly.

  • What is the difference between the main index and the supplemental index mentioned by Matt Cutts?

    -The main index contains fresh content that is crawled and refreshed more frequently, while the supplemental index contains a larger number of documents that are not refreshed as often.

  • How does Google's indexing process work?

    -Indexing involves taking the words in a document and recording in which documents each word appears. This reverses the order from document-centric to word-centric, allowing Google to quickly identify documents containing specific search terms.

  • What factors does Google consider when ranking search results?

    -Google uses PageRank as well as over 200 other factors in its rankings. These include the reputation of the document and the proximity of the search terms on the page, and together they determine the most relevant documents for a given query.

  • How does Google handle a search query?

    -When a user types in a query, Google sends the request to hundreds of machines that search through their fraction of the indexed web. These machines return potential matches, and Google then determines the best page to display, often in under half a second.

  • What is the role of snippets in Google search results?

    -Snippets provide context for the search terms within the document, helping users understand why a particular page is relevant to their query and improving the user experience.

  • How can someone learn more about how search engines work?

    -Matt Cutts suggests that interested individuals can read academic papers and articles about Google, PageRank, and search engine operations. Additionally, he mentions that job opportunities at Google could provide deeper insights into search engine mechanics.

Outlines

00:00

📊 Introduction to Google's Ranking and Evaluation System

Matt Cutts introduces a broad question from Robert in Munich about Google's ranking and website evaluation process. The question covers Google's approach to crawling, indexing, and ranking sites. Matt explains that it's a very expansive topic, touching on aspects he's discussed for hours with new Google engineers. He provides a general overview of how Google handles crawling, indexing, and serving results, emphasizing that there are three main objectives: crawling the web deeply, indexing the pages, and ranking them effectively.

05:03

🔍 The Challenge of Web Crawling and Google's Early Struggles

Matt discusses the complexity of crawling the web and reflects on the challenges Google faced early on. Around 2000, when he joined, Google went three or four months without managing to crawl the web and had to set up a 'war room.' He explains that PageRank is the primary determinant of crawl order, so a crawl in strict PageRank order would reach high-PageRank sites like CNN and The New York Times first. Google originally worked in a batch cycle: it crawled for several weeks, indexed for about a week, and then pushed the data out over about another week. The 'Google dance' was the result, with some data centers serving old data and others serving new data.

⚙️ Google’s Shift to Incremental Crawling and the Introduction of Update Fritz

Matt explains how in 2003, as part of an update called 'Update Fritz,' Google switched to incremental crawling. Instead of waiting for a full 30-day cycle to finish, Google began crawling and refreshing a significant chunk of the web every day, so the main index was only ever out of date by a bounded amount. He also touches on the supplemental index, a layer of documents that was not refreshed as often but contained many more documents. Over time Google has become even better at this and can now usually find updates very quickly.

📂 How Google Indexes and Structures Web Data

In this section, Matt describes the indexing process in more detail, using an example query for 'Katy Perry.' Indexing inverts the data: instead of storing each document as a sequence of words, Google stores, for each word, the list of documents in which that word appears. For instance, Google keeps one list of documents containing 'Katy' and another list for 'Perry', then intersects the two lists to find the documents that contain both words. This is how Google begins identifying candidate documents for a given search query.

🏆 Document Selection and Ranking Process

Once documents are selected, Google uses over 200 factors to rank them. PageRank is one important signal, but Google also looks at proximity (e.g., how close 'Katy' and 'Perry' appear together on a page), the document’s authority, and other criteria to determine relevance. Matt notes that combining these factors is part of Google's 'secret sauce,' allowing them to return the most relevant and authoritative results for a given query.

💻 Google's Parallel Processing and Speed in Returning Results

Matt concludes by explaining how Google processes search queries. When a user submits a query, it’s sent to multiple machines simultaneously, each responsible for a portion of the web index. These machines work together to find the best matching documents. Google’s system ranks the results, generates a useful snippet showing the context of the keywords, and returns the best result—all in under half a second. He briefly touches on academic resources and job opportunities at Google for those interested in learning more about how search engines function.

Keywords

💡Crawling

Crawling refers to the process by which search engines like Google discover and retrieve web pages to be indexed. It's a fundamental part of the search engine's operation, ensuring that new and updated content is found and added to the search engine's database. In the script, Matt Cutts discusses the evolution of Google's crawling process, noting the shift from crawling the entire web every 30 days to a more dynamic system where significant portions of the web are crawled daily, allowing for more frequent updates to the index.
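
The crawl ordering described in the script can be sketched as a priority queue. This is a minimal illustration, not Google's crawler: the `scores` dictionary is a hypothetical stand-in for PageRank, the `links` dictionary stands in for fetching and parsing real pages, and the `.example` hostnames are made up.

```python
import heapq

def crawl_order(scores, links, seeds, limit):
    """Return URLs in the order a reputation-driven crawler would fetch them.

    scores: hypothetical reputation estimate per URL (stand-in for PageRank)
    links:  URL -> list of URLs it links to (stand-in for fetching a page)
    """
    # heapq is a min-heap, so negate the score to pop the highest score first.
    frontier = [(-scores.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        order.append(url)
        # Newly discovered links join the frontier with their own priority.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-scores.get(target, 0.0), target))
    return order

scores = {"news.example": 0.9, "wiki.example": 0.7,
          "blog.example": 0.4, "tiny.example": 0.1}
links = {"news.example": ["tiny.example", "wiki.example"],
         "wiki.example": ["blog.example"]}
order = crawl_order(scores, links, seeds=["news.example"], limit=10)
print(order)
```

Higher-scored pages leave the frontier first, which mirrors the point that well-linked pages tend to be discovered relatively early in the crawl.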

💡PageRank

PageRank is Google's link-analysis algorithm for estimating the importance of web pages. It scores a page by the number and quality of links pointing to it: the more pages that link to a page, and the more reputable those linking pages are, the higher its PageRank. The script describes PageRank as the primary determinant of crawl order, with higher-PageRank pages more likely to be discovered early in the crawl, and as one of the signals used in ranking.
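
The idea that a page's score depends on how many pages link to it and how reputable they are can be sketched with the textbook power-iteration form of PageRank. This follows the commonly published formulation with a damping factor of 0.85, not whatever Google runs in production; the three-page graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links: page -> list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# "c" is linked from both "a" and "b"; "b" has no inbound links.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

In this toy graph "c" ends up with the highest score and "b" with the lowest, since nothing links to "b".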

💡Indexing

Indexing is the process of organizing and storing web pages in a search engine's database for quick retrieval. It involves cataloging the words and phrases found on web pages so that when a user enters a search query, the search engine can efficiently find and rank relevant pages. In the script, Cutts explains indexing by describing how words are mapped to the documents they appear in, allowing Google to match search queries to the most relevant documents.
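
The word-to-documents mapping can be sketched as a small inverted index. This is a simplified illustration under stated assumptions: tokenization is plain whitespace splitting with lowercasing, and the document IDs and texts are made up.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Sorted posting lists make later intersection straightforward.
    return {word: sorted(ids) for word, ids in postings.items()}

docs = {
    1: "Katy wrote a song",
    2: "Katy Perry tour dates",
    8: "Perry the platypus",
}
index = build_inverted_index(docs)
print(index["katy"], index["perry"])
```

Each entry answers the question the script poses: which documents does this word appear in?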

💡Relevance

Relevance in the context of search engines refers to the extent to which the content of a web page matches what a user is searching for. The goal of Google's ranking system is to return the most relevant results at the top of the search results page. The script discusses how Google balances factors like PageRank and word proximity to determine the relevance of a page to a particular search query.

💡Supplemental Index

The supplemental index was a part of Google's index containing documents that were crawled and refreshed less often than those in the main index. The script describes it as a feature of 'the old days': a layer that held many more documents than the fresher main index, trading freshness for coverage.

💡Update Fritz

Update Fritz refers to a specific update to Google's crawling and indexing system that occurred in 2003. This update allowed Google to crawl and refresh parts of the web on a daily basis, rather than every 30 days. The script highlights Update Fritz as a significant improvement in Google's ability to provide fresh and up-to-date search results.
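
The segment-by-segment refresh can be sketched as follows. The script only says to imagine breaking the web into segments and refreshing one each night; assigning URLs to segments by hashing is an assumption made for this illustration, and the function names and URLs are hypothetical.

```python
import hashlib

def segment_of(url, num_segments):
    """Assign a URL to a segment using a stable hash (an assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments

def urls_to_refresh(urls, night, num_segments):
    """Return the URLs whose segment is due on the given night."""
    due = night % num_segments
    return [url for url in urls if segment_of(url, num_segments) == due]

urls = [f"https://example.com/page{i}" for i in range(20)]
# Over four consecutive nights with four segments, every URL is refreshed once.
refreshed = [url for night in range(4) for url in urls_to_refresh(urls, night, 4)]
print(len(refreshed))
```

With N segments, no page in the index is more than N nights stale. That is the bounded staleness the script describes when it says the index "would only be so out of date."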

💡Document Selection

Document selection is the step in which a search engine identifies the indexed documents that could match a query, before it ranks them. The script describes it as finding the documents believed to contain the search terms, either on the page itself or in the anchor text of links pointing to that page.
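
The 'Katy Perry' walkthrough from the script can be reproduced with a two-pointer intersection of sorted posting lists. The document numbers are the ones Matt gives in the video; the merge-style intersection is a standard technique shown for illustration, not a claim about Google's implementation.

```python
def intersect(first, second):
    """Intersect two sorted posting lists with a two-pointer merge."""
    i = j = 0
    matches = []
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            matches.append(first[i])
            i += 1
            j += 1
        elif first[i] < second[j]:
            i += 1  # first[i] cannot appear later in `second`
        else:
            j += 1  # second[j] cannot appear later in `first`
    return matches

katy = [1, 2, 89, 555, 789]      # documents containing "Katy"
perry = [2, 8, 73, 555, 1000]    # documents containing "Perry"
candidates = intersect(katy, perry)
print(candidates)
```

Only documents 2 and 555 survive, matching the walkthrough in the video. Every other document is missing one of the two words.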

💡Ranking Signals

Ranking signals are the various factors that search engines use to determine the order in which search results are displayed. These can include factors like PageRank, keyword usage, link quality, and user behavior. The script mentions over 200 different ranking signals that Google uses to rank search results, emphasizing the complexity of the algorithms that determine search engine rankings.
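
How Google combines its signals is not disclosed (the script calls it the 'secret sauce'), so the following is purely a toy illustration of the balancing act. It assumes a weighted sum of two made-up signals with invented weights and scores.

```python
def combined_score(signals, weights):
    """Weighted sum of signal values; missing signals count as zero."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())

def rank_documents(docs, weights):
    """Return document IDs ordered from highest to lowest combined score."""
    return sorted(docs, key=lambda doc_id: combined_score(docs[doc_id], weights),
                  reverse=True)

# Hypothetical weights and signal values, chosen only for this example.
WEIGHTS = {"reputation": 0.5, "proximity": 0.5}
docs = {
    "authoritative_but_scattered": {"reputation": 0.9, "proximity": 0.1},
    "reputable_and_adjacent": {"reputation": 0.7, "proximity": 1.0},
}
ranking = rank_documents(docs, WEIGHTS)
print(ranking)
```

The second document wins here because its high proximity outweighs its slightly lower reputation. The script makes the same trade-off with its two 'Katy Perry' pages.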

💡Proximity

Proximity in search engine optimization refers to the closeness of keywords to each other on a web page. Search engines like Google may consider the proximity of keywords when determining the relevance of a page to a search query. The script uses the example of the search term 'Katy Perry' to illustrate how the proximity of the words 'Katy' and 'Perry' on a web page can affect its ranking for that query.
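
One simple way to quantify proximity is the smallest gap in word positions between two terms. This is a sketch under stated assumptions: words are split on whitespace and lowercased, punctuation is not handled, and the function name is hypothetical.

```python
def min_distance(text, first, second):
    """Smallest gap in word positions between two terms, or None if either is absent."""
    words = text.lower().split()
    first_positions = [i for i, word in enumerate(words) if word == first]
    second_positions = [i for i, word in enumerate(words) if word == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(i - j) for i in first_positions for j in second_positions)

adjacent = min_distance("Katy Perry announces tour", "katy", "perry")
scattered = min_distance("Katy met a man named Perry", "katy", "perry")
print(adjacent, scattered)
```

A distance of 1 means the words sit right next to each other. Larger values mean the page merely happens to contain both words somewhere.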

💡Snippet

A snippet is a brief summary or excerpt from a web page that is displayed in the search results, often containing the keywords from the user's search query. The purpose of a snippet is to give users a preview of the page's content and help them determine if it's relevant to their search. The script mentions snippets as part of Google's process for presenting search results, aiming to show the most useful context from the page.
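
Showing keywords in context can be sketched as a window of words around the first matching term. The window size, the ellipsis markers, and the function name are assumptions for this illustration; real snippet generation is considerably more involved.

```python
def make_snippet(text, terms, radius=3):
    """Return a few words of context around the first query term found in text."""
    words = text.split()
    wanted = {term.lower() for term in terms}
    for i, word in enumerate(words):
        # Ignore trailing punctuation when comparing against the query terms.
        if word.lower().strip(".,!?") in wanted:
            start = max(0, i - radius)
            end = min(len(words), i + radius + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return ""

text = "The singer Katy Perry announced new tour dates for next summer."
snippet = make_snippet(text, ["katy", "perry"], radius=2)
print(snippet)
```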

💡Parallelization

Parallelization is the process of dividing a large computational task into smaller, independent tasks that can be executed simultaneously. In the context of the script, parallelization refers to how Google distributes search queries across hundreds of machines to process and return results quickly. This approach allows Google to handle the massive volume of search queries it receives while maintaining fast response times.
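
The fan-out-and-merge pattern can be sketched with a thread pool, where each thread stands in for one machine holding a fraction of the index. The shard layout, the reputation values, and the scoring by reputation alone are assumptions for this illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query_terms):
    """Find matches in one shard; a shard maps doc_id -> (text, reputation)."""
    hits = []
    for doc_id, (text, reputation) in shard.items():
        words = set(text.lower().split())
        if all(term in words for term in query_terms):
            hits.append((reputation, doc_id))
    return hits

def search(shards, query, top_k=3):
    """Send the query to every shard at once, then merge the best matches."""
    terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda shard: search_shard(shard, terms), shards))
    merged = [hit for hits in per_shard for hit in hits]
    return [doc_id for _, doc_id in heapq.nlargest(top_k, merged)]

shards = [
    {1: ("katy perry tour", 0.4), 2: ("katy sings", 0.9)},
    {3: ("katy perry official site", 0.8)},
]
results = search(shards, "Katy Perry")
print(results)
```

Because each shard looks only at its own small slice, the total work is divided across machines and the slowest shard sets the response time.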

Highlights

Google's ranking and website evaluation process involves crawling, indexing, and serving the most relevant pages.

Crawling the web comprehensively and deeply is crucial for a search engine.

PageRank is used as the primary determinant for crawling and discovering pages.

High PageRank sites are discovered early in the crawl process.

The Google dance was the period when newly indexed data was being pushed out, so that some data centers returned old results and others returned new ones.

Update Fritz in 2003 allowed Google to crawl and refresh parts of the web daily.

Incremental updating of the index ensures fresh content.

The supplemental index held a larger set of documents that were not refreshed as often.

Indexing inverts the data: instead of each document listing its words in order, each word lists the documents it appears in.

Document selection is the process of finding documents that match search queries.

Ranking involves balancing PageRank with over 200 other factors.

Proximity of search terms and reputation of documents are considered in ranking.

Google aims to find authoritative documents that are relevant to user queries.

Google's infrastructure allows for massive parallelization to serve search results quickly.

Search results are returned in under half a second.

Snippets provide context for search results, showing keywords within the document.

For more information on Google's search engine workings, there are articles and academic papers available.

Interested individuals can apply to Google for jobs to learn more about search engine operations.

Transcripts

play00:00

MATT CUTTS: Hi, everybody.

play00:01

We got a really interesting and very expansive question

play00:04

from RobertvH in Munich.

play00:06

RobertvH wants to know--

play00:09

Hi Matt, could you please explain how Google's ranking

play00:12

and website evaluation process works starting with the

play00:14

crawling and analysis of a site, crawling time lines,

play00:18

frequencies, priorities, indexing and filtering

play00:21

processes within the databases, et cetera?

play00:25

OK.

play00:25

So that's basically just like, tell me

play00:27

everything about Google.

play00:28

Right?

play00:29

That's a really expansive question.

play00:30

It covers a lot of different ground.

play00:32

And in fact, I have given orientation lectures to

play00:35

engineers when they come in.

play00:37

And I can talk for an hour about all those different

play00:40

topics, and even talk for an hour about a very small subset

play00:43

of those topics.

play00:45

So let me talk for a while and see how much of a feel I can

play00:48

give you for how the Google infrastructure works, how it

play00:51

all fits together, how our crawling and indexing and

play00:53

serving pipeline works.

play00:55

Let's dive right in.

play00:57

So there's three things that you really want to do well if

play00:59

you want to be the world's best search engine.

play01:01

You want to crawl the web comprehensively and deeply.

play01:03

You want to index those pages.

play01:05

And then you want to rank or serve those pages and return

play01:08

the most relevant ones first.

play01:10

Crawling is actually more difficult

play01:11

than you might think.

play01:13

Whenever Google started, whenever I joined back in

play01:16

2000, we didn't manage to crawl the web for something

play01:18

like three or four months.

play01:20

And we had to have a war room.

play01:22

But a good way to think about the mental model is we

play01:25

basically take page rank as the primary determinant.

play01:28

And the more page rank you have-- that is, the more

play01:31

people who link to you and the more reputable those people

play01:34

are-- the more likely it is we're going to discover your

play01:37

page relatively early in the crawl.

play01:39

In fact, you could imagine crawling in strict page rank

play01:41

order, and you'd get the CNNs of the world and The New York

play01:45

Times of the world and really very high page rank sites.

play01:49

And if you think about how things used to be, we used to

play01:51

crawl for 30 days.

play01:53

So we'd crawl for several weeks.

play01:56

And then we would index for about a week.

play01:59

And then we would push that data out.

play02:01

And that would take about a week.

play02:04

And so that was what the Google dance was.

play02:05

Sometimes you'd hit one data center that had old data.

play02:07

And sometimes you'd hit a data center that had new data.

play02:10

Now there's various interesting tricks

play02:13

that you can do.

play02:13

For example, after you've crawled for 30 days, you can

play02:16

imagine recrawling the high page rank guys so you can see

play02:19

if there's anything new or important that's hit on the

play02:21

CNN home page.

play02:22

But for the most part, this is not fantastic.

play02:25

Right?

play02:25

Because if you're trying to crawl the web and it takes you

play02:28

30 days, you're going to be out-of-date.

play02:30

So eventually, in 2003, I believe, we switched as part

play02:36

of an update called Update Fritz to crawling a fairly

play02:40

interesting significant chunk of the web every day.

play02:43

And so if you imagine breaking the web into a certain number

play02:47

of segments, you could imagine crawling that part of the web

play02:51

and refreshing it every night.

play02:53

And so at any given point, your main base index would

play02:58

only be so out of date.

play03:00

Because then you'd loop back around and you'd refresh that.

play03:03

And that works very, very well.

play03:04

Instead of waiting for everything to finish, you're

play03:06

incrementally updating your index.

play03:08

And we've gotten even better over time.

play03:10

So at this point, we can get very, very fresh.

play03:14

Any time we see updates, we can usually

play03:16

find them very quickly.

play03:18

And in the old days, you would have not just a main or a base

play03:20

index, but you could have what were called supplemental

play03:24

results, or the supplemental index.

play03:26

And that was something that we wouldn't crawl and refresh

play03:28

quite as often.

play03:29

But it was a lot more documents.

play03:31

And so you could almost imagine having really fresh

play03:35

content, a layer of our main index, and then more documents

play03:40

that are not refreshed quite as often, but there's a lot

play03:42

more of them.

play03:43

So that's just a little bit about the crawl and how to

play03:45

crawl comprehensively.

play03:47

What you do then is you pass things around.

play03:49

And you basically say, OK, I have crawled a large fraction

play03:53

of the web.

play03:54

And within that web you have, for example, one document.

play03:58

And indexing is basically taking things in word order.

play04:04

Well, let's just work through an example.

play04:06

Suppose you say Katy Perry.

play04:10

In a document, Katy Perry appears right

play04:13

next to each other.

play04:14

But what you want in an index is which documents does the

play04:18

word Katy appear in, and which documents does the word

play04:20

Perry appear in?

play04:22

So you might say Katy appears in documents 1, and 2, and 89,

play04:26

and 555, and 789.

play04:32

And Perry might appear in documents number 2, and 8, and

play04:37

73, and 555, and 1,000.

play04:42

And so the whole process of doing the index is reversing,

play04:47

so that instead of having the documents in word order, you

play04:50

have the words, and they have it in document order.

play04:53

So it's, OK, these are all the documents that a

play04:54

word appears in.

play04:56

Now when someone comes to Google and they type in Katy

play04:59

Perry, you want to say, OK, what documents might match

play05:02

Katy Perry?

play05:03

Well, document one has Katy, but it doesn't have Perry.

play05:06

So it's out.

play05:08

Document number two has both Katy and Perry, so that's a

play05:11

possibility.

play05:12

Document eight has Perry but not Katy.

play05:15

89 and 73 are out because they don't have the right

play05:18

combination of words.

play05:19

555 has both Katy and Perry.

play05:22

And then these two are also out.

play05:25

And so when someone comes to Google and they type in

play05:27

Chicken Little, Britney Spears, Matt Cutts, Katy

play05:29

Perry, whatever it is, we find the documents that we believe

play05:32

have those words, either on the page or maybe in back

play05:35

links, in anchor text pointing to that document.

play05:38

Once you've done what's called document selection, you try to

play05:41

figure out, how should you rank those?

play05:43

And that's really tricky.

play05:44

We use page rank as well as over 200 other factors in our

play05:49

rankings to try to say, OK, maybe this document is really

play05:52

authoritative.

play05:53

It has a lot of reputation because it has

play05:55

a lot of page rank.

play05:56

But it only has the word Perry once.

play05:58

And it just happens to have the word Katy somewhere else

play06:01

on the page.

play06:02

Whereas here is a document that has the word Katy and

play06:04

Perry right next to each other, so there's proximity.

play06:07

And it's got a lot of reputation.

play06:09

It's got a lot of links pointing to it.

play06:12

So we try to balance that off.

play06:13

You want to find reputable documents that are also about

play06:16

what the user typed in.

play06:18

And that's kind of the secret sauce, trying to figure out a

play06:20

way to combine those 200 different ranking signals in

play06:23

order to find the most relevant document.

play06:25

So at any given time, hundreds of millions of times a day,

play06:30

someone comes to Google.

play06:32

We try to find the closest data center to them.

play06:34

They type in something like Katy Perry.

play06:36

We send that query out to hundreds of different machines

play06:38

all at once, which look through their little tiny

play06:41

fraction of the web that we've indexed.

play06:43

And we find, OK, these are the documents that

play06:45

we think best match.

play06:47

All those machines return their matches.

play06:49

And we say, OK, what's the creme de la creme?

play06:52

What's the needle in the haystack?

play06:53

What's the best page that matches this query across our

play06:56

entire index?

play06:57

And then we take that page and we try to show it with a

play07:00

useful snippet.

play07:01

So you show the key words in the context of the document.

play07:03

And you get it all back in under half a second.

play07:06

So that's probably about as long as we can go on without

play07:10

straining YouTube.

play07:11

But that just gives you a little bit of a feel about how

play07:13

the crawling system works, how we index documents, how things

play07:16

get returned in under half a second through that massive

play07:19

parallelization.

play07:20

I hope that helps.

play07:21

And if you want to know more, there's a whole bunch of

play07:23

articles and academic papers about Google, and page rank,

play07:26

and how Google works.

play07:28

But you can also apply to--

play07:30

there's [email protected], I think, or google.com/jobs, if

play07:34

you're interested in learning a lot more about how search

play07:36

engines work.

play07:37

OK.

play07:37

Thanks very much.
