How does Google Search work?
Summary
TL;DR: In this video, Matt Cutts from Google explains how Google's search engine works, focusing on crawling, indexing, and ranking. He details how Google crawls the web comprehensively, using PageRank as the primary determinant of crawl priority. The indexing process inverts documents into per-word lists of documents, allowing for efficient search queries. Cutts also covers the evolution from the 'Google dance' to daily crawls for freshness, and the use of over 200 ranking factors, emphasizing the balance between authority and relevance. The video offers insight into Google's infrastructure and the speed at which it processes searches: results come back in under half a second.
Takeaways
- 🌐 Google's ranking and website evaluation process is comprehensive and involves crawling, indexing, and ranking.
- 🕸️ Crawling the web is complex and involves determining the order of pages to crawl based on PageRank and reputation.
- 🔄 The old Google dance was a result of the crawling and indexing process taking approximately 30 days.
- 📅 Google transitioned to daily crawling in 2003 with Update Fritz to keep the index more up-to-date.
- 🔄 Incremental updates to the index mean Google can quickly find and incorporate new updates.
- 📚 Indexing involves taking the words from documents and creating, for each word, a list of the documents it appears in.
- 🔍 Document selection and ranking involve using over 200 factors, including PageRank and word proximity.
- 🏆 The goal of ranking is to find reputable documents that are also relevant to the search query.
- 💻 Google's search process involves parallel processing across hundreds of machines to find the best match for a query.
- 🕒 Google aims to return search results, including a useful snippet, in under half a second.
- 📈 For those interested in search engine workings, Google offers resources and job opportunities to learn more.
Q & A
What are the three main aspects Matt Cutts mentions as crucial for being the world's best search engine?
-Matt Cutts mentions that to be the world's best search engine, one must crawl the web comprehensively and deeply, index those pages, and then rank or serve those pages by returning the most relevant ones first.
How does Google determine the order in which it crawls web pages?
-Google uses PageRank as the primary determinant for crawling order. Pages with more PageRank, meaning more reputable links from other sites, are more likely to be discovered and crawled earlier in the process.
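The crawl-ordering idea described above can be sketched as a priority queue keyed on PageRank. The URLs and scores below are invented for illustration; this is a toy model of the scheduling concept, not Google's actual crawler.

```python
import heapq

def crawl_order(frontier):
    """Return URLs in descending PageRank order, the order in which a
    reputation-prioritizing crawler would visit them."""
    # heapq is a min-heap, so negate scores to pop the highest first.
    heap = [(-score, url) for url, score in frontier.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

# Hypothetical frontier with made-up PageRank-like scores.
frontier = {
    "cnn.com": 0.91,
    "nytimes.com": 0.89,
    "smallblog.example": 0.02,
}
print(crawl_order(frontier))  # high-reputation sites come out first
```

Crawling in strict score order is exactly how the video describes high-PageRank sites like CNN being discovered early.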
What was the 'Google dance' and why was it a problem?
-The 'Google dance' referred to the period when Google would crawl for several weeks, then index for about a week, and finally push the data out, which could take another week. This meant that the search results could be outdated, as it took a long time to refresh the entire index.
What significant update changed Google's crawling strategy?
-In 2003, Google implemented an update called Update Fritz, which allowed them to crawl a significant chunk of the web every day, leading to a more incremental and up-to-date index.
How does Google ensure that its index remains fresh?
-Google breaks the web into segments and refreshes each segment every night, ensuring that the main base index is not significantly out of date. This strategy allows Google to quickly find and index updates.
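The segment-refresh strategy can be modeled as a simple round-robin schedule: one segment is recrawled each night, so no segment is ever more than one full cycle out of date. The segment counts below are arbitrary; this is only a sketch of the scheduling idea.

```python
from collections import deque

def refresh_schedule(num_segments, days):
    """Simulate refreshing one index segment per night, round-robin.
    Returns, for each day, which segment was recrawled."""
    queue = deque(range(num_segments))
    schedule = []
    for _ in range(days):
        seg = queue.popleft()
        schedule.append(seg)
        queue.append(seg)  # back of the line until its next refresh
    return schedule

# With 4 segments, day 4 wraps back to segment 0, so no segment
# is ever more than 3 days stale.
print(refresh_schedule(4, 6))  # [0, 1, 2, 3, 0, 1]
```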
What is the difference between the main index and the supplemental index mentioned by Matt Cutts?
-The main index contains fresh content that is crawled and refreshed more frequently, while the supplemental index contains a larger number of documents that are not refreshed as often.
How does Google's indexing process work?
-Indexing involves taking the words in a document and recording in which documents each word appears. This reverses the order from document-centric to word-centric, allowing Google to quickly identify documents containing specific search terms.
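The word-centric inversion described here is the classic inverted index. A minimal sketch, using made-up documents and the doc-id style from the video's own example:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Invert a doc-id -> text mapping into word -> sorted doc-id list."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sorted posting lists make later merging/intersection efficient.
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    1: "Katy writes songs",
    2: "Katy Perry on tour",
    8: "Perry county news",
}
index = build_inverted_index(docs)
print(index["katy"])   # [1, 2]
print(index["perry"])  # [2, 8]
```

Given a query, the engine only needs to look up each query word's posting list rather than scanning every document.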
What factors does Google consider when ranking search results?
-Google uses over 200 factors in its rankings, including PageRank and the reputation of the document, as well as the proximity of search terms on the page, to determine the most relevant documents for a given query.
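Combining many signals into one score can be sketched as a weighted sum. The signal names, weights, and values below are entirely invented for illustration; Google's actual combination of its 200+ factors is not public.

```python
def score(doc, weights):
    """Toy weighted combination of ranking signals.
    The weights here are hypothetical, not Google's formula."""
    return sum(weights[name] * doc.get(name, 0.0) for name in weights)

weights = {"pagerank": 0.5, "proximity": 0.3, "anchor_text": 0.2}
docs = [
    {"id": "A", "pagerank": 0.9, "proximity": 0.1, "anchor_text": 0.2},
    {"id": "B", "pagerank": 0.4, "proximity": 0.9, "anchor_text": 0.6},
]
# Document A is more authoritative, but B is more on-topic;
# the combined score balances the two, as the video describes.
ranked = sorted(docs, key=lambda d: score(d, weights), reverse=True)
print([d["id"] for d in ranked])
```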
How does Google handle a search query?
-When a user types in a query, Google sends the request to hundreds of machines that search through their fraction of the indexed web. These machines return potential matches, and Google then determines the best page to display, often in under half a second.
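The fan-out to many machines, each searching its own fraction of the index, can be sketched with sharded data and a thread pool. The shards and doc ids below are toy values; a real system would fan out over the network, not threads.

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query_terms):
    """Each worker scans only its own fraction of the index."""
    return [doc for doc, words in shard.items()
            if all(t in words for t in query_terms)]

# Two hypothetical index shards: doc id -> set of words on that page.
shards = [
    {1: {"katy"}, 2: {"katy", "perry"}},
    {8: {"perry"}, 555: {"katy", "perry"}},
]
query = {"katy", "perry"}
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(lambda s: search_shard(s, query), shards))
# Merge the per-shard matches into one candidate set.
matches = sorted(doc for part in partials for doc in part)
print(matches)  # [2, 555]
```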
What is the role of snippets in Google search results?
-Snippets provide context for the search terms within the document, helping users understand why a particular page is relevant to their query and improving the user experience.
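Snippet generation can be approximated by extracting a text window around the query-term hits. This is a rough sketch of the idea only; the `radius` parameter and the ellipsis formatting are arbitrary choices, not how Google actually builds snippets.

```python
def make_snippet(text, terms, radius=20):
    """Return a window of text around the query-term hits, roughly how
    a result snippet shows keywords in context."""
    lower = text.lower()
    positions = [lower.find(t.lower()) for t in terms]
    positions = [p for p in positions if p >= 0]
    if not positions:
        # No hits: fall back to the start of the document.
        return text[: 2 * radius]
    start = max(min(positions) - radius, 0)
    end = min(max(positions) + radius, len(text))
    return "..." + text[start:end] + "..."

text = "The singer Katy Perry released a new album this year"
print(make_snippet(text, ["Katy", "Perry"]))
```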
How can someone learn more about how search engines work?
-Matt Cutts suggests that interested individuals can read academic papers and articles about Google, PageRank, and search engine operations. Additionally, he mentions that job opportunities at Google could provide deeper insights into search engine mechanics.
Outlines
📊 Introduction to Google's Ranking and Evaluation System
Matt Cutts introduces a broad question from Robert in Munich about Google's ranking and website evaluation process. The question covers Google's approach to crawling, indexing, and ranking sites. Matt explains that it's a very expansive topic, touching on aspects he's discussed for hours with new Google engineers. He provides a general overview of how Google handles crawling, indexing, and serving results, emphasizing that there are three main objectives: crawling the web deeply, indexing the pages, and ranking them effectively.
🔍 The Challenge of Web Crawling and Google's Early Struggles
Matt discusses the complexity of crawling the web and reflects on the challenges Google faced when it first started. In the early 2000s, Google could only manage to crawl the web after months of effort, with issues requiring a 'war room' approach. He explains that PageRank was used as a primary method to determine which pages to crawl first, starting with highly ranked pages like CNN and The New York Times. Google initially had a 30-day crawl cycle, where they would crawl for weeks, then index, and finally push the data out—a process known as the 'Google Dance.'
⚙️ Google’s Shift to Incremental Crawling and the Introduction of Update Fritz
Matt explains how in 2003, Google switched to a more efficient, incremental crawl system known as 'Update Fritz.' Instead of waiting for a full 30-day cycle to finish, Google began refreshing a segment of the web every day, allowing the index to be continuously updated. This approach made Google's data more up-to-date. He also touches on the existence of the supplemental index, a layer of documents that weren't crawled as frequently but still held a significant amount of data. Over time, Google's ability to crawl and update the web in real-time improved dramatically.
📂 How Google Indexes and Structures Web Data
In this section, Matt describes the indexing process in more detail, using an example query for 'Katy Perry.' Indexing inverts the relationship between documents and words: instead of storing each document as a sequence of words, Google stores, for each word, the list of documents it appears in. For instance, Google tracks all the documents containing the word 'Katy' and all those containing 'Perry' separately, then cross-references the two lists to find documents containing both words. This is how Google begins identifying relevant documents for a given search query.
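The cross-referencing step can be sketched as a merge-intersection of two sorted posting lists, using the very doc ids Matt cites in the transcript:

```python
def intersect(postings_a, postings_b):
    """Merge two sorted posting lists, keeping doc ids present in both."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# The posting lists from the video's example:
katy  = [1, 2, 89, 555, 789]
perry = [2, 8, 73, 555, 1000]
print(intersect(katy, perry))  # [2, 555]
```

Because both lists are sorted, the merge runs in linear time in the combined list length, which is why inverted indexes keep postings in document order.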
🏆 Document Selection and Ranking Process
Once documents are selected, Google uses over 200 factors to rank them. PageRank is one important signal, but Google also looks at proximity (e.g., how close 'Katy' and 'Perry' appear together on a page), the document’s authority, and other criteria to determine relevance. Matt notes that combining these factors is part of Google's 'secret sauce,' allowing them to return the most relevant and authoritative results for a given query.
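The proximity signal mentioned above can be sketched as the smallest gap between any occurrences of two terms on a page. The word positions below are made up; real proximity scoring is certainly more elaborate.

```python
def min_term_distance(positions_a, positions_b):
    """Smallest gap between any occurrence of two terms on a page.
    A gap of 1 means the terms are adjacent, e.g. 'Katy Perry'."""
    best = float("inf")
    for a in positions_a:
        for b in positions_b:
            best = min(best, abs(a - b))
    return best

# Hypothetical word positions: 'Katy' at [4, 40], 'Perry' at [5, 90].
print(min_term_distance([4, 40], [5, 90]))  # 1 -> adjacent, strong signal
```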
💻 Google's Parallel Processing and Speed in Returning Results
Matt concludes by explaining how Google processes search queries. When a user submits a query, it’s sent to multiple machines simultaneously, each responsible for a portion of the web index. These machines work together to find the best matching documents. Google’s system ranks the results, generates a useful snippet showing the context of the keywords, and returns the best result—all in under half a second. He briefly touches on academic resources and job opportunities at Google for those interested in learning more about how search engines function.
Keywords
💡Crawling
💡PageRank
💡Indexing
💡Relevance
💡Supplemental Index
💡Update Fritz
💡Document Selection
💡Ranking Signals
💡Proximity
💡Snippet
💡Parallelization
Highlights
Google's ranking and website evaluation process involves crawling, indexing, and serving the most relevant pages.
Crawling the web comprehensively and deeply is crucial for a search engine.
PageRank is used as the primary determinant for crawling and discovering pages.
High PageRank sites are discovered early in the crawl process.
The Google dance was a period where Google's index was updated, causing fluctuations in search results.
Update Fritz in 2003 allowed Google to crawl and refresh parts of the web daily.
Incremental updating of the index ensures fresh content.
Supplemental index was used for documents not refreshed as often.
Indexing organizes data in word order rather than document order: each word maps to the list of documents containing it.
Document selection is the process of finding documents that match search queries.
Ranking involves balancing PageRank with over 200 other factors.
Proximity of search terms and reputation of documents are considered in ranking.
Google aims to find authoritative documents that are relevant to user queries.
Google's infrastructure allows for massive parallelization to serve search results quickly.
Search results are returned in under half a second.
Snippets provide context for search results, showing keywords within the document.
For more information on Google's search engine workings, there are articles and academic papers available.
Interested individuals can apply to Google for jobs to learn more about search engine operations.
Transcripts
MATT CUTTS: Hi, everybody.
We got a really interesting and very expansive question
from RobertvH in Munich.
RobertvH wants to know--
Hi Matt, could you please explain how Google's ranking
and website evaluation process works starting with the
crawling and analysis of a site, crawling time lines,
frequencies, priorities, indexing and filtering
processes within the databases, et cetera?
OK.
So that's basically just like, tell me
everything about Google.
Right?
That's a really expansive question.
It covers a lot of different ground.
And in fact, I have given orientation lectures to
engineers when they come in.
And I can talk for an hour about all those different
topics, and even talk for an hour about a very small subset
of those topics.
So let me talk for a while and see how much of a feel I can
give you for how the Google infrastructure works, how it
all fits together, how our crawling and indexing and
serving pipeline works.
Let's dive right in.
So there's three things that you really want to do well if
you want to be the world's best search engine.
You want to crawl the web comprehensively and deeply.
You want to index those pages.
And then you want to rank or serve those pages and return
the most relevant ones first.
Crawling is actually more difficult
than you might think.
Whenever Google started, whenever I joined back in
2000, we didn't manage to crawl the web for something
like three or four months.
And we had to have a war room.
But a good way to think about the mental model is we
basically take page rank as the primary determinant.
And the more page rank you have-- that is, the more
people who link to you and the more reputable those people
are-- the more likely it is we're going to discover your
page relatively early in the crawl.
In fact, you could imagine crawling in strict page rank
order, and you'd get the CNNs of the world and The New York
Times of the world and really very high page rank sites.
And if you think about how things used to be, we used to
crawl for 30 days.
So we'd crawl for several weeks.
And then we would index for about a week.
And then we would push that data out.
And that would take about a week.
And so that was what the Google dance was.
Sometimes you'd hit one data center that had old data.
And sometimes you'd hit a data center that had new data.
Now there's various interesting tricks
that you can do.
For example, after you've crawled for 30 days, you can
imagine recrawling the high page rank guys so you can see
if there's anything new or important that's hit on the
CNN home page.
But for the most part, this is not fantastic.
Right?
Because if you're trying to crawl the web and it takes you
30 days, you're going to be out-of-date.
So eventually, in 2003, I believe, we switched as part
of an update called Update Fritz to crawling a fairly
interesting significant chunk of the web every day.
And so if you imagine breaking the web into a certain number
of segments, you could imagine crawling that part of the web
and refreshing it every night.
And so at any given point, your main base index would
only be so out of date.
Because then you'd loop back around and you'd refresh that.
And that works very, very well.
Instead of waiting for everything to finish, you're
incrementally updating your index.
And we've gotten even better over time.
So at this point, we can get very, very fresh.
Any time we see updates, we can usually
find them very quickly.
And in the old days, you would have not just a main or a base
index, but you could have what were called supplemental
results, or the supplemental index.
And that was something that we wouldn't crawl and refresh
quite as often.
But it was a lot more documents.
And so you could almost imagine having really fresh
content, a layer of our main index, and then more documents
that are not refreshed quite as often, but there's a lot
more of them.
So that's just a little bit about the crawl and how to
crawl comprehensively.
What you do then is you pass things around.
And you basically say, OK, I have crawled a large fraction
of the web.
And within that web you have, for example, one document.
And indexing is basically taking things in word order.
Well, let's just work through an example.
Suppose you say Katy Perry.
In a document, Katy Perry appears right
next to each other.
But what you want in an index is which documents does the
word Katy appear in, and which documents does the word
Perry appear in?
So you might say Katy appears in documents 1, and 2, and 89,
and 555, and 789.
And Perry might appear in documents number 2, and 8, and
73, and 555, and 1,000.
And so the whole process of doing the index is reversing,
so that instead of having the documents in word order, you
have the words, and they have it in document order.
So it's, OK, these are all the documents that a
word appears in.
Now when someone comes to Google and they type in Katy
Perry, you want to say, OK, what documents might match
Katy Perry?
Well, document one has Katy, but it doesn't have Perry.
So it's out.
Document number two has both Katy and Perry, so that's a
possibility.
Document eight has Perry but not Katy.
89 and 73 are out because they don't have the right
combination of words.
555 has both Katy and Perry.
And then these two are also out.
And so when someone comes to Google and they type in
Chicken Little, Britney Spears, Matt Cutts, Katy
Perry, whatever it is, we find the documents that we believe
have those words, either on the page or maybe in back
links, in anchor text pointing to that document.
Once you've done what's called document selection, you try to
figure out, how should you rank those?
And that's really tricky.
We use page rank as well as over 200 other factors in our
rankings to try to say, OK, maybe this document is really
authoritative.
It has a lot of reputation because it has
a lot of page rank.
But it only has the word Perry once.
And it just happens to have the word Katy somewhere else
on the page.
Whereas here is a document that has the word Katy and
Perry right next to each other, so there's proximity.
And it's got a lot of reputation.
It's got a lot of links pointing to it.
So we try to balance that off.
You want to find reputable documents that are also about
what the user typed in.
And that's kind of the secret sauce, trying to figure out a
way to combine those 200 different ranking signals in
order to find the most relevant document.
So at any given time, hundreds of millions of times a day,
someone comes to Google.
We try to find the closest data center to them.
They type in something like Katy Perry.
We send that query out to hundreds of different machines
all at once, which look through their little tiny
fraction of the web that we've indexed.
And we find, OK, these are the documents that
we think best match.
All those machines return their matches.
And we say, OK, what's the creme de la creme?
What's the needle in the haystack?
What's the best page that matches this query across our
entire index?
And then we take that page and we try to show it with a
useful snippet.
So you show the key words in the context of the document.
And you get it all back in under half a second.
So that's probably about as long as we can go on without
straining YouTube.
But that just gives you a little bit of a feel about how
the crawling system works, how we index documents, how things
get returned in under half a second through that massive
parallelization.
I hope that helps.
And if you want to know more, there's a whole bunch of
articles and academic papers about Google, and page rank,
and how Google works.
But you can also apply to--
there's [email protected], I think, or google.com/jobs, if
you're interested in learning a lot more about how search
engines work.
OK.
Thanks very much.