Design a Basic Search Engine (Google or Bing) | System Design Interview Prep

Interview Pen
29 Apr 2023 · 19:45

Summary

TL;DR: This video script outlines the process of building a scalable web search engine akin to Google. It covers the creation of an API to handle search queries, the necessity of a database to store web pages, and the use of a crawler to gather new URLs and HTML content. The script also delves into the importance of a URL Frontier for managing and prioritizing URLs to be crawled, ensuring politeness and avoiding overloading websites. The discussion includes scaling challenges, database sharding, and the use of a blob store for efficient data management.

Takeaways

  • 🔍 **Search Engine Basics**: A scalable web search engine requires an API to handle user queries, a database to store relevant site information, and a mechanism to display titles and descriptions of search results.
  • 🕷️ **Crawlers**: A crawler is essential for discovering and downloading web pages to populate the database, working by following links from one page to another.
  • 🌐 **Web Structure**: The internet is viewed as a vast network of interconnected pages, where each page can link to multiple others, facilitating the crawler's discovery process.
  • 🗄️ **Database Design**: The database must store URLs, site content, titles, descriptions, hashes for uniqueness, last updated dates, and scraping priorities.
  • 🔑 **Sharding**: To manage large datasets, sharding is used to distribute data across multiple nodes, with each node holding a subset of the data determined by a shard key.
  • 📚 **Blob Storage**: For efficiency, site content is stored in a separate blob store like Amazon S3, with only references kept in the main database.
  • 🔑 **Global Index**: A global index is used for quick lookup of data based on hashes, which helps in managing duplicates and ensuring data integrity.
  • 🔍 **Text Indexing**: Another sharded database is used for text indexing, allowing for efficient searching of words across web pages.
  • 🌐 **Load Balancer**: A load balancer is crucial for distributing user requests across multiple API instances to ensure scalability and performance.
  • 🚀 **Scalability**: The system is designed to scale horizontally, with the ability to add more nodes and infrastructure as the number of users and data grows.
  • 📈 **URL Frontier**: The URL Frontier is a complex system for managing and prioritizing URLs to be crawled, ensuring politeness and efficiency in the crawling process.

Q & A

  • What are the basic requirements for a scalable web search engine?

    -The basic requirements include the ability for users to enter a search query, the search engine's capability to find relevant sites, and the provision of a list with titles, descriptions, and URLs of each site.

  • What is the role of an API in a web search engine?

    -The API is responsible for handling user search requests, looking up relevant sites in a database, and returning results to the user.
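
As a rough illustration of that endpoint, here is a minimal sketch in Python, assuming an in-memory word-to-results index stands in for the real text index and that results arrive already ranked; the names are hypothetical, not from the video.

```python
from dataclasses import dataclass

PAGE_SIZE = 10

@dataclass
class PageResult:
    url: str
    title: str
    description: str

def search(index: dict[str, list[PageResult]], query: str, page: int = 1) -> list[PageResult]:
    """Return one page of results for a query from a word -> ranked-results index."""
    results = index.get(query.lower(), [])          # assumed to be ordered most-relevant first
    start = (page - 1) * PAGE_SIZE
    return results[start:start + PAGE_SIZE]
```

Pagination is handled by slicing the ranked list, mirroring the page-number parameter discussed later in the video.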

  • How does a crawler contribute to building a web search engine?

    -A crawler goes out to the internet, finds HTML pages, downloads them, and adds them to the database, ensuring the search engine has an up-to-date index of web pages.
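
The fetch-and-extract cycle can be sketched with the Python standard library alone. This is a toy single-page crawl; a real crawler would add robots.txt checks, retries, deduplication, and politeness delays.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url: str) -> tuple[str, list[str]]:
    """Download one page and return its HTML plus the absolute URLs it links to."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    return html, [urljoin(url, link) for link in parser.links]
```

Each returned link would be pushed into the URL Frontier rather than crawled immediately.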

  • Why is a database necessary for a web search engine?

    -A database is necessary to store the URLs, site content, titles, descriptions, hashes, last updated dates, and scraping priorities for all the web pages indexed by the search engine.

  • What is the purpose of a hash in the database?

    -The hash is used to identify unique pages on the internet, saving storage space and bandwidth by only including unique records.
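
A minimal sketch of that idea, assuming SHA-256 over lightly normalized content and an in-memory set standing in for the global hash index. Exact hashing only catches byte-identical pages; near-duplicates need techniques like shingling, mentioned later in the video.

```python
import hashlib

def content_hash(html: str) -> str:
    """Fingerprint a page by hashing its (lightly normalized) content."""
    normalized = " ".join(html.split())            # collapse whitespace before hashing
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen_hashes: set[str] = set()                      # stand-in for the global hash index

def is_duplicate(html: str) -> bool:
    """Return True if an identical page has already been stored."""
    h = content_hash(html)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
```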

  • How does a load balancer improve the scalability of a web search engine?

    -A load balancer distributes user requests across multiple API instances, ensuring that a large number of users can access the site simultaneously without overloading any single server.

  • What is a blob store and how does it relate to a web search engine's database?

    -A blob store is a storage system for large binary objects, like web page content. Instead of storing this content in the database, it's stored in the blob store, and the database holds a reference to it, improving efficiency.
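
A sketch of that split between blob store and metadata database, using in-memory dictionaries as stand-ins for S3 and the sharded database; the content-addressed key is an assumption made for illustration.

```python
import hashlib

blob_store: dict[str, bytes] = {}   # stand-in for S3 or another object store
metadata_db: dict[str, dict] = {}   # stand-in for the sharded metadata database

def store_page(url: str, title: str, description: str, content: bytes) -> None:
    """Put the heavy page content in the blob store; keep only a reference in the database."""
    blob_key = hashlib.sha256(content).hexdigest()   # content-addressed key (an assumption)
    blob_store[blob_key] = content
    metadata_db[url] = {
        "title": title,
        "description": description,
        "blob_ref": blob_key,        # reference back into the blob store
    }
```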

  • Why is sharding used in the database of a web search engine?

    -Sharding is used to distribute the database across multiple nodes, allowing for efficient storage and retrieval of data at scale by partitioning the data based on a shard key.
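
A hash-mod sketch of shard routing, assuming the URL is the shard key; production systems often prefer consistent hashing or range partitioning so that adding nodes does not reshuffle every key.

```python
import hashlib

NUM_SHARDS = 16  # illustrative shard count

def shard_for(shard_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key (e.g. a URL) to a shard number by hashing it.

    A stable hash is used rather than Python's built-in hash(), which is
    randomized per process and would send the same key to different shards.
    """
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Example: reads and writes for the same URL always land on the same node.
node = shard_for("https://example.com/cats")
```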

  • What is a global index in the context of a web search engine?

    -A global index is a sharded database that contains hashes and URLs for quick lookup, allowing the system to efficiently find and access data in the primary database.

  • How does the URL Frontier ensure politeness and priority in crawling?

    -The URL Frontier uses multiple queues sorted by priority and host to ensure that high-priority URLs are crawled more frequently and that only one crawler accesses a single host at a time to avoid overwhelming the host.
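
A single-threaded sketch of the politeness half of that design, assuming per-host queues and a min-heap of (next-allowed-time, host); refilling empty queues (the router's job) and multi-worker coordination are omitted.

```python
import heapq
import time
from collections import deque

# Per-host queues: every URL for a host lives in exactly one queue (politeness invariant).
host_queues: dict[str, deque[str]] = {
    "example1.com": deque(["https://example1.com/a", "https://example1.com/b"]),
    "example2.com": deque(["https://example2.com/x"]),
}

# Min-heap of (next_allowed_time, host). A worker pops the top entry and holds it while
# crawling, so no other worker can touch that host until the entry is pushed back.
ready_heap: list[tuple[float, str]] = [(0.0, host) for host in host_queues]
heapq.heapify(ready_heap)

def crawl_next(fetch) -> None:
    """Claim the next ready host, crawl one of its URLs, then release it with a delay."""
    next_time, host = heapq.heappop(ready_heap)
    time.sleep(max(0.0, next_time - time.time()))   # honor the host's earliest-fetch time
    url = host_queues[host].popleft()               # assumes the queue is non-empty
    started = time.time()
    fetch(url)                                      # caller-supplied download function
    load_time = time.time() - started
    # Rule of thumb from the video: wait roughly 10x the page load time before
    # hitting the same host again.
    heapq.heappush(ready_heap, (time.time() + 10 * load_time, host))
```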

Outlines

00:00

🔍 Building a Scalable Web Search Engine

The paragraph introduces the concept of building a scalable web search engine akin to Google. It outlines the basic requirements, such as the ability for users to input search queries, the engine's capability to find relevant sites, and the display of titles and descriptions. The script then delves into the backend processes, including the use of an API to handle search requests, a database to store web pages, and a crawler to discover and download new pages. The crawler is tasked with navigating the internet and following URLs to find the associated web pages. The discussion also touches on the challenges of acquiring a comprehensive database of internet sites and the need for a URL database to manage the vast number of links.

05:02

🌐 Components of a Search Engine System

This section discusses the various components involved in a search engine system. It starts with the API, which has a single endpoint to accept search queries and return relevant web pages. A load balancer is introduced to distribute user requests efficiently. The script then moves on to the database, emphasizing the need to store URLs, site content, titles, descriptions, and hashes to ensure uniqueness. It also mentions the importance of a last updated date and a priority for scraping. The concept of a blob store is introduced to handle large amounts of site content, with references stored in the database. The paragraph concludes with a discussion on sharding the database to distribute data across multiple nodes and the use of shard keys for efficient data retrieval.
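
The metadata fields described here can be summarized as a simple record; the field names below are assumptions chosen for illustration, not a schema from the video.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PageRecord:
    """One row of search-engine metadata, as described in the video.

    The raw HTML itself lives in the blob store; blob_ref points to it.
    """
    url: str               # also the shard key for the primary table
    blob_ref: str          # reference to the content object in the blob store
    title: str
    description: str
    content_hash: str      # used to detect duplicate pages
    last_updated: datetime # how fresh the stored copy is
    crawl_priority: int    # drives how often the URL Frontier re-queues this page
```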

10:04

🕷️ The Role of Crawlers and URL Frontier

The paragraph explains the role of crawlers in a search engine, which is to fetch web pages and add them to the database. It introduces the URL Frontier, a system that manages a list of URLs to be crawled. The script discusses the importance of respecting robots.txt files to avoid overloading websites. It also addresses the need for a robots.txt cache to improve crawling efficiency. The discussion then turns to scaling crawlers, calculating the number of concurrent crawls needed based on the number of pages and the frequency of updates. The importance of geographical distribution of crawlers to optimize bandwidth usage and latency is also highlighted.
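
The crawler-scaling estimates in this section can be reproduced as a quick back-of-the-envelope calculation; the per-node crawl count and page size are the video's rough figures.

```python
# Back-of-the-envelope numbers from the video, reproduced as a calculation.
pages = 100e9                 # unique pages to keep fresh
refresh_days = 10             # each page re-crawled every 10 days
seconds_per_crawl = 2         # average page load time
page_size_mb = 2              # average page size

crawls_per_second = pages / (refresh_days * 24 * 3600)       # roughly 116,000 crawls per second
concurrent_crawls = crawls_per_second * seconds_per_crawl    # ~231,000 crawls in flight
nodes_needed = concurrent_crawls / 25                        # ~9,300 machines at 20-30 crawls each
bandwidth_tbps = crawls_per_second * page_size_mb * 8 / 1e6  # ~1.85 terabits per second

print(round(concurrent_crawls), round(nodes_needed), round(bandwidth_tbps, 2))
```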

15:05

🗂️ URL Frontier: Prioritization and Politeness

This section delves into the complexities of the URL Frontier, which is responsible for managing the order in which URLs are crawled. It discusses the need for prioritization, as different sites update at varying frequencies, and politeness, to avoid overwhelming a single website with multiple crawls. The script outlines a solution involving multiple queues for different priorities and a heap to ensure that only one crawler accesses a single host at a time. It also introduces a router to distribute URLs to the correct queues. The paragraph concludes with some considerations for scaling the URL Frontier and ensuring efficient use of resources.
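
The biased queue selection described here can be sketched with weighted random choice; the number of priority levels and the 50/30/20 split are the illustrative values used in the video.

```python
import random
from collections import deque

# Three priority queues (an illustrative number) and the selection bias from the video.
priority_queues: list[deque[str]] = [deque(), deque(), deque()]
weights = [0.5, 0.3, 0.2]   # higher-priority queues are picked more often

def next_url() -> str | None:
    """Pick a queue at random, biased toward higher priority, and pop a URL from it."""
    non_empty = [i for i, q in enumerate(priority_queues) if q]
    if not non_empty:
        return None
    chosen = random.choices(non_empty, weights=[weights[i] for i in non_empty], k=1)[0]
    return priority_queues[chosen].popleft()
```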

🔚 Final Thoughts and Further Exploration

The final paragraph summarizes the complete solution for building a scalable web search engine. It recaps the roles of the API, database, crawlers, and URL Frontier. The script encourages further exploration into areas such as fault tolerance, handling close duplicates, indexing algorithms, and personalizing search results. It also invites viewers to engage with the content provider's community for more learning resources and support.

Keywords

💡Search Engine

A search engine is a software system that is designed to carry out searches on the World Wide Web. In the context of the video, building a scalable web search engine like Google is the central theme. The video discusses how such an engine would need to handle user queries, find relevant sites, and display titles and descriptions of those sites.

💡API

An API, or Application Programming Interface, is a set of rules and protocols for building and interacting with software applications. The video introduces the concept of an API handling user search requests. It is the backend component that receives a search query and processes it to return relevant results.

💡Crawler

A crawler, in the context of web search engines, is a software bot that systematically browses the internet to discover new pages and add them to the database. The video explains the need for a crawler, referred to as a 'web crawler', to find HTML pages and download them to be indexed in the database.

💡Database

A database is an organized collection of data typically stored and accessed electronically. The video describes the necessity of having a comprehensive database that contains entries for every web page on the internet. This database is crucial for the search engine to retrieve relevant results based on user queries.

💡Load Balancer

A load balancer is a networking device or software that distributes network or application traffic across multiple servers. In the video, a load balancer is mentioned as a way to ensure that a large number of users can access the search engine simultaneously by routing each request to an API instance based on current load.

💡Blob Store

A blob store, or binary large object store, is a storage system designed to hold large binary objects. The video discusses using a blob store, like Amazon S3, to store the actual content of web pages efficiently and separately from the metadata stored in the database.

💡Sharding

Sharding is the process of distributing data across multiple servers or databases to improve manageability and performance. The video describes sharding the database to distribute the large amount of metadata across different nodes, ensuring efficient data retrieval.

💡URL Frontier

The URL Frontier is a component of the system that manages a list of URLs to be crawled. The video explains how the URL Frontier ensures that URLs are crawled in the correct order, taking into account factors like priority and politeness to avoid overwhelming web servers.

💡Robots.txt

A robots.txt file is a web standard that tells crawlers which parts of a website they are not allowed to access. The video discusses the importance of respecting robots.txt files to avoid crawling pages that a site owner has disallowed.
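
Python's standard library already includes a robots.txt parser, so a respectful crawler can check permissions like this; the user-agent string and URLs are placeholders.

```python
from urllib import robotparser

# Parse a site's robots.txt once and reuse it. In the full design this object would
# live in the robots.txt cache rather than being re-fetched on every crawl.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                    # downloads and parses the file

if rp.can_fetch("MyCrawler", "https://example.com/some/page"):
    print("allowed to crawl")

delay = rp.crawl_delay("MyCrawler")          # site-requested delay, if any
```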

💡Indexing

Indexing, in the context of search engines, is the process of organizing data so that it can be searched quickly and efficiently. The video touches on the need for indexing raw page content and metadata to enable fast and relevant search results.
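
A toy inverted index keyed by word frequency, roughly matching the text index described in the video. Frequency alone is a crude relevance signal; real engines combine many factors such as link analysis.

```python
import re
from collections import Counter, defaultdict

# word -> list of (count, url), kept sorted so the most frequent page comes first.
inverted_index: dict[str, list[tuple[int, str]]] = defaultdict(list)

def index_page(url: str, text: str) -> None:
    """Add a page's word counts to the inverted index."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    for word, count in counts.items():
        inverted_index[word].append((count, url))
        inverted_index[word].sort(reverse=True)   # highest frequency first

def top_pages(word: str, k: int = 10) -> list[str]:
    """Return the URLs where a word appears most often."""
    return [url for _, url in inverted_index[word.lower()][:k]]
```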

💡Hash

A hash is a fixed-size value that acts as a fingerprint of a piece of data. In the video, hashes are used to determine whether a web page is unique so that only one copy is stored. Comparing hashes makes it cheap to identify duplicate content across the internet.

Highlights

Introduction to building a scalable web search engine.

User requirements for a search engine: entering queries, finding relevant sites, and viewing titles and descriptions.

Behind-the-scenes process of a search engine query.

The role of an API in handling user search requests.

Necessity of a database containing all websites for relevance.

Crawler's function to find and download HTML pages for the database.

How a crawler discovers new URLs to crawl.

Challenges in building the API for scalability.

Implementation of a load balancer for handling multiple users.

Database storage requirements for URLs, content, titles, descriptions, and hashes.

Use of hashes to save storage space and bandwidth by including only unique records.

Importance of last updated dates and scraping priorities in the database.

Query patterns for efficient data retrieval in the database.

Challenge of managing 200 petabytes of raw page content.

Introduction of a blob store for efficient storage of large binary objects.

Sharding as a method for distributing the database.

Use of a global index for efficient lookup by hash.

Building a text index for word frequency and efficient search.

System design including API, database, blob store, and crawlers.

Crawler's role in fetching pages and updating the database.

Respecting robots.txt files to avoid overloading websites.

Use of a robots.txt cache to improve crawling efficiency.

Mathematical scaling considerations for crawlers.

Importance of geographical distribution of crawlers for bandwidth management.

URL Frontier's role in managing and prioritizing URLs for crawling.

Politeness policy in URL Frontier to avoid overloading hosts.

Router's function in the URL Frontier to assign URLs to queues.

Handling of empty queues and refilling them in the URL Frontier.

Final system overview with API, database, crawlers, and URL Frontier.

Suggestions for further improvements and considerations.

Transcripts

[00:00] So today we're going to be learning how to build a scalable web search engine, something like google.com, for example. There are some simple requirements for a service like this: the user, of course, needs to be able to enter a search query, the search engine needs to be able to find sites that are relevant to that query, and the user needs to be able to see a list with titles and descriptions of each one of those sites. Of course we're all familiar with this process, we've probably used Google a million times, but let's take a look at what actually happens behind the scenes when we do this.

[00:30] Let's say you're going to your search engine and you're searching for cats. There needs to be something that can handle that request in the back end, so we're going to introduce this API here, and the API is just going to be responsible for handling the user's request for a search. In terms of how it actually handles that request, it's going to need to look up a site that's relevant to the user in some sort of database, so we need to have a database of all of the websites on the internet. This database here is going to contain an entry for every single page on the internet. Well, the logical next question is how do we acquire a database with every single site on the internet, and the answer is that we need some sort of crawler that can go out to the internet, find the HTML associated with a page, and download it to add it to the database; we're going to call that a crawler. So now the final question here is how does the crawler even know what sites to crawl, how does it know which URLs are associated with websites? To answer that question we really need to think about the web as a web: a single page has multiple URLs that point to other pages, and each one of those pages has URLs that point to other pages, and so on. So once we download a single URL, we can extract the page, search the HTML content of the page for new URLs, and add those URLs to our URL database. Each one of these components presents its own unique challenges, but now that we understand which problems we're actually solving, let's dig into the first one, which is building this API here.

[01:57] If you're enjoying this video, we have plenty more awesome data structure, algorithm, and system design explanations on interviewpen.com. You can ask us any questions you have about any kind of topics surrounding data structures and algorithms and system design, we release two to four videos a week, you can run your code, you can talk to a personalized AI teaching assistant, and yeah, this site's pretty great. Anyway, enjoy the video.

[02:16] So this is a pretty simple API, and it only needs one endpoint: all it needs to do is accept a search query and return a list with titles, descriptions, and URLs of web pages. Thinking about how this scales, we want to make sure that a large number of users can access the site at once, so we want to add in a load balancer, and the load balancer is going to be responsible for taking a single user and routing them to the correct API based on load. I've also added a page number over here so that instead of returning an entire list of every single site that could possibly be relevant to the search query, we're only returning a subsection, and the user can ask for the next page if they're interested in more results.

[02:56] Now that we have the API squared away, let's take a look at the next part, which is the database that will actually store the data that the API is going to look up. Thinking about the sort of data that needs to be stored in a database like this, we of course need the URL and the actual content of the site. We also need the title and the description associated with each site, as that's what's actually going to be displayed to the user. We're also going to include a hash: since there are one trillion pages on the internet and only a hundred billion of those are unique, we can save a lot on storage space and bandwidth by only including unique records, and we can compute whether or not something is unique by using its hash. We're also going to include a last updated date, so we can determine how frequently a site changes, along with a priority with which we will scrape that site. We'll get into why we need these last two points once we get to the URL Frontier at the end of the video.

[03:48] Then we have a few important query patterns. The first one is that we need to be able to get a site by URL; maybe we want to look up a site's content, or perhaps we want to find the priority for fetching that site. We also need to be able to find whether there are any pages with a specified hash, so being able to plug in a hash and check whether there are any duplicate records associated with it is super important. And perhaps most importantly, we need to be able to search for a word: when we type in cats, we want to be able to see every page that has cats in it and which pages cats appears in most frequently.

[04:19] So let's do a little bit of math here. Given this information about how this problem scales, we can see that there are 200 petabytes of just raw page content that we'll need to index, and if we look at the size of our metadata and add that all up, it comes to around 30 terabytes of metadata. Managing 200 petabytes of storage inside our database is going to be very difficult, and depending on the actual database that we use, this could end up reducing the efficiency of our queries considerably. To get around this problem, we're going to introduce a separate blob store, something like Amazon's S3, for example. It is going to be responsible for holding these large binary objects, which are the sites' content, and storing them efficiently in a distributed manner. Then, instead of storing the actual content in our database, we can store the content in our blob store and include a reference to the item we've added to blob storage in our database.

[05:18] Considering we still have 30 terabytes of metadata, we're going to want to make sure we're able to distribute our database, and the easiest way to do that is with sharding. In order to shard our database, we're going to split it across multiple different nodes, and each node is going to contain a subset of the data. Then, in order to determine which node the data actually goes on, we're going to use this shard key here; an algorithm can be run on the shard key to immediately determine which node the data is stored on. When choosing a shard key, it's super important that we have high cardinality and low frequency of the data so that we can efficiently spread the data across these multiple partitions. The URL, being a unique identifier for one of these rows, is a very good fit for this. This also satisfies our query pattern nicely, because we can now search for a URL very efficiently: we know exactly what partition the data is on based on the URL alone.

[06:11] Next up, we want to be able to get a page by its hash, so we're going to introduce the concept of a global index. The global index is a sharded database in and of itself, except that instead of holding all of the data, we're only including a hash and a URL that can be used to look up the data in the primary table. The hash in this case really becomes the shard key, and we can immediately see what partition the data is on based on the hash alone.

[06:33] Finally, we want to be able to look up a word, and this is again another sharded database, except that instead of holding one record for every URL, we're holding every single word that appears within that URL and the frequency at which it occurs. So we're essentially using the word as the shard key in this case, and the data within one particular shard can be sorted on this frequency, which means that the top item is always going to be the one with the highest frequency. So let's say we're looking up cats: since this is the shard key, we determine what node it's on and then look at just the first record to see the page with the highest frequency.

[07:10] So let's take a step back now that we've figured out how the database works and take a look at the solution we've developed thus far. We have our users accessing our site, and they go through a load balancer so we can distribute our API horizontally. Our API then accesses this text index, which determines what URLs are associated with the words the user searched. Then the API can look up the metadata in the database and the actual page content in the blob store. This works great. Looking from the other side, when data enters our system, we can ensure that the data isn't duplicated by checking our hash index, and then we can go ahead and add the site content to the blob store and the actual metadata to the database, which will then propagate to our indexes.

[07:51] The next step is to determine how the data comes in here, right, this arrow. To do that we're going to take a look at this crawler, and if we remember from before, the crawler's job is just to go out to the internet, fetch the page, download it, and put it in the database. This is a pretty simple solution: all we need is a big set of servers that does just that. Each one takes a URL from this URL Frontier here, which stores a list of every URL on the internet, goes out to the internet, downloads that page, and then extracts any URLs it finds in that page so they can be added to the URL Frontier. This is again how the problem becomes recursive, whereby each URL gives us multiple other URLs that we can fetch, which each give us even more URLs.

[08:36] Now, the last problem here is that we need to make sure we're filtering out URLs that are excluded by a site's robots.txt. If you're unfamiliar, every site has a robots.txt file that can determine what pages a crawler is allowed to crawl, so we need to make sure that we're respecting this file. However, if we're going to do that in this system, we have to download that robots.txt file every single time we want to crawl a site, before we can fetch the actual page. This adds a lot of overhead and makes our crawling significantly less efficient. To solve this problem, we're going to introduce this robots.txt cache; the cache is just going to take the robots.txt file and hold on to it until our crawler needs it again, and if the robots.txt file does not exist in the cache, then the crawler can go out to the internet, download the file, and place it in the cache so that it can be used later.

[09:23] Looking at some math for how our crawlers can scale: if we're doing 100 billion pages and crawling each one every 10 days, that's 10 billion crawls per day, and if we assume that each crawl takes around 2 seconds, which is the average page load time for a website, we can see that we need 231,000 concurrent crawls. That is a lot of concurrent crawls. To handle this, let's say we can run maybe 20 or 30 crawls on a single computer; we will still need 10,000 nodes in our system. However, since the rest of the system we're building is very scalable, nothing else should really be a bottleneck in scaling out our crawler infrastructure to that size. The next thing to consider is how much bandwidth we're using: if we consider that each page is 2 megabytes, that's going to be just under two terabits per second of bandwidth, which is a ridiculous amount of bandwidth to be using. So physical location actually becomes really important here. We want to make sure that we spread out our crawlers geographically so we can take advantage of different locations' internet infrastructure, and we also want to keep our crawlers physically close to the pages they're crawling so that we get optimal latency.

[10:32] Taking a look at what we've built so far: we have our API just like before, and we have our database that stores all of this data. Now what we're adding is this set of crawlers that can go out to the internet, download pages, and then send them off to our database. The only part we have missing here is this URL Frontier. If you're fuzzy on any of this, I would highly recommend at this point that you go back in the video, re-watch some pieces, and make sure you're super solid on each one of the pieces we've covered so far. The next thing we're going to focus on is the URL Frontier, and while it might seem simple, it's actually a really complicated problem in and of itself. So now that we're ready, let's take a look.

[11:09] There are a couple of key requirements for the URL Frontier. Of course, it needs to hold every URL on the internet so that it can send it to the crawler to be crawled, but it also needs to send those URLs to the crawler in the right order. What does that mean? It means two things. The first is priority: it needs to handle the fact that different sites update at different frequencies. For example, a site like CNN.com might need to be updated a couple of times every day, whereas a site such as somebody's company home page does still need to be crawled eventually but probably doesn't need to be updated more than once a month. The other thing we need to consider here is politeness: if we have 231,000 concurrent crawls and they all decide to crawl example.com at the same time, example.com will probably crash, and we want to avoid that at all costs. Ideally, we only want one of our crawlers crawling a single host at one time; the other crawlers should only be working on different hosts.

[12:01] To start off, let's take a look at a very basic URL Frontier implemented with a single queue. This queue can store every URL on the internet, so it does solve our main problem of being able to hold every URL. When we're actually using it, we simply take the URL at the end of the queue, crawl that URL, and then send it back to the beginning of the queue to be crawled again later. This solves the problem that every URL will eventually be crawled; however, it doesn't implement priority or politeness, it's simply going through every URL in order.

[12:30] So let's take a look at an example that solves the priority problem. Instead of having a single queue, we're going to have several queues where each one is assigned a specific priority, and when data enters the system, our prioritizer will determine what priority the data is and send it to the correct queue. Then, when our crawler is getting the next URL, this queue selector is going to randomly select one of these queues, although it will be biased toward ones with higher priority. For example, maybe there's a fifty percent chance of selecting this one, a thirty percent chance of selecting this one, and a twenty percent chance of selecting the last one. Once it selects a queue, it pulls the URL down, crawls the page, and then sends it back to the prioritizer, which determines which queue it should belong in. This does solve our priority problem: every single one of these URLs will get crawled eventually, but the higher-priority ones up here will get crawled more frequently.

[13:25] However, the last thing we still need to focus on is politeness. If all of these URLs are ordered and each one is from example.com, then our crawlers will still end up crawling all of those URLs in parallel, and we want to avoid that. So here's an example of a solution that solves all of these problems. We have the same solution from the last slide over here, and all we're adding on the right are these extra queues that are sorted by host. There are a couple of key requirements here: every single URL in one of these queues has to be from the same host, in this case example1.com, and if example1.com is assigned to this queue, none of the other queues are allowed to contain URLs from example1.com. Then we're also going to add this heap over here, which contains a reference to each queue along with a timestamp of the next time we're allowed to fetch a URL from that host. The first step to get the next URL is to grab the top item from the heap. The selector holds on to this item, which means that while this crawler is working on that queue, no other crawlers are able to access it; this is really important in solving our politeness problem. Once we've grabbed this element, we take the first URL, crawl that URL, and put it back through the prioritizer, which places it in the correct queue based on its priority, just like before. Once we've finished scraping that URL, we put this element back in the heap and update the timestamp based on the amount of time it took the page to load. A good rule of thumb here is that you wait 10 times the page load time, so for example, if your page takes two seconds to load, you'd wait 20 seconds before performing the next request at that website.

[14:59] Eventually, throughout this process, one of these queues will become empty; let's say, for example, this top queue is no longer filled with items. That's where this router comes into play. The router uses the same algorithm as before, picking a random one of the priority queues, and it starts taking URLs out of those queues and assigning them to the host queues. When it takes a URL out of one of these queues, it first determines what the host of that URL is. For example, let's assume that this URL here is for example3.com: the first step is to check whether any of these queues down here (remember that this one is empty) are assigned to example3.com. If one is, we'll send that URL over to that queue; otherwise, we'll put it in the empty queue. This is how the queues end up filling back up over time as we continue pulling more URLs from these queues.

[15:49] As a quick recap, we have two sets of queues. One is designed to take care of priority, and the other is designed to take care of politeness by sorting by host. By claiming one of these elements whenever we're trying to scrape a site, we ensure that there's only one worker ever working on a site at once, and by sorting the first set of queues by priority, we ensure that our router selects queues with higher priority more frequently.

[16:16] So let's take a look at some math and see how big this URL Frontier really needs to be. Looking at 100 billion URLs times 50 bytes per URL, we can see that we have five terabytes of URLs. That's a lot just for memory, so we're going to store most of these queues on disk, where the last few elements that are about to be pulled by the router can be kept in RAM. Next, looking at the IOPS and the bandwidth usage, we can see that we're using 116,000 IOPS and just under 50 megabits per second. The 50 megabits per second is more than reasonable and we really don't have to worry about it, and while the 116,000 IOPS is within reason for a high-speed SSD, we also don't have to worry about that, because we can simply put each one of these queues on a separate node and greatly increase the efficiency of this system. Another interesting point of consideration here is the heap: it's pretty difficult to horizontally scale, but since all we're storing is a reference to each queue and a timestamp, we can very easily regenerate this data in the event of a failure. So we can feel fairly comfortable putting this data in RAM, in which case the 116,000 IOPS becomes extremely reasonable.

[17:24] So let's take a look at where we're at now with the finished solution. We have, just like before, our API; we have our database that stores all of our information; and we have our crawlers that go out to the internet, download the site, and send the data off to the database. Finally, we have our finished URL Frontier, which takes URLs that the crawler finds in other sites, prioritizes them, routes them by host, and ensures that only one crawler is ever accessing a single host at once.

[17:49] There are a ton of places where you can dig deeper into this solution and make things more efficient or better in general. Some examples of things we haven't talked about yet, but that you should certainly think about if you're interested, are fault tolerance and handling close duplicates. For fault tolerance: here we're popping an element off of this heap, so what if the selector dies in the process? You want to make sure that you're able to recover that element in the heap, and making sure that any time something fails we can recover from it is very important. For close duplicates: remember we were hashing the page content to make sure there were no duplicates in the database; you could use a technique called shingles, which is similar to hashing but enables you to check for close duplicates, so that pages with only a word or a character of difference between them are handled efficiently. Digging into the index and the different algorithms other search engines use to efficiently index data at scale is also a really interesting problem, and using PageRank and personalizing results to the specific user is a very difficult problem in and of itself. Of course, there are tons more places you can explore to make this more efficient or more functional, and I would absolutely invite you to do so if you're curious.

[19:03] I hope this video gave you a general idea of how to approach a problem like this, and I hope it helped you prepare for your interviews. If you enjoyed this video, you can get a lot more content just like it on interviewpen.com. We publish two to four videos a week; really, it's just an arbitrary number, it's whenever I can sit down and do a video, because these videos take a whole day to do, and we're always online to answer any questions you may have. Join our Discord, join our newsletter, the Blueprint, where you can get more weekly data structure, algorithm, and system design topics, and subscribe and like this video if it actually helped you, and also tell a friend that we exist. That's all.


Related Tags
Search Engine, Web Scalability, System Design, Crawling, Database, API, Load Balancer, Blob Storage, URL Frontier, Data Structures