Design a Basic Search Engine (Google or Bing) | System Design Interview Prep
Summary
TL;DR: This video script outlines the process of building a scalable web search engine akin to Google. It covers the creation of an API to handle search queries, the necessity of a database to store web pages, and the use of a crawler to gather new URLs and HTML content. The script also delves into the importance of a URL Frontier for managing and prioritizing URLs to be crawled, ensuring politeness and avoiding overloading websites. The discussion includes scaling challenges, database sharding, and the use of a blob store for efficient data management.
Takeaways
- 🔍 **Search Engine Basics**: A scalable web search engine requires an API to handle user queries, a database to store relevant site information, and a mechanism to display titles and descriptions of search results.
- 🕷️ **Crawlers**: A crawler is essential for discovering and downloading web pages to populate the database, working by following links from one page to another.
- 🌐 **Web Structure**: The internet is viewed as a vast network of interconnected pages, where each page can link to multiple others, facilitating the crawler's discovery process.
- 🗄️ **Database Design**: The database must store URLs, site content, titles, descriptions, hashes for uniqueness, last updated dates, and scraping priorities.
- 🔑 **Sharding**: To manage large datasets, sharding is used to distribute data across multiple nodes, with each node holding a subset of the data determined by a shard key.
- 📚 **Blob Storage**: For efficiency, site content is stored in a separate blob store like Amazon S3, with only references kept in the main database.
- 🔑 **Global Index**: A global index is used for quick lookup of data based on hashes, which helps in managing duplicates and ensuring data integrity.
- 🔍 **Text Indexing**: Another sharded database is used for text indexing, allowing for efficient searching of words across web pages.
- 🌐 **Load Balancer**: A load balancer is crucial for distributing user requests across multiple API instances to ensure scalability and performance.
- 🚀 **Scalability**: The system is designed to scale horizontally, with the ability to add more nodes and infrastructure as the number of users and data grows.
- 📈 **URL Frontier**: The URL Frontier is a complex system for managing and prioritizing URLs to be crawled, ensuring politeness and efficiency in the crawling process.
Q & A
What are the basic requirements for a scalable web search engine?
-The basic requirements include the ability for users to enter a search query, the search engine's capability to find relevant sites, and the provision of a list with titles, descriptions, and URLs of each site.
What is the role of an API in a web search engine?
-The API is responsible for handling user search requests, looking up relevant sites in a database, and returning results to the user.
How does a crawler contribute to building a web search engine?
-A crawler goes out to the internet, finds HTML pages, downloads them, and adds them to the database, ensuring the search engine has an up-to-date index of web pages.
Why is a database necessary for a web search engine?
-A database is necessary to store the URLs, site content, titles, descriptions, hashes, last updated dates, and scraping priorities for all the web pages indexed by the search engine.
What is the purpose of a hash in the database?
-The hash is used to identify unique pages on the internet, saving storage space and bandwidth by only including unique records.
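As a rough illustration, here is a minimal Python sketch of how such a hash could be computed and checked; the in-memory set is a stand-in for the sharded global hash index described later, not the actual implementation from the video.

```python
import hashlib

seen_hashes = set()  # stand-in for the sharded global hash index

def content_hash(page_html: str) -> str:
    """Hash the page content so identical pages collapse to one record."""
    return hashlib.sha256(page_html.encode("utf-8")).hexdigest()

def is_duplicate(page_html: str) -> bool:
    """Return True if an identical page has already been stored."""
    h = content_hash(page_html)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

# Example: the second, identical page is recognized as a duplicate.
print(is_duplicate("<html>cats</html>"))  # False, first time seen
print(is_duplicate("<html>cats</html>"))  # True, duplicate
```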
How does a load balancer improve the scalability of a web search engine?
-A load balancer distributes user requests across multiple API instances, ensuring that a large number of users can access the site simultaneously without overloading any single server.
What is a blob store and how does it relate to a web search engine's database?
-A blob store is a storage system for large binary objects, like web page content. Instead of storing this content in the database, it's stored in the blob store, and the database holds a reference to it, improving efficiency.
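A minimal sketch of that pattern, assuming an S3-compatible blob store accessed through boto3; the bucket name, key scheme, and metadata fields are hypothetical.

```python
import hashlib
import boto3  # assumes an S3-compatible blob store is reachable and credentials are configured

s3 = boto3.client("s3")
BUCKET = "page-content"  # hypothetical bucket name

def store_page(url: str, html: str) -> dict:
    """Upload the raw HTML to the blob store and return the metadata row
    that would go into the sharded database, holding only a reference."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=key, Body=html.encode("utf-8"))
    return {
        "url": url,
        "blob_key": key,  # reference to the content, not the content itself
        "content_hash": hashlib.sha256(html.encode("utf-8")).hexdigest(),
    }
```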
Why is sharding used in the database of a web search engine?
-Sharding is used to distribute the database across multiple nodes, allowing for efficient storage and retrieval of data at scale by partitioning the data based on a shard key.
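A minimal sketch of shard-key routing with the URL as the shard key; the hash function and shard count here are illustrative choices, not the ones used in the video.

```python
import hashlib

NUM_SHARDS = 16  # illustrative; a real deployment would use many more nodes

def shard_for(url: str) -> int:
    """Map a shard key (here, the URL) deterministically onto one node."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every reader and writer runs the same function, so a lookup by URL
# goes straight to the right partition without consulting the others.
print(shard_for("https://example.com/cats"))
```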
What is a global index in the context of a web search engine?
-A global index is a sharded database that contains hashes and URLs for quick lookup, allowing the system to efficiently find and access data in the primary database.
How does the URL Frontier ensure politeness and priority in crawling?
-The URL Frontier uses two sets of queues, one organized by priority and one by host, to ensure that high-priority URLs are crawled more frequently and that only one crawler accesses a given host at a time, avoiding overwhelming that host.
Outlines
🔍 Building a Scalable Web Search Engine
The paragraph introduces the concept of building a scalable web search engine akin to Google. It outlines the basic requirements, such as the ability for users to input search queries, the engine's capability to find relevant sites, and the display of titles and descriptions. The script then delves into the backend processes, including the use of an API to handle search requests, a database to store web pages, and a crawler to discover and download new pages. The crawler is tasked with navigating the internet, using URLs to find the associated web pages. The discussion also touches on the challenges of acquiring a comprehensive database of internet sites and the need for a URL database to manage the vast number of links.
🌐 Components of a Search Engine System
This section discusses the various components involved in a search engine system. It starts with the API, which has a single endpoint to accept search queries and return relevant web pages. A load balancer is introduced to distribute user requests efficiently. The script then moves on to the database, emphasizing the need to store URLs, site content, titles, descriptions, and hashes to ensure uniqueness. It also mentions the importance of a last updated date and a priority for scraping. The concept of a blob store is introduced to handle large amounts of site content, with references stored in the database. The paragraph concludes with a discussion on sharding the database to distribute data across multiple nodes and the use of shard keys for efficient data retrieval.
🕷️ The Role of Crawlers and URL Frontier
The paragraph explains the role of crawlers in a search engine, which is to fetch web pages and add them to the database. It introduces the URL Frontier, a system that manages a list of URLs to be crawled. The script discusses the importance of respecting robots.txt files to avoid overloading websites. It also addresses the need for a robots.txt cache to improve crawling efficiency. The discussion then turns to scaling crawlers, calculating the number of concurrent crawls needed based on the number of pages and the frequency of updates. The importance of geographical distribution of crawlers to optimize bandwidth usage and latency is also highlighted.
🗂️ URL Frontier: Prioritization and Politeness
This section delves into the complexities of the URL Frontier, which is responsible for managing the order in which URLs are crawled. It discusses the need for prioritization, as different sites update at varying frequencies, and politeness, to avoid overwhelming a single website with multiple crawls. The script outlines a solution involving multiple queues for different priorities and a heap to ensure that only one crawler accesses a single host at a time. It also introduces a router to distribute URLs to the correct queues. The paragraph concludes with some considerations for scaling the URL Frontier and ensuring efficient use of resources.
🔚 Final Thoughts and Further Exploration
The final paragraph summarizes the complete solution for building a scalable web search engine. It recaps the roles of the API, database, crawlers, and URL Frontier. The script encourages further exploration into areas such as fault tolerance, handling close duplicates, indexing algorithms, and personalizing search results. It also invites viewers to engage with the content provider's community for more learning resources and support.
Keywords
💡Search Engine
💡API
💡Crawler
💡Database
💡Load Balancer
💡Blob Store
💡Sharding
💡URL Frontier
💡Robots.txt
💡Indexing
💡Hash
Highlights
Introduction to building a scalable web search engine.
User requirements for a search engine: entering queries, finding relevant sites, and viewing titles and descriptions.
Behind-the-scenes process of a search engine query.
The role of an API in handling user search requests.
Necessity of a database containing all websites for relevance.
Crawler's function to find and download HTML pages for the database.
How a crawler discovers new URLs to crawl.
Challenges in building the API for scalability.
Implementation of a load balancer for handling multiple users.
Database storage requirements for URLs, content, titles, descriptions, and hashes.
Use of hashes to save storage space and bandwidth by including only unique records.
Importance of last updated dates and scraping priorities in the database.
Query patterns for efficient data retrieval in the database.
Challenge of managing 200 petabytes of raw page content.
Introduction of a blob store for efficient storage of large binary objects.
Sharding as a method for distributing the database.
Use of a global index for efficient lookup by hash.
Building a text index for word frequency and efficient search.
System design including API, database, blob store, and crawlers.
Crawler's role in fetching pages and updating the database.
Respecting robots.txt files to avoid overloading websites.
Use of a robots.txt cache to improve crawling efficiency.
Mathematical scaling considerations for crawlers.
Importance of geographical distribution of crawlers for bandwidth management.
URL Frontier's role in managing and prioritizing URLs for crawling.
Politeness policy in URL Frontier to avoid overloading hosts.
Router's function in the URL Frontier to assign URLs to queues.
Handling of empty queues and refilling them in the URL Frontier.
Final system overview with API, database, crawlers, and URL Frontier.
Suggestions for further improvements and considerations.
Transcripts
So today we're going to be learning how
to build a scalable web search engine
something like google.com for example
there's some simple requirements that we
have for a service like this the user of
course needs to be able to enter a
search query the search engine needs to
be able to find sites that are relevant
to that query and the user needs to be
able to see a list with titles and
descriptions of each one of those sites
so of course we're all familiar with
this process we've probably used Google
a million times but let's take a look at
what actually happens behind the scenes
when we do this so let's say you're
going to your search engine and you're
searching for cats of course there needs
to be something that can handle that
request in the back end so we're going
to introduce this API here and the API
is just going to be responsible for
handling the user's request for a search
in terms of how it actually handles that
request it's going to need to look up a
site that's relevant to the user in some
sort of database so we need to have a
database of all of the websites on the
internet so this database here is going
to contain an entry for every single
page on the internet well The Logical
next question is how do we acquire a
database with every single site on the
internet and the answer is that we need
some sort of crawler that can go out to
the internet find the HTML associated
with this page and download it to add it
to the database and we're going to call
that a crawler so now the final
question here is how does the crawler
even know what sites to crawl how does
it know which URLs are associated with
websites and to answer that question we
really need to think about the web as a
web a single page on the web has
multiple URLs to point to other pages
and each one of those pages has URLs
that point to other Pages
etc etc so when we think about that we
can think that once we download a single
URL we can extract the page and then we
can search the HTML content of the page
for new URLs and we can add those URLs
to our URL database so each one of these
components presents its own unique
challenges but now that we understand
which problems we're actually solving
let's dig into the first one which is
building this API here if you're
enjoying this video we have plenty more
awesome data structure algorithm and
system design explanations on
interviewpen.com you can ask us any
questions you have about any kind of
topics surrounding data structures and
algorithms and system design we released
two to four videos a week you can run
your code you can talk to a personalized
AI teaching assistant and yeah this
site's pretty great anyway enjoy the
video so this is a pretty simple API and
it only needs one endpoint and all it
needs to do is accept a search query and
return a list with titles descriptions
and URLs of web pages thinking about how
this scales we want to make sure that a
large number of users can access the
site at once so we want to add in a load
balancer and the load balancer is going
to be responsible for taking a single
user and then routing it to the correct
API based on load I've also added a page
number over here so that we can make
sure that instead of returning an entire
list of every single site that could
possibly be relevant to the search query
we're only returning a subsection and
the user can search for the next page if
they're interested in more results
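To make the endpoint concrete, here is a minimal sketch of what it might look like; the in-memory index, metadata table, and page size are hypothetical stand-ins for the sharded stores described below.

```python
# In-memory stand-ins for the sharded text index and metadata database.
TEXT_INDEX = {"cats": ["https://example.com/cats", "https://example.org/kittens"]}
METADATA = {
    "https://example.com/cats": {"title": "All About Cats", "description": "Cat facts."},
    "https://example.org/kittens": {"title": "Kittens", "description": "Kitten care."},
}
PAGE_SIZE = 10  # hypothetical page size

def search(query: str, page: int = 0) -> list:
    """Single search endpoint: look up the query in the text index, then
    return one page of (title, description, url) results."""
    urls = TEXT_INDEX.get(query, [])
    start = page * PAGE_SIZE
    results = []
    for url in urls[start:start + PAGE_SIZE]:
        meta = METADATA[url]
        results.append({"title": meta["title"],
                        "description": meta["description"],
                        "url": url})
    return results

print(search("cats", page=0))
```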
so now that we have the API squared away
let's take a look at the next part which
is the database that will actually
store the data that the API is going to
look up so thinking about the sort of
data that needs to be stored in a
database like this we of course need the
URL and the actual content of the site
we also need the title and the
description associated with each site as
that's what's actually going to be
displayed to the user we're also going
to include in here a hash since there's
one trillion pages on the internet and
there's only a hundred billion of those
that are unique we can save a lot on
storage space and bandwidth by only
including unique records so we can
actually compute whether or not
something is unique by using its hash
we're also going to include a last
updated date so we can determine how
frequently a site changes along with a
priority for which we will scrape that
site we'll get into why we need these
last two points once we get into the URL
Frontier at the end of the video so that
we have a few important query patterns
the first one is we need to be able to
get a site by URL of course maybe we
want to look up a site's site content or
perhaps we want to find the priority for
fetching that site we also need to be
able to find if there's any pages with a
specified hash so being able to plug in
a hash and check if there's any
duplicate records associated with it is
super important and perhaps most
importantly we need to be able to search
for a word so when we type in cats we
want to be able to look and see every
page that has cats in it and in which pages
cats appears the most frequently so
let's do a little bit of math here given
this information about how this problem
scales we can see that there's 200
petabytes of just raw page content that
we'll need to index and then if we look
at the size of our metadata and add that
all up we can see that that's going to
be around 30 terabytes of metadata so if
we look at this managing 200 petabytes
of storage inside our database is going
to be very difficult and depending on
the actual database that we use this
could end up actually reducing the
efficiency of our queries considerably
so to get around this problem we're
going to introduce a separate blob store
something like for example Amazon's S3
and what this is going to be responsible
for doing is holding these large binary
objects which are the site's content and
storing them efficiently in a
distributed manner then instead of
storing the actual content in our
database we can store the content in our
blob store and then include a reference
to the item that we've added to our blob
storage in our database considering we
have 30 terabytes of metadata still
we're going to want to make sure we're
able to distribute our database and the
easiest way to do that is with sharding
in order to Shard our database we're
going to split it across multiple
different nodes and each node is going
to contain a subset of the data that we
need then in order to determine which
node the data actually goes on we're
going to use this shard key here an
algorithm can be run on the shard key to
immediately determine which node the
data is stored on when looking for a
shard key it's super important that we
have high cardinality and low frequency
of the data so that we can efficiently
spread the data across these multiple
partitions the URL being a unique
identifier for one of these rows is a
very good fit for this this also
satisfies our query pattern nicely
because we can now search for a URL very
efficiently because we know exactly what
partition the data is on based on the
URL alone next up we want to be able to
get a page by its hash so we're going to
introduce the concept of a global index
the global index is a sharded database
in and of itself except that instead of
holding all of the data in the database
we're only including a hash and then URL
that can be used to look up the data in
the primary table the hash in this case
really becomes the shard key and we can
immediately see what partition the data
is on based on the hash alone finally we
want to be able to look up a word and
this is again another sharded database
except that instead of holding one
record for every URL we're instead
holding every single word that appears
within that URL and the frequency at
which it occurs so we're using the word
as the shard key essentially in this
case and then the data that's within
this one particular shard can be sorted
on this frequency so that means that the
top item is always going to be the one
with the highest frequency of
words so let's say we're looking up cats
since this is the shard key we can determine
what node it's on and then look at just
the first record and see the one with
the highest frequency
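Here is a minimal in-memory sketch of that word index, with each word's posting list kept sorted by frequency so the top record is always the most frequent page; the data and helper names are illustrative.

```python
from collections import defaultdict
import re

# word -> list of (frequency, url), kept sorted so the highest frequency is first
text_index = defaultdict(list)

def index_page(url: str, html_text: str) -> None:
    """Count word frequencies on a page and merge them into the index."""
    counts = defaultdict(int)
    for word in re.findall(r"[a-z]+", html_text.lower()):
        counts[word] += 1
    for word, freq in counts.items():
        text_index[word].append((freq, url))
        text_index[word].sort(reverse=True)  # most frequent page first

def top_pages(word: str, k: int = 10) -> list:
    """The shard holding `word` only has to read the first few records."""
    return [url for _freq, url in text_index[word][:k]]

index_page("https://example.com/cats", "cats cats cats and more cats")
index_page("https://example.org/pets", "dogs and cats")
print(top_pages("cats"))  # example.com/cats ranks first
```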
so let's take a step back now that we've figured out how
the database works and take a look at
the solution that we've developed thus
far we have our users that are accessing
our site and they're going to go through
a load balancer so we can distribute our
API horizontally our API is then going
to access this text index which is going
to determine what URL is associated with
what words the user searched then the
API can look up the metadata in a
database and the actual page content in
the blob store this works great looking
from the other side when data enters our
system we can ensure that the data isn't
duplicated by checking our hash index
and then we can go ahead and add the
site content to The Blob store and the
actual metadata to the database which
will then propagate to our indexes the
next step is just to determine how the
data comes in here right this Arrow so
to do that we're going to take a look at
this crawler and if we remember from
before the crawler's job is just to
go out to the internet and fetch the
page download it and put it in the
database and this is a pretty simple
solution all we need is a big set of
servers that does just that all they
need to do is take a URL from this URL
Frontier here which stores a list of
every URL on the internet goes out to
the internet downloads that page and
then extracts any URLs that it finds in
that page so that it can add them to the
URL Frontier this is again how the
problem becomes recursive whereby each
URL gives us multiple other URLs that we
can fetch which each gives us even more URLs
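A small sketch of that extraction step, using Python's built-in html.parser to pull href links out of a downloaded page and resolve them against the page's own URL; purely illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag found in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_urls(page_url: str, html: str) -> list:
    """Return absolute URLs found on the page, ready for the URL Frontier."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(page_url, link) for link in parser.links]

html = '<a href="/dogs">dogs</a> <a href="https://example.org/">other</a>'
print(extract_urls("https://example.com/cats", html))
```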
now the last problem here is we
need to make sure that we're filtering
out URLs that are excluded by a site's
robots.txt if you're unfamiliar every
site has a robots.txt file that can
determine what pages a crawler is
allowed to crawl so we need to make sure
that we're respecting this file however
if we're going to do that in this system
we have to download that robots.txt file
every single time we want to crawl a
site before we can fetch the actual page
this is going to add a lot of overhead
and is going to make our crawling
significantly less efficient so in order
to solve this problem we're going to
introduce this robots.txt cache and the
cache is just going to take the
robots.txt file and hold on to it until
our crawler needs it again and if the
robots.txt file does not exist in the
cache then the crawler can go out to the
internet download the file and place it
in the cache so that it can be used later
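A rough sketch of such a cache using the standard library's robotparser; a real cache would also expire entries after some TTL, which is omitted here, and the https scheme is an assumption.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

robots_cache = {}  # host -> parsed robots.txt

def allowed_to_crawl(url: str, agent: str = "MyCrawler") -> bool:
    """Check robots.txt for the URL's host, fetching and caching it only once."""
    host = urlparse(url).netloc
    if host not in robots_cache:
        parser = RobotFileParser()
        parser.set_url(f"https://{host}/robots.txt")  # scheme assumed for the sketch
        parser.read()  # one network fetch per host, then reuse from the cache
        robots_cache[host] = parser
    return robots_cache[host].can_fetch(agent, url)
```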
so looking at some math for how our crawlers can scale if we're doing
100 billion pages and crawling each one
every 10 days that's 10 billion crawls
per day and if we assume that each crawl
takes around 2 seconds which is the
average page load time for a website we
can see that we need 231,000
concurrent crawls that is a lot of
concurrent crawls so in order to
handle this let's say we can maybe run
20 or 30 crawls on a single computer we
will still need 10,000 nodes in our
system right however since the rest of
the system that we're building is very
scalable nothing else should
really be a bottleneck in scaling out
our crawler infrastructure to that size
the next thing to consider is how much
bandwidth we're using so if we consider
that each page is 2 megabytes that's
going to be about just under two
terabits per second of bandwidth that
we're using that's a ridiculous amount
of bandwidth to be using so the physical
location actually becomes really
important here we want to make sure that
we spread out our crawlers
geographically so we can take advantage
of different locations internet
infrastructure and we also want to make
sure that we keep our crawlers
physically close to the pages that
they're crawling so that we get optimal latency
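The back-of-the-envelope numbers above, written out under the video's assumptions of 100 billion pages, a 10-day refresh cycle, 2-second page loads, and 2 MB pages:

```python
pages = 100e9            # unique pages to keep fresh
refresh_days = 10        # each page re-crawled every 10 days
page_load_s = 2          # average page load time
page_size_mb = 2         # average page size

crawls_per_day = pages / refresh_days                 # 10 billion crawls per day
crawls_per_second = crawls_per_day / 86_400           # ~115,700 crawls per second
concurrent_crawls = crawls_per_second * page_load_s   # ~231,000 crawls in flight
nodes = concurrent_crawls / 25                        # ~25 crawls per machine -> ~10,000 nodes
bandwidth_tbps = crawls_per_second * page_size_mb * 8 / 1e6  # just under 2 Tbps

print(f"{concurrent_crawls:,.0f} concurrent crawls, ~{nodes:,.0f} nodes, ~{bandwidth_tbps:.2f} Tbps")
```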
so taking a look at what we've built so far we have our API just like
before and we have our database that
stores all of this data now what we're
adding is this set of crawlers that can
go out to the internet download pages
and then send them off to our database
now the only part that we have missing
here is this URL Frontier so if you're
fuzzy on any of this I would highly
recommend at this point that you go back
in the video and re-watch some pieces
and make sure that you're super solid on
each one of these pieces that we've
covered so far the next thing we're
going to focus on is the URL Frontier
and while this might seem simple it is
actually a really complicated problem in
and of itself so now that we're ready
let's take a look so there's a couple
key requirements for the URL Frontier of
course it needs to hold every URL on the
internet so that it can send it to the
crawler to be crawled but it also needs
to send those URLs to the crawler in the
right order well what does that mean it
means two things the first is priority
so it needs to handle the fact that
different sites update at different
frequencies for example a site like
CNN.com might need to be updated a
couple times every day whereas a site
such as somebody's company home page
does still need to be crawled eventually
but probably doesn't need to be updated
more than once a month the other thing
that we need to consider here is
politeness so if we have 231,000
concurrent crawls and they all decide to
crawl example.com at the same time
example.com will probably crash so we
want to very much avoid that at all
costs so ideally we only want one of our
crawlers crawling a single host at one
time the other crawler should only be
working on different hosts so to start
off let's take a look at a very basic
URL Frontier that's implemented with a
single queue this queue can store every
URL on the internet so that does solve
our main problem of being able to hold
every URL when we're actually using this
we can simply take the URL at the end of
the queue crawl that URL and then send
it back to the beginning of the queue to
be crawled again later this solves the
problem of every URL will eventually be
crawled however it doesn't implement
priority or politeness it's simply going
through every URL in order so let's take
a look at an example that solves the
priority problem instead of having a
single queue we're going to have several
queues where each one is assigned a
specific priority and then when data
enters the system our prioritizer right
here will determine what priority the
data is and we'll send it to the
correct queue based on its priority then
when our crawler is getting the next URL
this queue selector here is going to
randomly select one of these queues
although it will be biased toward ones
with higher priority so for example
maybe there's a fifty percent chance of
selecting this one thirty percent chance
of selecting this one and twenty percent
chance of selecting the last one once it
selects that it's going to again pull
the URL down crawl the page and then
send it back to the prioritizer where it
will determine which queue it should
belong in so this does solve our
priority problem right every single one
of these URLs will get crawled
eventually however the higher priority
ones up here will just get crawled more
frequently
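The biased selection can be sketched with a weighted random choice; the three queues and the 50/30/20 split are simply the example weights mentioned above.

```python
import random
from collections import deque

# Front queues: one per priority level, highest priority first.
priority_queues = [deque(), deque(), deque()]
weights = [0.5, 0.3, 0.2]  # example bias: high-priority queues are picked more often

def next_url():
    """Pick a non-empty queue with probability proportional to its priority, then pop a URL."""
    candidates = [q for q in priority_queues if q]
    if not candidates:
        return None
    q = random.choices(candidates,
                       weights=[w for w, pq in zip(weights, priority_queues) if pq])[0]
    return q.popleft()

priority_queues[0].append("https://cnn.com/")
priority_queues[2].append("https://example.com/")
print(next_url())
```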
however the last thing that we still need to focus on is politeness
if all of these URLs are ordered and
each one is from example.com then our
crawlers will still end up crawling all
of those URLs in parallel and we want to
avoid that so here's an example of a
solution that solves all of these
problems so we have the same solution
from the last slide over here and then
all we're adding is on the right these
extra queues that are sorted by host so
there's a couple of key requirements
here so every single URL in one of these
queues has to be from the same host in
this case say example1.com and if
example1.com is assigned to this queue
none of these other queues are
allowed to contain URLs from
example1.com then we're also going to
add this Heap over here that contains a
reference to each queue along with a
time stamp of the next time that we're
allowed to fetch a URL from that host
when we're looking at this the first
step to get the next URL is to grab the
top item from the heap the selector will
hold on to this item and what that means
is that if this crawler is working on
this queue then no other crawlers are
able to access this queue while this one
is held this is really important in solving
our politeness problem once we've
grabbed this element we can take the
first URL crawl that URL and put it back
to the prioritizer which will put it in
the correct queue based on its priority
just like before once we've finished
scraping that URL we can then put this
element back in the heap and update the
timestamp based on the amount of time it
took the page to load a good rule of
thumb here is that you wait 10 times the
page load time so for example if your
page takes two seconds to load you'd
wait 20 seconds before performing the
next request at that website
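A compact single-process sketch of the host queues plus the timestamp heap, with heapq standing in for the distributed version and the 10x page-load-time rule described above; names and data are illustrative.

```python
import heapq
import time
from collections import deque

host_queues = {"example1.com": deque(["https://example1.com/a",
                                      "https://example1.com/b"])}
# Heap of (next_allowed_fetch_time, host): popping a host "claims" it,
# so no other crawler can work on that host until it is pushed back.
ready_heap = [(0.0, "example1.com")]

def crawl_one(fetch):
    next_time, host = heapq.heappop(ready_heap)    # claim the host
    time.sleep(max(0.0, next_time - time.time()))  # respect the politeness delay
    url = host_queues[host].popleft()
    started = time.time()
    fetch(url)                                     # download the page
    load_time = time.time() - started
    # Rule of thumb from the video: wait ~10x the page load time before
    # hitting the same host again, then release the host back to the heap.
    heapq.heappush(ready_heap, (time.time() + 10 * load_time, host))

crawl_one(lambda url: print("fetched", url))
```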
eventually throughout this process one of
these queues will become empty so let's
say for example this top queue is no
longer filled with items that's where
this router comes into play the router
will use the same algorithm before where
it picks a random one of these cues and
it'll start taking URLs out of those
queues and assigning them to other cues
when it takes a URL out of one of these
queues it's first going to try to
determine what the host of that URL is
so for example let's assume that this
URL here is for example3.com so the
first step is to check whether any of
these queues down here remember that this
one is empty are assigned to
example3.com if they are we'll go ahead
and send that URL over to that queue
otherwise we'll go ahead and put it in
the empty queue this is how the queues
will end up filling up over time as we
continue scanning for more URLs from
these queues as a quick recap we have
two sets of queues one is designed for
ensuring that we are taking care of
priority and the other is designed for
making sure that we're taking care
of politeness by sorting by host and
claiming one of these elements whenever
we're trying to scrape a site we can
ensure that there's only one worker ever
working on a site at once and by sorting
these first queues by priority we can
ensure that our router selects queues with
a higher priority more frequently so
let's take a look at some math and see
how big this URL Frontier really needs
to be looking at 100 billion URLs times
50 bytes per URL we can see that we have
five terabytes of URLs that's a lot just
for memory so we're going to store most
of these queues on disk where the last few
elements that are about to be pulled by
the router here can be stored in RAM
next looking at the IOPS and the
bandwidth usage we can see that we're
using 116,000 IOPS and just under 50
megabits per second the 50 megabits per
second is more than reasonable and we
really don't have to worry about this
and while the 116,000 IOPS is within
reason for a high speed SSD we also
don't have to worry about that because
we can simply take each one of these
queues and put it on a separate node and
greatly increase the efficiency of this
system another interesting point of
consideration here is this heap this is
pretty difficult to horizontally scale
however since all we're storing is a
reference to each queue and a timestamp
we can very easily regenerate this data
in the event of a failure so we can feel
fairly comfortable putting this data in
RAM in which case this 116,000
IOPS becomes extremely
reasonable
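The frontier numbers, worked out under the same assumptions of 100 billion URLs at roughly 50 bytes each and one dequeue per crawl:

```python
urls = 100e9
bytes_per_url = 50
crawls_per_day = 10e9

storage_tb = urls * bytes_per_url / 1e12          # 5 TB of queued URLs
dequeues_per_second = crawls_per_day / 86_400     # ~116,000 IOPS on the frontier
bandwidth_mbps = dequeues_per_second * bytes_per_url * 8 / 1e6  # just under 50 Mbps

print(f"{storage_tb:.0f} TB of URLs, {dequeues_per_second:,.0f} IOPS, ~{bandwidth_mbps:.0f} Mbps")
```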
so let's take a look at where we're at now with the finished solution
so we have just like before our API we
have our database that stores all of our
information and we have our crawlers
that go out to the internet download the
site and send the data off to the
database and finally we have our
finished URL Frontier which will take
URLs that the crawler finds in other
sites prioritize them route them by host
and ensure that only one crawler is ever
accessing a single host at once so
there's a ton of places where you can
dig deeper into the solution and make
things more efficient or make the
solution better in general so some
examples of things that we hadn't talked
about yet but you should certainly think
about if you're interested are fault
tolerance so for example here we're
popping an element off of this heap well
what if the selector dies in the process
you want to make sure that you're able
to recover that element in the heap so
things like that and making sure that
anytime something fails we can recover
from it is very important another thing
that we didn't talk about is handling
close duplicates so remember we were
hashing the page content in order to
make sure that there were no duplicates
in the database well you could use a
technique called shingles which is
similar to hashing but enables you to
check for close duplicates and make sure
that things where there's only a word or
a character difference between pages are
handled efficiently digging into the
index and different algorithms that
other search engines use to efficiently
index data at scale is also a really
interesting problem and using pagerank
and personalizing results to the
specific user is also a very difficult
problem in and of itself and of course
there's tons more places that you can
explore and find ways to make this more
efficient or more functional and I would
absolutely invite you to do so if you're
curious so I hope this video gave you a
general idea of how to approach a
problem like this and I hope this helped
you prepare for your interviews if you
enjoyed that video you can get a lot
more content just like this on
interviewpen.com we publish two to four
videos a week really it's just an
arbitrary number it's whenever I can sit
down and do a video because these videos
take a whole day to do and we're always
online to answer any questions you may
have join our Discord join our
newsletter the blueprint where you can
get more weekly data structure and
algorithm and system design kind of
topics and subscribe and like this video
if you actually like this video and it
helped you and also tell a friend that
we exist that's all