Google SWE teaches systems design | EP27: Search Indexes
Summary
TLDR: This video script delves into search indexes, which are crucial for large-scale companies like Google and Amazon because they make complex queries fast. It introduces Lucene, the popular open-source search index, and explains its architecture, which uses LSM trees and SS tables for performance. The script covers tokenization, the construction of an inverted index, and advanced search functionality such as prefix and suffix searches. It then discusses Elasticsearch, which scales Lucene across a multi-node cluster, employing replication and partitioning for efficient searching and caching to boost read performance. The video aims to clarify how search indexes are implemented, dispelling common misconceptions.
Takeaways
- 🔎 Search indexes are crucial for large-scale companies with search functionalities, enabling quick and complex queries.
- 📚 Lucene is a widely-used search index technology, open-sourced since 1999 under the Apache license, and forms the basis for search platforms like Elasticsearch and Solr.
- 🌳 Lucene uses an LSM tree and SS table architecture for high performance: writes first go to an in-memory buffer and are then flushed to disk roughly once per second during a refresh.
- 📝 When a document is added to Lucene, it undergoes tokenization, which involves splitting terms based on rules, handling punctuation, case, and filtering out common words.
- 🔑 Lucene's architecture includes an inverted index, mapping terms to a list of document IDs, allowing for efficient searching by term.
- 🔍 Lucene supports advanced search features, including prefix searching, term suffix searching, and numeric term searching, with binary search used for finding documents matching a prefix.
- 🔄 Elasticsearch extends Lucene by scaling it across a multi-node cluster using replication and partitioning, creating local inverted indexes for each node.
- 💾 Elasticsearch achieves high read throughput by implementing caching effectively, including OS caching of index pages and caching of query results and components.
- 🗂️ Unlike global inverted indexes, Elasticsearch uses document partitioning, meaning each index on a node is relevant only for documents on that node.
- 🚀 The benefits of using search indexes like Elasticsearch include faster text searching capabilities compared to traditional database scans and the ability to scale in a distributed environment.
- 📈 The video script emphasizes the importance of understanding search index implementations to avoid misconceptions and to effectively utilize these technologies in applications.
Q & A
What is the primary purpose of search indexes in large-scale companies?
-Search indexes are crucial for providing efficient search functionalities in applications. They optimize complex queries to run quickly, which is essential for companies like Google, Amazon, and Twitter that rely heavily on search capabilities.
Why are databases not ideal for search functionalities?
-Databases are not optimized for search functionalities because they are not designed to handle the speed and complexity of search queries as efficiently as specialized search index technologies can.
What is Lucene and why is it significant in the context of search indexes?
-Lucene is a high-performance, full-featured text search engine library created in 1999 and open-sourced under the Apache License. It serves as the foundation for many popular search platforms like Elasticsearch and Solr, making it a key component in the search index ecosystem.
Can you explain the LSM Tree and SS Table architecture used by Lucene?
-The LSM Tree and SS Table architecture is a method used by Lucene to maintain high performance. It involves first writing data to an in-memory buffer, which is then periodically flushed to disk in the form of immutable SS Table index files. These files are later compacted and merged to maintain efficiency.
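The write path described above can be sketched in a few lines of Python. This is a toy model, not Lucene's actual API: class and method names are illustrative, and the "segments" list stands in for immutable SS Table files on disk. Note that, matching the refresh nuance, buffered writes are not readable until they are flushed.

```python
# Toy sketch of the LSM-style write path: writes land in an in-memory
# buffer, which is periodically flushed to an immutable, sorted segment
# (analogous to an SS Table index file). Illustrative names throughout.

class TinyLSM:
    def __init__(self, flush_threshold=3):
        self.buffer = {}             # in-memory writes, not yet searchable
        self.segments = []           # immutable sorted segments on "disk"
        self.flush_threshold = flush_threshold

    def write(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.flush_threshold:
            self.refresh()

    def refresh(self):
        # flush the buffer to a new immutable segment, sorted by doc_id
        if self.buffer:
            self.segments.append(sorted(self.buffer.items()))
            self.buffer = {}

    def read(self, doc_id):
        # reads may touch several segments; the newest segment wins
        for segment in reversed(self.segments):
            for key, text in segment:
                if key == doc_id:
                    return text
        return None
```

A real engine would also compact and merge segments in the background; that step is omitted here for brevity.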
What happens during the tokenization process in Lucene?
-Tokenization is the process of splitting text into individual elements or terms based on certain rules, typically whitespace. It also involves handling punctuation, case normalization, contractions, and filtering out common words that may not be relevant for search purposes.
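The steps above can be illustrated with a minimal tokenizer. The contraction table and stop-word list here are tiny examples for demonstration, not Lucene's actual analyzers.

```python
import re

# Illustrative tokenizer covering the steps described: lowercase the
# text, expand a couple of contractions, split on non-alphanumerics
# (which also strips punctuation), and drop common stop words.
# The rules and word lists are examples, not Lucene's analyzers.

STOP_WORDS = {"the", "and", "a", "is"}
CONTRACTIONS = {"it's": "it is", "they're": "they are"}

def tokenize(text):
    text = text.lower()
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)
    terms = re.findall(r"[a-z0-9]+", text)   # split and drop punctuation
    return [t for t in terms if t not in STOP_WORDS]
```

For example, `tokenize("It's the BEST search-engine!")` yields `["it", "best", "search", "engine"]`.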
What is an inverted index and how does it work?
-An inverted index is the core of a search index: instead of mapping a document ID to terms, each term is mapped to a list of document IDs that contain it. Because the index is sorted by terms, documents containing a specific term or term prefix can be retrieved quickly.
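A small sketch makes both ideas concrete: building the term-to-documents mapping, and exploiting the sorted term order to answer a prefix query with two binary searches. All names here are illustrative.

```python
from bisect import bisect_left

# Sketch of an inverted index plus the two-binary-search prefix lookup
# described above. Because terms are sorted, all terms sharing a prefix
# occupy a contiguous range whose bounds are found in O(log n).

def build_inverted_index(docs):
    # docs: {doc_id: [terms]} -> {term: sorted list of doc_ids}
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def prefix_search(index, prefix):
    terms = sorted(index)
    lo = bisect_left(terms, prefix)                 # lower bound
    hi = bisect_left(terms, prefix + "\uffff")      # upper bound for the prefix
    return sorted({doc for t in terms[lo:hi] for doc in index[t]})
```

For instance, with documents containing "pen", "pencil", and "peach", `prefix_search(index, "pe")` returns the IDs of all three term owners in one contiguous scan of the sorted term range.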
How does Lucene handle searches for term suffixes or numeric terms?
-For term suffixes, Lucene can create a modified index where terms are stored in reverse order, allowing for the same prefix searching logic to be applied to suffixes. For numeric terms, Lucene may use different data structures or algorithms, such as comparing common digits in similar numbers.
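The reversed-term trick can be sketched directly: store every term backwards, reverse the suffix query, and reuse the prefix-search logic. This is an illustrative sketch, not Lucene's internal representation.

```python
from bisect import bisect_left

# The suffix trick described above: a suffix query over terms becomes a
# prefix query over the reversed terms. For simplicity this rebuilds the
# reversed list per query; a real index would store it once.

def suffix_search(index, suffix):
    # index: {term: [doc_ids]}; find documents whose terms end in `suffix`
    reversed_terms = sorted(term[::-1] for term in index)
    target = suffix[::-1]
    lo = bisect_left(reversed_terms, target)
    hi = bisect_left(reversed_terms, target + "\uffff")
    matches = {rt[::-1] for rt in reversed_terms[lo:hi]}
    return sorted({doc for t in matches for doc in index[t]})
```

For example, searching for the suffix "en" over the terms "pen", "open", and "peach" matches the first two, because their reversals "nep" and "nepo" share the prefix "ne".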
What is Elasticsearch and how does it differ from Lucene?
-Elasticsearch is a distributed, multitenant-capable full-text search engine built on top of Lucene. It scales Lucene across a multi-node cluster using replication and partitioning, creating local inverted indexes on each node relevant to the documents on that node.
How does Elasticsearch achieve high read throughput?
-Elasticsearch achieves high read throughput through effective caching mechanisms. It caches index pages in memory, results of queries, and components of queries, allowing for quick retrieval and reuse of data for similar or repeated searches.
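The query-result caching idea can be illustrated with a simple memoized shard lookup. This is a toy stand-in, not Elasticsearch's actual cache implementation; the shard layout and function names are invented for the example.

```python
from functools import lru_cache

# Illustrative query-result cache in the spirit described above: the
# result of a (shard, term) lookup is memoized, so a repeated query is
# served from memory instead of re-running the index lookup.

SHARDS = {
    "shard-0": {"pen": [1], "peach": [2]},   # hypothetical local indexes
    "shard-1": {"pen": [7]},
}

@lru_cache(maxsize=1024)
def search_shard(shard_name, term):
    # the "expensive" index lookup runs only on a cache miss
    return tuple(SHARDS[shard_name].get(term, ()))
```

Calling `search_shard("shard-0", "pen")` twice performs the lookup once; `search_shard.cache_info()` confirms the second call is a cache hit.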
What are the advantages and disadvantages of document partitioning in Elasticsearch?
-Advantages include efficient local search and write operations on a specific node. Disadvantages include the need to reach out to multiple shards for cross-shard searches, which requires merging results from different nodes.
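The cross-shard read path described above is often called scatter-gather: the query fans out to every shard's local inverted index and the partial results are merged. A minimal sketch, with an invented shard layout:

```python
# Scatter-gather over document-partitioned shards: each shard holds a
# local inverted index covering only its own documents, so a search must
# query every shard and merge the results. Layout is illustrative.

SHARDS = [
    {"pen": [1, 4], "peach": [2]},   # local index for documents on node 0
    {"pen": [7]},                    # local index for documents on node 1
]

def search_all_shards(term):
    results = []
    for local_index in SHARDS:       # scatter: query every shard
        results.extend(local_index.get(term, []))
    return sorted(results)           # gather: merge the partial results
```

The merge step is the cost of document partitioning; the benefit is that a write only touches the one shard holding that document.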
Why are search indexes important for large applications?
-Search indexes are important for large applications because they enable fast and efficient text searches, which would be impractical with traditional database scans and substring logic. They are essential for providing a seamless user experience in applications that require robust search capabilities.
Outlines
🔍 Introduction to Search Indexes
The video script begins with an introduction to search indexes, highlighting their importance in large-scale companies for efficient search functionalities. The speaker explains how traditional databases are not optimized for search operations and introduces Lucene as a popular search index technology, which has been open-sourced under the Apache license since 1999. Lucene forms the backbone for search platforms like Elasticsearch and Solr. The script then delves into Lucene's architecture, which utilizes an LSM tree and SS table structure to maintain high performance, with an emphasis on the write process involving an in-memory buffer and periodic refresh to disk. The summary also touches on the read process, which may involve multiple SS table files and merging results.
📚 Deep Dive into Lucene's Architecture and Features
This paragraph provides a deeper insight into Lucene's architecture, starting with the tokenization process of adding a document to Lucene, which involves splitting text into terms while handling punctuation, case sensitivity, contractions, and common words. The core of Lucene's functionality is its inverted index, which maps terms to document IDs, allowing for efficient searching. The script discusses how Lucene handles prefix searches using binary search on sorted term tables and introduces the concept of a modified index for suffix searches and numeric terms. It also mentions Lucene's capabilities for text similarity searches using Levenshtein distance and geospatial data, suggesting these topics for future videos. The paragraph concludes with an overview of Elasticsearch, which scales Lucene across a multi-node cluster using replication and partitioning, and highlights the benefits of Elasticsearch's caching mechanisms for achieving high read throughput.
Keywords
💡Search Indexes
💡Lucene
💡LSM Tree
💡SS Table
💡Tokenization
💡Inverted Index
💡Elasticsearch
💡Replication
💡Partitioning
💡Caching
💡Levenshtein Distance
Highlights
Search indexes are essential for large-scale companies with embedded search functionalities, optimizing complex queries for speed.
Lucene, an open-source search index since 1999, is widely used under the Apache license and powers search functionalities in platforms like Elasticsearch and Solr.
Lucene's architecture uses an LSM tree and SS table for high performance, with writes buffered in-memory and refreshed periodically to disk.
Lucene's tokenization process involves splitting documents into terms, handling punctuation, case, and common words to prepare for efficient searching.
The inverted index in Lucene maps terms to document IDs, allowing for rapid retrieval of documents containing specific terms.
Lucene supports prefix searches by utilizing sorted tables and binary search, enhancing the efficiency of term lookups.
For suffix searches, Lucene can use a modified index with reversed term characters to apply the same prefix search logic.
Lucene also handles numeric terms and geospatial data with specialized data structures, going beyond text-based searches.
Elasticsearch extends Lucene by scaling it across a multi-node cluster, using replication and partitioning to manage data distribution.
Elasticsearch creates local inverted indexes on each node, relevant only to the documents on that node, for efficient querying.
Elasticsearch benefits from caching strategies, including OS caching of index pages and caching of query results and components.
Caching in Elasticsearch contributes to high read throughput and fast performance, even for complex queries.
Search indexes enable large applications to find text strings much faster than traditional database scans, leveraging SS tables and inverted indexes.
Elasticsearch and Solr demonstrate the successful scaling of Lucene in a distributed environment, enhancing search capabilities.
The video creator emphasizes the importance of understanding search index implementations to avoid past mistakes and guide future development.
Transcripts
alrighty i am back i'm gonna be a little
bit behind schedule today just because
my roommate would not leave bed all day
and now i can finally record now that
he's gone so uh we're gonna talk about
search indexes which are a pretty
important part of any large-scale
company they pretty much all use them in
one sense or another so i'll talk about
the internals of those how they work and
then we can get through this video and
on to the next
alrighty search indexes what are they
well
a lot of companies have some sort of
search functionality embedded in their
application at one point or another i
mean if you're just taking google for
example it's pretty much their entire
business
amazon has the ability to search for
products based on a bunch of different
filters and you know you can do the same
on twitter for posts matching certain
keywords
but generally speaking databases aren't
great for these search functionalities
generally we want to use a different
type of data structure or
technology in general in order to
optimize for search functionality and
make it such that these very complex
queries can actually run really quickly
so what is lucene lucene is what's
known as a search index and it's
probably the most popular one it's been
around since 1999 and it's been open
sourced under the apache license
the actual search companies that people
tend to use which are you know
elasticsearch and solr
use lucene under the hood and as a
result of that i'm going to first
explain lucene and then i'll go into
elasticsearch and kind of how it builds
on that
so let's talk about the architecture of
lucene
generally speaking it uses an lsm tree
and ss table architecture and that's one
of the ways that it remains highly
performant so obviously first writes are
going to be sent to an in-memory buffer
the one difference here which is a
little bit nuanced for lucene is that
once the writes are sent to that buffer
you still can't read them just yet they
have to actually be propagated to the
disk first and in lucene this tends to
happen once every second it's called
like a refresh period
so eventually the buffer is going to get
written to an immutable ss table index
file and then once there are a bunch of
ss table index files those are going to
get compacted and merged together which
is pretty typical for what we've seen
and then in terms of reads reads
actually have to be potentially sent to
multiple ss table files because each ss
table is kind of covering a different
set of documents and then once you get
the result of those reads you're going
to go ahead and merge them
okay so let's actually go ahead and
continue talking about this lucene
architecture because the ss table part
is only one aspect of it
so what actually happens when a document
is added to lucene well it has to be
tokenized which means that you're taking
all of the words or the terms in the
document and splitting them apart based
on some sort of rule typically you're
just splitting based on the white space
but then there are a few nuances as well
in terms of how we actually want to
handle the document because we don't
want to be mapping it to a bunch of
terms that don't really apply or that we
don't care about so obviously we have to
handle the punctuation
we have to handle you know the case of
words maybe it'd be better if we just
treated all words if they were lowercase
handling contractions such as like it's
or they're
you know with the apostrophe and then
also handling common words like the or
and maybe we don't really care about
queries on these and we would rather
just filter them out
these are all also problems that occur
in natural language processing but just
tokenizing in general obviously you want
to be conscious of how you're actually
doing that because you want to be able
to search for these terms later and
you know if there are nuances within the
terms themselves you just have to think
about you know how am i actually going
to be tokenizing this document
okay so here's kind of the bread and
butter of a search index and this is
called an inverse index so inverse
indexes work by basically doing the
following once the document has been
tokenized we know that now it applies to
a list of terms so instead of mapping a
document id to a list of terms what we
actually do instead is take each term
and map it to a list of document ids
that contain it like i said this is
known as the inverted index and these
are what the ss table files are
generally containing
so that means that they're going to be
sorted by term by the way
but what if we want to just go beyond
looking at just one term itself and say
you know get all the terms with a prefix
of p e in this case
so let's start by finding all the
documents with those terms or at least
the terms that have that prefix and in
order to do so we can just run a binary
search on this sorted table and the way
you would probably do this is do two
binary searches where one gets the upper
bound so p-e-e in this case then one
gets the lower bound p-e-n
so you know that's how you would go
ahead and do that and that way we can
kind of find all the documents matching
this prefix in
log(n) time
but generally speaking if we want to
just do searches that are more
complicated than just by term prefix we
have to add some additional logic and
here a couple of examples of that
what if we wanted to search by term
suffix so for example instead of
searching by p e we wanted to search for
all terms with e n
well what we could actually do here is
create a second modified index where
instead of storing the terms themselves
in the inverse index you store the
reversed characters of the term so that
way you can basically do the same prefix
searching logic but just with the suffix
and you go ahead and reverse that and
then run your binary search
another thing that's kind of similar is
doing this with numeric terms where you
actually look at kind of the the digits
of the number itself and that way you
can kind of say like oh you know like
let's look at the numbers 120 and 123
well they're kind of similar because
um there's actually or i don't know
maybe like
it's the fact that they basically have a
bunch of digits in common in the same
places and at the same start but i think
actually in reality lucene does use a
different data structure for things like
numbers and geospatial data so maybe
that's the topic for another video but
ultimately the whole point here is that
lucian can do a lot of really cool
search functionality and it's not just
on text but like i said numbers
you can do
text searching of similar words using
something called levenshtein distance
which is you know done via some dynamic
programming algorithm and then also on
geolocation data which is a topic i will
cover in another video
okay so what is elasticsearch well
elasticsearch is basically taking
lucene and scaling it across a
multi-node cluster
so the way that we do this obviously is
going to be through some combination of
replication and partitioning
so we do have replication where each uh
thing that's being replicated is an
actual individual lucene index over a
bunch of documents
but also we have partitioning as well so
each index um you know can be sent to
certain nodes in the cluster and then
other index might be sent to different
nodes in the cluster and it's kind of in
this way that the data set is divided up
um so in that way elasticsearch
basically creates a bunch of local
inverted indexes keep in mind these are
not global inverted indexes it's not
like every single term or for a given
term we have every single document id of
it and then
that index is term partitioned but
rather it's document partitioned meaning
that
the index that's held on a given node is
only relevant for the documents on that
actual node so if you think about it
this has
you know obviously its advantages and
disadvantages it means that if you want
to search for terms across multiple
shards then you're actually going to
have to reach out to all of those shards
and merge together those results
but it also means that say we're you
know trying to implement a search
function on like a log functionality if
we put all of those log entries from the
same day on the same shard then we can
really quickly go ahead and make writes
to that shard and in addition to that
query it super easily
another huge benefit of elasticsearch
over just like plain old lucene is
that it implements caching really well
and this is kind of how it achieves such
fast performance
so not only is there just like operating
system caching of index pages in memory
so if you think about it an ss table is
just a page on disk and um you know the
way that operating systems work is they
actually will tend to cache frequent
disk accesses so that way they can speed
up the actual you know accessing of the
index itself
but in addition to that elasticsearch
builds some functionality on top of that
where for a query on a given shard
basically it'll go ahead and cache the
result of that so you're not just
caching the index itself but say any
logic that you're doing with the index
or you know anything that you're adding
on top of that in the event that you're
going to end up doing that query again
and then furthermore
in addition to just caching the result
of the query um elasticsearch will
actually kind of cache components of
each query so that you know if a
separate query were to come along later
for a given shard and try and reuse some
of that data
then you know this could go ahead and be
really useful and you know if there are
parts of the queries that kind of
intersect
we already have that in our cache so
it's kind of in this sense that
elasticsearch is able to achieve really
high read throughput
okay ultimately search indexes are a
super important part of many large
applications
and it means that they can basically
find strings of text in a manner that's
much faster than just a traditional scan
over a database and kind of you know
using substring logic to try and find
the actual string itself
using ss tables in conjunction with an
inverted index turned out to be a really
good strategy for building a fast search
index and through
you know technologies such as
elasticsearch and solr
people have been able to go ahead and
scale out lucene in a manner such that
it works in a distributed environment
which is really great
also i'm really happy that i made this
video because i personally have looked
like a bozo in the past when bringing up
search indexes and then not knowing
about exactly how they were implemented
and then i felt like a real so
hopefully you guys don't make the same
mistake and i will be seeing you guys in
the next one