How Indexes Work in Distributed Databases: Trade-offs and Challenges
Summary
TL;DR: The transcript discusses the importance of indexing in database management, particularly for distributed data stores. It explains how traditional indexes accelerate lookups on secondary attributes, and introduces sharding and partitioning as the way to spread large volumes of data across multiple nodes. It walks through practical examples, such as using an author ID as the partition key for a Medium-like blogging platform, and how that choice affects querying. It then explores global secondary indexes (GSIs) for efficiently querying data across shards on secondary attributes such as blog category, contrasting them with local secondary indexes, which are more efficient for queries that include the partition key. It also covers the trade-off between storing only primary-key references versus entire objects in a GSI, the challenges of keeping GSIs in sync, and the limitations of local secondary indexes. The speaker encourages viewers to explore the fascinating domain of distributed databases and to prototype an index implementation.
Takeaways
- 🔎 Indexes are used to speed up database lookups and are typically created on secondary attributes.
- 📚 When databases are sharded and partitioned, data is spread across multiple nodes to handle large volumes efficiently.
- 📈 Practical examples, like a blogs database, illustrate how indexing works in distributed data stores.
- 🔑 A partition key is essential for determining which node will handle specific data in a sharded database.
- 🗃️ Hash functions are used to map data to the appropriate shard based on the partition key.
- 🔍 Queries on the partitioning key are efficient because they can be directed to the correct shard without needing to search all nodes.
- 📚 For queries not involving the partition key, such as searching by category, the database must fan out requests to all shards, which is slower.
- 🌐 Global secondary indexes provide a solution by maintaining a separate index for secondary attributes, improving query efficiency.
- 📈 Global secondary indexes can either store references to primary keys or entire objects, offering a trade-off between space and performance.
- 🔄 Keeping global secondary indexes in sync with the main data can be challenging and expensive, especially with frequent updates.
- 📚 Local secondary indexes, on the other hand, are limited to a shard and are useful when queries always include the partition key, ensuring strong consistency.
Q & A
What is the primary purpose of creating an index in a database?
-The primary purpose of creating an index in a database is to speed up the lookup process, especially on secondary attributes, which can significantly improve the performance of queries.
How does sharding and partitioning a database help with handling large volumes of data?
-Sharding and partitioning a database involve splitting the data and distributing it across multiple data nodes. This helps to manage large volumes of data and distribute the load, ensuring that no single node is overwhelmed by the data it needs to handle.
What is a partition key and why is it important in a distributed data store?
-A partition key is an attribute of the data (such as an author ID) used to determine which shard or data node stores a particular piece of data. It is crucial because it dictates how data is distributed across the database shards, enabling efficient storage and retrieval.
How does a hash function play a role in determining the shard for a specific piece of data?
-A hash function is used to process the partition key, such as an author ID, to determine which shard the data should be stored in. It does this by generating a hash value that corresponds to a specific shard, ensuring that data with the same partition key is stored together.
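The routing described above can be sketched in a few lines. This is a toy illustration rather than any particular database's implementation; the shard count and function names are invented.

```python
import hashlib

NUM_SHARDS = 3  # assumed cluster size for this toy example

def shard_for(partition_key: str) -> int:
    """Map a partition key (e.g. an author ID) to a shard number."""
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so it would route the same key differently across restarts.
    digest = hashlib.sha256(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same author ID always maps to the same shard, so all blogs by
# one author are co-located on a single node.
assert shard_for("author-42") == shard_for("author-42")
assert 0 <= shard_for("author-7") < NUM_SHARDS
```

Because the mapping is deterministic, both writes and partition-key reads can be routed to exactly one shard without consulting the others.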
What is the main challenge when querying for data based on a non-partitioning key in a sharded database?
-The main challenge is that the database proxy must fan out the request to all nodes, execute the query on each node, and then merge the results before sending them back to the user. This process can be slow, inefficient, and can lead to incomplete results or timeouts if a shard is slow or unavailable.
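A minimal sketch of that fan-out (scatter-gather) flow, with in-memory lists standing in for the shards; the data and names are invented for illustration.

```python
# Two toy "shards", each holding the rows of the authors hashed to it.
shards = [
    [{"blog_id": 1, "author": "u1", "category": "mysql"},
     {"blog_id": 9, "author": "u1", "category": "go"}],
    [{"blog_id": 4, "author": "u3", "category": "nginx"},
     {"blog_id": 7, "author": "u3", "category": "mysql"}],
]

def blogs_by_category(category: str) -> list[dict]:
    """Proxy-side scatter-gather: query every shard, then merge."""
    results = []
    for shard in shards:  # the query cannot be routed: every shard must answer
        results.extend(row for row in shard if row["category"] == category)
    return results

# "mysql" blogs live on both shards, so both partial results are merged.
assert sorted(b["blog_id"] for b in blogs_by_category("mysql")) == [1, 7]
```

The merge step cannot complete until the slowest shard has answered, which is exactly why one slow or dead shard degrades every such query.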
What is a global secondary index and how does it help with querying on a secondary attribute?
-A global secondary index is a separate index that is partitioned by a secondary attribute, such as a category. It allows for efficient querying on that attribute without the need to fan out requests across all shards. This is particularly useful when the query does not involve the partition key.
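A sketch of how such an index changes routing (a hypothetical in-memory model, not DynamoDB's actual layout): the index itself is partitioned by category, so a category query touches exactly one index partition instead of every data shard.

```python
# The GSI is a second, independently partitioned structure: here, one
# dict per index shard, keyed by the secondary attribute (category).
gsi_shards = [
    {"mysql": [1, 7]},          # index shard 0: all "mysql" blog IDs
    {"go": [9], "nginx": [4]},  # index shard 1
]

def index_shard_for(category: str) -> int:
    # Toy routing rule for this example; a real store hashes the category.
    return 0 if category == "mysql" else 1

def blog_ids_for(category: str) -> list[int]:
    shard = gsi_shards[index_shard_for(category)]  # one shard, no fan-out
    return shard.get(category, [])

assert blog_ids_for("mysql") == [1, 7]  # answered by a single index shard
```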
How does storing the entire blog object in a global secondary index affect query performance?
-Storing the entire blog object in a global secondary index can improve query performance because it reduces the need for additional lookups in the data shards. However, it also increases the index size, which can lead to a tradeoff between space efficiency and query speed.
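The two layouts can be contrasted in a few lines, with toy dictionaries standing in for the index and the data shards; the structures are invented for illustration.

```python
# Data shards: blog_id -> full object.
data = {1: {"blog_id": 1, "category": "mysql", "title": "Intro"},
        7: {"blog_id": 7, "category": "mysql", "title": "Tuning"}}

# Option A: key-only index (small index, but each query needs a second lookup).
gsi_keys_only = {"mysql": [1, 7]}

# Option B: full projection (bigger index, single lookup).
gsi_full = {"mysql": [data[1], data[7]]}

def query_keys_only(category: str) -> list[dict]:
    ids = gsi_keys_only.get(category, [])   # lookup 1: index shard
    return [data[i] for i in ids]           # lookup 2: data shards

def query_full(category: str) -> list[dict]:
    return gsi_full.get(category, [])       # single lookup

assert query_keys_only("mysql") == query_full("mysql")
```

Both paths return the same rows; the difference is one round trip versus extra index storage, the space-versus-speed trade-off the answer describes.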
What is the difference between a global secondary index and a local secondary index?
-A global secondary index is a separate index that is partitioned by a secondary attribute and can be queried across all shards. A local secondary index, on the other hand, is specific to a shard and is used when the query includes the partition key. It allows for efficient querying without the need for a global index.
Why are global secondary indexes considered expensive to manage and maintain?
-Global secondary indexes are expensive to manage and maintain because they require synchronization with the main data. Any update to the main data must also be reflected in the index, which can be resource-intensive, especially if there are many indexes or a high volume of updates.
What is the typical limit on the number of global secondary indexes that can be created in a distributed database?
-Many distributed databases limit the number of global secondary indexes that can be created in order to manage the cost of maintenance and synchronization. A typical limit is between 5 and 7 global secondary indexes.
How does the concept of strong consistency affect the implementation of secondary indexes?
-Strong consistency requires that all updates to the data are immediately reflected in the indexes, ensuring that the data and indexes are always in sync. This can be challenging to achieve, especially with a large number of global secondary indexes, and is one of the reasons why these indexes can be expensive to maintain.
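A sketch of that synchronous maintenance cost: every write must update the table and every index in the same step, which is why each additional GSI makes writes more expensive. All structures here are invented for illustration.

```python
table: dict[int, dict] = {}                 # main data, keyed by blog_id
gsi_by_category: dict[str, set[int]] = {}   # one of possibly many GSIs

def put_blog(blog: dict) -> None:
    """Write path: update the main table AND the index synchronously."""
    old = table.get(blog["blog_id"])
    if old is not None:
        # An update may move the row to a different index partition,
        # so the stale index entry must be removed first.
        gsi_by_category[old["category"]].discard(old["blog_id"])
    table[blog["blog_id"]] = blog
    gsi_by_category.setdefault(blog["category"], set()).add(blog["blog_id"])

put_blog({"blog_id": 1, "category": "mysql"})
put_blog({"blog_id": 1, "category": "go"})  # category changed on update
assert gsi_by_category["go"] == {1} and gsi_by_category["mysql"] == set()
```

With N indexes the loop body runs N+1 updates per write; relaxing the index update to be asynchronous trades this cost for eventual consistency.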
Outlines
🚀 Introduction to Indexing and Sharding in Databases
The first paragraph introduces the concept of indexes in databases, explaining how they speed up data retrieval, especially when databases are sharded and partitioned to manage large volumes of data across multiple nodes. It uses the example of a blogs database, illustrating how a sharding strategy using an author ID as a partition key can distribute blog data efficiently. The paragraph also discusses the process of querying data based on the partition key and the challenges that arise when querying on non-partition key attributes, such as blog categories, which requires querying across all shards.
🌟 Understanding Global Secondary Indexes
The second paragraph delves into global secondary indexes as a solution to the problem of querying on non-partition key attributes. It explains that a global secondary index is a separate index that is partitioned by the secondary attribute of interest, such as categories in the blogs database example. This allows for efficient querying of data across shards without the need to query each individual shard. The paragraph also discusses different implementations of global secondary indexes, such as storing only the primary key reference or the entire object, and the trade-offs between space and performance.
🔄 Trade-offs and Management of Global Secondary Indexes
The third paragraph discusses the trade-offs associated with using global secondary indexes (GSI), such as the increased storage requirements and the need to maintain index synchronization with the main data. It highlights the challenges of managing GSIs, including the potential performance impact of updates and the limitations that databases often impose on the number of GSIs that can be created. The paragraph also introduces local secondary indexes as an alternative for queries that include the partition key, providing a more efficient and consistent solution for such specific query patterns.
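A local secondary index can be sketched as a per-shard structure keyed by (partition key, secondary attribute). This is a toy model with invented names, not any specific database's API.

```python
from collections import defaultdict

class Shard:
    """One data shard that maintains its own local secondary index."""
    def __init__(self) -> None:
        self.rows: list[dict] = []
        # LSI: (author, category) -> rows, local to this shard only.
        self.lsi: defaultdict = defaultdict(list)

    def put(self, row: dict) -> None:
        self.rows.append(row)
        self.lsi[(row["author"], row["category"])].append(row)

    def query(self, author: str, category: str) -> list[dict]:
        # The partition key (author) is in the query, so routing already
        # picked this shard; the answer comes from one node, no fan-out.
        return self.lsi[(author, category)]

shard1 = Shard()
shard1.put({"author": "u1", "blog_id": 1, "category": "mysql"})
shard1.put({"author": "u1", "blog_id": 9, "category": "go"})
assert [r["blog_id"] for r in shard1.query("u1", "mysql")] == [1]
```

Because the index and the rows live on the same node, a write can update both atomically, which is what makes the strong consistency of local secondary indexes cheap.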
📚 Conclusion and Further Exploration
The final paragraph concludes the discussion on indexes in distributed databases, emphasizing the importance of understanding and implementing indexing strategies. It encourages further exploration of the topic, suggesting that readers prototype an indexing solution to gain a deeper understanding. The speaker also references a previous video for more information on how indexes are created using a B+ tree and invites viewers to watch it for additional insights.
Keywords
💡Indexes
💡Database Sharding
💡Partition Key
💡Hash Function
💡Global Secondary Indexes (GSIs)
💡Local Secondary Indexes
💡DynamoDB
💡Data Sharding
💡Strong Consistency
💡B+ Tree
Highlights
Indexes enhance database lookup speed and are typically created on secondary attributes.
Sharding and partitioning a database is essential for handling large volumes of data.
A practical example of indexing in a distributed data store is managing a large blogs database.
Choosing a partition key, such as author ID, is crucial for data distribution across nodes.
Hash functions are used to determine the database node where data should reside.
Queries on the partitioning key are efficient due to data being stored on the same node.
Global secondary indexes are introduced for efficient querying on secondary attributes like categories.
Global secondary indexes are maintained separately and partitioned by the secondary attribute.
Local secondary indexes are used when queries always contain the partition key.
Local secondary indexes provide strong consistency and are limited to a single shard.
Global secondary indexes can store either references to primary keys or entire objects for faster retrieval.
There's a tradeoff between space usage and query performance when deciding how much data to store in a global secondary index.
Maintaining global secondary indexes can be expensive due to the need for synchronization with the main data.
Many databases limit the number of global secondary indexes that can be created to manage costs.
The choice between global and local secondary indexes depends on the query patterns and requirements.
Distributed databases often have a form of secondary indexes, either explicitly or implicitly implemented.
Prototyping and understanding the implementation of indexes in distributed databases is a fascinating domain.
The presenter encourages viewers to explore and prototype index implementations for a deeper understanding.
Transcripts
So, indexes make your database lookups faster, and we typically create an index on secondary attributes. Things become really interesting when your database is sharded and partitioned. For example, when you have a large volume of data and one node is not able to handle the load, what do you do? You shard the database: you partition the data and place it across multiple data nodes.

Now let's take a practical example to understand how indexing works in the case of a distributed data store. Say you have a blogs database holding a large number of blogs; say you're building a Medium-like application in which tons and tons of blogs are published. Given the large volume of data, one node will not be able to handle the load, which means we would have to create multiple partitions of the data and place them across multiple shards. To do that, we need to pick a partition key on the basis of which we split the data. Let's say we pick author ID as the partition key. Given a blog object and an author ID, we determine which of, say, three nodes is most capable of handling it using, for example, a hash function: we take the author ID, pass it through the hash function, and we know which database shard that key resides in, so we go to that shard and place the data. This is the classic way to do hash-based partitioning; you can also go for range-based partitioning or consistent hashing, pick your favorite implementation there.
But things become really interesting when we are looking for something specific. Say I want to fire this query: given a user ID, get me all the blogs of that particular user. The flow is really easy. Because my partitioning key is author ID, given a user ID (which is the author ID), I figure out which data shard the data is present on: I pass it through the hash function, and whatever it spits out, I go to that node and fire the query, SELECT * FROM blogs WHERE author_id = <id>. I get all the blogs listed for that user and send them back. It's a pretty straightforward query that works like a charm. This worked really well because we were querying on the partitioning key itself. But now
let's take another example. Say we are not looking for the blogs of a particular user; instead, every blog has a category, a topic the blog belongs to, say MySQL, NGINX, Go, Python, whatnot. What we want to query is: get all the blogs that belong to a particular category, say the MySQL category.

Let me take a concrete example and say I have two data shards in which four blog items are distributed, two per shard. User U1 wrote a blog with ID 1 belonging to category MySQL, with some title and body; because U1 passed through the hash function spits out 1, I put it on shard 1. Similarly, U1 wrote another blog, with ID 9, on the Go topic, with some title and body; it also resides on shard 1, because the user ID is U1. Now say U3 wrote two blogs, one on NGINX and one on MySQL; they go to shard 2, because U3 passed through the hash function spits out 2.

Given this, the naive solution to get all the blogs tagged with a particular category, say MySQL, runs into a problem. We can clearly see that the blogs tagged MySQL are not present on one node; they are present on both nodes, shard 1 and shard 2. So what happens when the request hits the database proxy? It needs to fan out the request to all the nodes, fire the query on each node, get the responses, merge them, and send the result back to the user. For every query like this ("get all the blogs tagged with a particular category"), I have to fan out my request across all the database shards, combine the results, and send them back to the user.

Now, this obviously (even while explaining it, it felt really slow) is really slow, and there are a bunch of risks involved. Risk number one: what if one of the shards is overburdened? You made the request, it went to both shards in parallel, but one of the shards is slow, which means that although one shard responded quickly, you have to wait for the second shard to respond before you can emit the response to the user. So if one shard is slow, it affects the user experience. Worse: what if one shard is dead? Either you wait until the timeout happens, or you send an incomplete result. That is another risk. Third, given that you might be paginating on this, you are firing the query on both data shards and getting the results; a huge amount of data is transferred here, and it eventually gets filtered out or paginated before it is sent to the user. So this is also expensive. So, given
this, there has to be a better way to index the data, which is where we get introduced to the concept of global secondary indexes. So what do we do? Given that our query is "for a particular category, give me all the blogs that belong to it," we need to maintain a separate index, a global index, for the secondary attribute (category) somewhere. Most databases abstract this for you; for example, DynamoDB calls it a global secondary index, and it is like a secondary table that you have. That's the whole idea: you create a global secondary index which holds your index but is partitioned by the secondary attribute you want to query on, the category attribute in this case. So we create an index which can itself be sharded internally (that's not a problem), but it is partitioned by its category attribute. All the posts that belong to, say, MySQL pass through the hash function and spit out the same value, so all MySQL posts come to shard 3 and all Go posts go to shard 4, for example.

Here I've drawn this as separate shards, but these are not necessarily a separate set of machines; the index could co-reside with your existing data nodes. It totally depends on the implementation. For simplicity I've drawn it as a separate cluster, but you might not need to do that, because the database abstracts these things out for you: it can decide to co-locate the index on the data nodes, store it as a separate B+ tree on disk, or however it wants to do it. The idea is that the logical separation is very clear: where your global secondary index resides versus where your data resides. It's a logical separation, not a physical separation.

Okay, now, given that we have the data stored this way, if we are looking for "given a category, give me all the blogs that belong to it," I would directly fire the query at this global secondary index, because this data is already partitioned by the category I'm looking for. So if I want all the blogs that belong to category MySQL, I can just fire a query like SELECT * FROM blog_category_gsi WHERE category = 'mysql'. When the request goes out, I can fire it at the right node because I know where MySQL would be present: passing the category through the hash, I know the shard ID, I go there, fire the query, get the blog IDs, get the data from the data shards, and respond back. This is how simple it becomes. What we did is: from the data shards that we had, we created a global secondary index on an attribute we wanted to query on, and on this index we fire the query SELECT * FROM blog_category_gsi WHERE category = 'mysql'. Again, I wrote a SQL query here; it depends on the database what it exposes, I've just made it readable. Now, if you look carefully, because the global secondary index is actually partitioned by the category, the query only needs to go to one instance, get the blog IDs, then go to the data shards, read the actual objects, and send them back to the user. You don't need to query multiple shards for this: you literally fire one query, get the IDs, go to the place where the blog details are, combine the results, and send them back. It makes your life really easy and your query really efficient. Now, this is where you have multiple implementations. First, either your
global secondary index can store just a reference to the primary key; for example, if I'm creating a global secondary index on category, I can choose to store only the category and the row ID (the blog ID). Or I can choose to store the entire blog object there. When you have multiple choices, you evaluate both, and a database can choose to implement either one of these ways. Say we store just the primary key in the index: when the request comes to the DB proxy, the proxy will first go to the index shards and get the blog IDs, then go to the corresponding data shards for the objects you want, get the blog details, and send them back to the user. There is no unnecessary fetching of data from the data shards; you fetch only the data you require, and that's really nice, really efficient. Second, if we choose to store all the attributes of a particular row in the global secondary index, which means the entire document is repartitioned and indexed there, then you have not only the blog ID but the entire blog object: the request comes in, you go to the index, get the data, and immediately send it back to the user. There is no need to look up the data shard, because your entire document resides in the global secondary index itself. Both of these options are available if you choose DynamoDB; you can set this as a configuration when you're creating a global secondary
index. Okay, now here we see a classic trade-off. If we store just the primary-key reference, you have to do one lookup on the index shards and then, depending on the documents you received for those primary keys, go to the data shards, read the corresponding documents, and send them back to the user; you are doing multiple lookups. But if you store all the attributes in the GSI, that is, the entire document, you are bloating up the index size while reducing a lookup. It's a classic space-versus-time trade-off that you may want to weigh here.

Another challenge that comes in is that you need to keep the GSI in sync when the main data is manipulated. Say an update happens that changes a particular document: you have to update the index as well, and this needs to be done synchronously, because most databases do offer strong consistency with indexes. That becomes another problem: if you have a large number of GSIs, your updates and writes take a hit and become really expensive, because now you have to update not just the main data shard but your index shards along with it. This is why global secondary indexes are expensive to manage and maintain, and why a lot of databases actually limit the maximum number of GSIs you can create. They don't allow you to create any number of GSIs you want; they restrict the number, and typically 5 to 7 is the sweet spot. But you could build a database tomorrow that allows more GSIs than that, for example by making the index updates eventually consistent if you want to.

So this is about global secondary indexes. DynamoDB has a very famous implementation, but pick any distributed database in the world and it will have a flavor of GSIs somewhere in its internal implementation, because by collocating the data in one place you are making your queries efficient; it's a very standard practice out there. Now, what is the opposite
global secondary index it's local
secondary index so what if we want to
query that give me all the blogs for a
particular category from a particular
author what if this is the only type of
query we would have we would never
hypothetically assume that we would
never be firing a query that give me all
the blogs for a particular category but
we would also like sorry we would always
be quering that for a particular
category for a particular user give me
all the blogs so in that case given that
our query actually contains the
partition key what we can create is we
can create a local secondary index
rather than a global secondary index so
here if your partition key is always
going to be part of your query in that
case you do not need to create a global
secondary index but you can create a
local local secondary index and this
local secondary index would be localized
to a particular Shard so given that in
Shard one you had all the documents of
user U1 which is all the blogs of user
U1 you can create a local index out of
it on a local B+ Tre and this index is
good enough for you to answer your query
that for a particular user for a
particular category give me all the
blogs right it would be answered from
the single node itself no need to Fan
out request and get the response back
right and this is an advantage that you
get so depending on your query pattern
depending on the query Clause that you
would be firing you need to decide if
you need a local secondary index for
this or a global secondary index for
this by default any and every
distributed database in the world would
have a flavor of this either explicitly
exposed to the to us the consumer of the
database or it would be implicitly
implemented by the
database right so depending on the
database you are picking go through the
documentation and figure it out but if
you look carefully the local secondary
index it being local what it does it is
easy to ensure strong consistency for
that because your rights are going to
the same instance where your index is
placed so you can have a very strong
consistent implementation over here the
response will always come from a single
not no need to do fan out right but you
are limited by the local chart so when
you have let's say you are having a
local secondary index on category you
would never be able to fire an efficient
query that given a category give me all
the blogs for that you would always be
firing given a category and the user
give me all the blogs so that is a
limitation of it but if your query is
always going to be with respect to a
partition key local secondary index
gives you a really good boost rather
than creating a global secondary index
for that right so this is what I wanted
to cover as part of indexes. This is a very fundamental concept in any distributed database in the world: either they are explicitly exposing it or implicitly managing it. Either way, pick up any database, go through its internals (it's a fascinating domain), and try to implement this yourself; it's a really easy piece to implement, so if you find time, go ahead and prototype this thing. It's quite fun, to be honest. And if you're interested in going deeper into how an index is created using a B+ tree, I already have a video on it; I've linked it in the i-card and in the description below, so feel free to check that out. And yeah, this is all I wanted to cover in this one. I hope you found it interesting, hope you found it amazing. That's it for this one; I'll see you in the next one. Thanks!