How indexes work in Distributed Databases, their trade-offs, and challenges

Arpit Bhayani
1 Mar 2024 · 16:20

Summary

TL;DR: The video discusses the importance of indexing in database management, particularly in distributed data stores. It explains how traditional indexing accelerates lookups by creating indexes on secondary attributes, and introduces sharding and partitioning as a way to spread large volumes of data across multiple nodes. Using a Medium-like blogging platform as a running example, it shows how an author ID can serve as the partition key and how that choice shapes query routing. It then explores global secondary indexes (GSIs) for efficiently querying across shards on secondary attributes such as blog categories, contrasting them with local secondary indexes, which are more efficient when queries include the partition key. The discussion also covers the trade-off between storing only primary-key references versus entire objects in a GSI, the cost of keeping GSIs in sync with the main data, and the limitations of local secondary indexes. The speaker encourages viewers to explore the domain of distributed databases further and to prototype these ideas themselves.

Takeaways

  • 🔎 Indexes are used to speed up database lookups and are typically created on secondary attributes.
  • 📚 When databases are sharded and partitioned, data is spread across multiple nodes to handle large volumes efficiently.
  • 📈 Practical examples, like a blogs database, illustrate how indexing works in distributed data stores.
  • 🔑 A partition key is essential for determining which node will handle specific data in a sharded database.
  • 🗃️ Hash functions are used to map data to the appropriate shard based on the partition key.
  • 🔍 Queries on the partitioning key are efficient because they can be directed to the correct shard without needing to search all nodes.
  • 📚 For queries not involving the partition key, such as searching by category, the database must fan out requests to all shards, which is slower.
  • 🌐 Global secondary indexes provide a solution by maintaining a separate index for secondary attributes, improving query efficiency.
  • 📈 Global secondary indexes can either store references to primary keys or entire objects, offering a trade-off between space and performance.
  • 🔄 Keeping global secondary indexes in sync with the main data can be challenging and expensive, especially with frequent updates.
  • 📚 Local secondary indexes, on the other hand, are limited to a shard and are useful when queries always include the partition key, ensuring strong consistency.

Q & A

  • What is the primary purpose of creating an index in a database?

    -The primary purpose of creating an index in a database is to speed up the lookup process, especially on secondary attributes, which can significantly improve the performance of queries.

  • How does sharding and partitioning a database help with handling large volumes of data?

    -Sharding and partitioning a database involve splitting the data and distributing it across multiple data nodes. This helps to manage large volumes of data and distribute the load, ensuring that no single node is overwhelmed by the data it needs to handle.

  • What is a partition key and why is it important in a distributed data store?

    -A partition key is the attribute used to determine which shard or data node stores a particular piece of data. It is crucial because it dictates how data is distributed across the shards, which in turn determines how efficiently data can be stored and retrieved.

  • How does a hash function play a role in determining the shard for a specific piece of data?

    -A hash function is used to process the partition key, such as an author ID, to determine which shard the data should be stored in. It does this by generating a hash value that corresponds to a specific shard, ensuring that data with the same partition key is stored together.

  • What is the main challenge when querying for data based on a non-partitioning key in a sharded database?

    -The main challenge is that the database proxy must fan out the request to all nodes, execute the query on each node, and then merge the results before sending them back to the user. This process can be slow, inefficient, and can lead to incomplete results or timeouts if a shard is slow or unavailable.

  • What is a global secondary index and how does it help with querying on a secondary attribute?

    -A global secondary index is a separate index that is partitioned by a secondary attribute, such as a category. It allows for efficient querying on that attribute without the need to fan out requests across all shards. This is particularly useful when the query does not involve the partition key.

  • How does storing the entire blog object in a global secondary index affect query performance?

    -Storing the entire blog object in a global secondary index can improve query performance because it reduces the need for additional lookups in the data shards. However, it also increases the index size, which can lead to a tradeoff between space efficiency and query speed.

  • What is the difference between a global secondary index and a local secondary index?

    -A global secondary index is a separate index that is partitioned by a secondary attribute and can be queried across all shards. A local secondary index, on the other hand, is specific to a shard and is used when the query includes the partition key. It allows for efficient querying without the need for a global index.

  • Why are global secondary indexes considered expensive to manage and maintain?

    -Global secondary indexes are expensive to manage and maintain because they require synchronization with the main data. Any update to the main data must also be reflected in the index, which can be resource-intensive, especially if there are many indexes or a high volume of updates.

  • What is the typical limit on the number of global secondary indexes that can be created in a distributed database?

    -Many distributed databases limit the number of global secondary indexes that can be created to manage the cost of maintenance and synchronization. The typical limit is between 5 and 7 global secondary indexes.

  • How does the concept of strong consistency affect the implementation of secondary indexes?

    -Strong consistency requires that all updates to the data are immediately reflected in the indexes, ensuring that the data and indexes are always in sync. This can be challenging to achieve, especially with a large number of global secondary indexes, and is one of the reasons why these indexes can be expensive to maintain.

Outlines

00:00

🚀 Introduction to Indexing and Sharding in Databases

The first paragraph introduces the concept of indexes in databases, explaining how they speed up data retrieval, especially when databases are sharded and partitioned to manage large volumes of data across multiple nodes. It uses the example of a blogs database, illustrating how a sharding strategy using an author ID as a partition key can distribute blog data efficiently. The paragraph also discusses the process of querying data based on the partition key and the challenges that arise when querying on non-partition key attributes, such as blog categories, which requires querying across all shards.
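To make the fan-out concrete, here is a minimal scatter-gather sketch in Python. The shard clients, their query_by_category() method, and the timeout are assumptions for illustration; the video itself shows no code.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out_query(shards, category, timeout_s=2.0):
    """Scatter a non-partition-key query to every shard, then merge the results.

    `shards` is assumed to be a list of client objects exposing
    query_by_category(category). A slow or dead shard either delays the
    whole response or leaves the merged result incomplete."""
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(s.query_by_category, category) for s in shards]
        for fut in futures:
            try:
                results.extend(fut.result(timeout=timeout_s))
            except Exception:          # timeout, dead shard, or query error
                failed.append(fut)
    return results, failed             # caller decides: retry, wait, or return partial data
```

The cost is visible in the shape of the code: every shard does work for every such query, and the slowest shard bounds the response time.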

05:00

🌟 Understanding Global Secondary Indexes

The second paragraph delves into global secondary indexes as a solution to the problem of querying on non-partition key attributes. It explains that a global secondary index is a separate index that is partitioned by the secondary attribute of interest, such as categories in the blogs database example. This allows for efficient querying of data across shards without the need to query each individual shard. The paragraph also discusses different implementations of global secondary indexes, such as storing only the primary key reference or the entire object, and the trade-offs between space and performance.
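To make the two storage options concrete (the layout below is invented purely for illustration), a category entry in the index can carry either a primary-key reference or the whole blog object:

```python
# Option 1: keys-only entries. The index stays small, but each read needs a
# second hop to the data shard that owns the referenced row.
gsi_keys_only = {
    "mysql": [("shard-1", "U1", "B1"), ("shard-2", "U3", "B7")],
}

# Option 2: full-object entries. A category query is answered from the index
# alone (no second hop), at the cost of duplicating every projected attribute.
gsi_full_object = {
    "mysql": [
        {"author_id": "U1", "blog_id": "B1", "title": "Indexing 101", "body": "..."},
        {"author_id": "U3", "blog_id": "B7", "title": "Replication",  "body": "..."},
    ],
}
```

Which option wins depends on how read-heavy the category queries are versus how much index bloat and write amplification the system can tolerate.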

10:00

🔄 Trade-offs and Management of Global Secondary Indexes

The third paragraph discusses the trade-offs associated with using global secondary indexes (GSI), such as the increased storage requirements and the need to maintain index synchronization with the main data. It highlights the challenges of managing GSIs, including the potential performance impact of updates and the limitations that databases often impose on the number of GSIs that can be created. The paragraph also introduces local secondary indexes as an alternative for queries that include the partition key, providing a more efficient and consistent solution for such specific query patterns.
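A hedged sketch of the write-amplification problem described above, with the data layout invented for illustration: every logical update has to synchronously patch each index that covers the changed attribute, which is why writes slow down as GSIs accumulate and why databases cap their number.

```python
def update_blog(data_shards, gsi_list, shard_id, key, new_attrs):
    """Apply an update on its data shard, then synchronously patch every GSI.

    Each GSI is modelled as a dict of category -> list of (shard, author, blog)
    references. With N indexes, one logical write becomes N+1 physical writes,
    plus coordination if strong consistency is promised."""
    row = data_shards[shard_id][key]
    old_cat = row.get("category")
    row.update(new_attrs)
    new_cat = row.get("category")
    if old_cat == new_cat:
        return
    ref = (shard_id, *key)
    for gsi in gsi_list:                        # this loop is the write amplification
        if ref in gsi.get(old_cat, []):
            gsi[old_cat].remove(ref)
        gsi.setdefault(new_cat, []).append(ref)
```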

15:02

📚 Conclusion and Further Exploration

The final paragraph concludes the discussion on indexes in distributed databases, emphasizing the importance of understanding and implementing indexing strategies. It encourages further exploration of the topic, suggesting that readers prototype an indexing solution to gain a deeper understanding. The speaker also references a previous video for more information on how indexes are created using a B+ tree and invites viewers to watch it for additional insights.


Keywords

💡Indexes

Indexes are a fundamental concept in database management that allows for faster data retrieval. They work by creating a data structure that improves the speed of data lookup operations. In the context of the video, indexes are essential for efficient querying in a distributed database environment. For example, when the database is sharded and partitioned, indexes on secondary attributes can dramatically speed up the process of locating specific data.

💡Database Sharding

Database sharding is the process of horizontally partitioning data across multiple servers. This is done to manage large volumes of data that one node cannot handle alone. In the video, sharding is used to distribute the load of a large number of blogs across multiple data nodes, ensuring that the system can scale and handle increased traffic.

💡Partition Key

A partition key is a value that determines how data is distributed across shards in a database. It is used to direct data to the appropriate shard based on the key's value. In the script, the author ID is chosen as the partition key to decide which node should handle a particular blog post, using a hash function to distribute the data evenly.

💡Hash Function

A hash function is an algorithm that takes input data and returns a fixed-size string of bytes, typically used for indexing data in a hash table. In the context of the video, a hash function is used to map the author ID to a specific shard, ensuring an even distribution of blog data across the database nodes.
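As a rough sketch of that routing step (the shard count and hash choice here are assumptions, not the video's code), a partition key can be mapped to a shard like this:

```python
import hashlib

NUM_SHARDS = 3  # assumed cluster size, purely for illustration

def shard_for(partition_key: str) -> int:
    """Hash the partition key (e.g. an author ID) down to a shard number."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every blog by the same author hashes to the same shard, so
# "all blogs of author U1" is always a single-shard query.
print(shard_for("U1"), shard_for("U1"), shard_for("U3"))
```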

💡Global Secondary Indexes (GSIs)

Global Secondary Indexes are a feature of some distributed databases that allow for efficient querying on non-primary key attributes. They are separate from the main table and are indexed by a secondary attribute, such as a category in the video's example. GSIs can improve query performance by reducing the need to query multiple shards, as they are designed to handle specific types of queries.
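A toy, in-memory model of how a keys-only GSI avoids the fan-out; shard contents and names are invented for the sketch:

```python
# Data shards partitioned by author; a global index partitioned by category
# that stores (shard, author_id, blog_id) references.
data_shards = {
    "shard-1": {("U1", "B1"): {"category": "mysql", "title": "Indexing 101"},
                ("U1", "B9"): {"category": "go",    "title": "Goroutines"}},
    "shard-2": {("U3", "B4"): {"category": "nginx", "title": "Reverse proxies"},
                ("U3", "B7"): {"category": "mysql", "title": "Replication"}},
}
gsi = {
    "mysql": [("shard-1", "U1", "B1"), ("shard-2", "U3", "B7")],
    "go":    [("shard-1", "U1", "B9")],
    "nginx": [("shard-2", "U3", "B4")],
}

def blogs_in_category(category):
    """One index lookup, then targeted reads from only the shards that matter."""
    return [data_shards[shard][(author, blog)]
            for shard, author, blog in gsi.get(category, [])]

print(blogs_in_category("mysql"))  # only the owning shards are read; no blind fan-out
```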

💡Local Secondary Indexes

Local Secondary Indexes are another type of index used in distributed databases, but unlike GSIs, they are specific to a particular shard and are used when the query includes the partition key. In the script, if queries are always going to be for a particular category and a specific user, a local secondary index on the category would be more appropriate and efficient.
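A small in-memory sketch of the "local" part (routing and data invented): each shard keeps its own category index, and it is usable only because the partition key (the author) pins the query to one shard.

```python
from collections import defaultdict

NUM_SHARDS = 2
# One local index per data shard: category -> blog keys living on THAT shard only.
local_indexes = [defaultdict(list) for _ in range(NUM_SHARDS)]

def shard_of(author: str) -> int:           # stand-in for the real hash routing
    return hash(author) % NUM_SHARDS

def index_locally(author: str, blog_id: str, category: str) -> None:
    local_indexes[shard_of(author)][category].append((author, blog_id))

def blogs_by_author_and_category(author: str, category: str):
    """Partition key is in the query, so route to one shard and use its local index."""
    entries = local_indexes[shard_of(author)][category]
    return [e for e in entries if e[0] == author]

index_locally("U1", "B1", "mysql")
index_locally("U1", "B9", "go")
print(blogs_by_author_and_category("U1", "mysql"))  # answered by a single shard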

💡DynamoDB

DynamoDB is a managed NoSQL database service provided by Amazon Web Services. It is known for its high scalability and performance and supports features like Global Secondary Indexes. In the video, DynamoDB is mentioned as an example of a database that abstracts the complexities of managing indexes and allows users to configure GSIs during index creation.
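As an illustrative sketch (table, attribute, and index names are invented, and this is not from the video), this is roughly how a DynamoDB table with a GSI and an LSI, including the projection choice, is declared with boto3; the official DynamoDB documentation remains the authoritative reference:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="blogs",
    AttributeDefinitions=[
        {"AttributeName": "author_id", "AttributeType": "S"},
        {"AttributeName": "blog_id", "AttributeType": "S"},
        {"AttributeName": "category", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "author_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "blog_id", "KeyType": "RANGE"},    # sort key
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "category-gsi",
        "KeySchema": [{"AttributeName": "category", "KeyType": "HASH"}],
        # KEYS_ONLY stores just the key reference; ALL copies the whole item.
        "Projection": {"ProjectionType": "KEYS_ONLY"},
    }],
    LocalSecondaryIndexes=[{
        "IndexName": "author-category-lsi",
        "KeySchema": [
            {"AttributeName": "author_id", "KeyType": "HASH"},  # must reuse the table's partition key
            {"AttributeName": "category", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)
```

The ProjectionType values (KEYS_ONLY, INCLUDE, ALL) mirror the reference-versus-full-object trade-off discussed above.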

💡Data Sharding

Data sharding is the process of splitting a database into smaller, more manageable pieces called shards. Each shard can be stored on a separate server, which allows for better distribution of the workload and can improve performance. The video discusses how data sharding is used in conjunction with indexes to optimize query performance in a distributed database.

💡Strong Consistency

Strong consistency is a guarantee that any read operation will return the most recent write operation's result. In the context of the video, maintaining strong consistency with indexes, especially Global Secondary Indexes, is important to ensure that the data retrieved is up-to-date. However, this can be challenging and expensive to achieve, particularly when there are many GSIs.

💡B+ Tree

A B+ tree is a type of self-balancing tree data structure that maintains sorted data and allows searches, insertions, and deletions in logarithmic time. The video mentions storing an index as a B+ tree and points to a previous video on how indexes are built with one, indicating that this data structure underpins index implementations in databases and provides an efficient way to store and retrieve data.
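The referenced B+ tree video is the right place for the data structure itself; as a very loose stand-in (a sorted array, not a real B+ tree), the "sorted keys plus logarithmic search" idea it relies on looks roughly like this:

```python
import bisect

# Sorted (key, row_id) pairs stand in for the leaf level of a B+ tree:
# lookups and range scans are both driven by binary search over sorted keys.
index = [("go", "B9"), ("mysql", "B1"), ("mysql", "B7"), ("nginx", "B4")]
keys = [k for k, _ in index]

def lookup(key):
    lo = bisect.bisect_left(keys, key)       # O(log n) positioning
    hi = bisect.bisect_right(keys, key)
    return [row for _, row in index[lo:hi]]  # then a short sequential scan

print(lookup("mysql"))  # ['B1', 'B7']
```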

Highlights

Indexes, typically created on secondary attributes, speed up database lookups.

Sharding and partitioning a database is essential for handling large volumes of data.

A practical example of indexing in a distributed data store is managing a large blogs database.

Choosing a partition key, such as author ID, is crucial for data distribution across nodes.

Hash functions are used to determine the database node where data should reside.

Queries on the partitioning key are efficient due to data being stored on the same node.

Global secondary indexes are introduced for efficient querying on secondary attributes like categories.

Global secondary indexes are maintained separately and partitioned by the secondary attribute.

Local secondary indexes are used when queries always contain the partition key.

Local secondary indexes provide strong consistency and are limited to a single shard.

Global secondary indexes can store either references to primary keys or entire objects for faster retrieval.

There's a tradeoff between space usage and query performance when deciding how much data to store in a global secondary index.

Maintaining global secondary indexes can be expensive due to the need for synchronization with the main data.

Many databases limit the number of global secondary indexes that can be created to manage costs.

The choice between global and local secondary indexes depends on the query patterns and requirements.

Distributed databases often have a form of secondary indexes, either explicitly or implicitly implemented.

The implementation of indexes in distributed databases is a fascinating domain to understand and prototype.

The presenter encourages viewers to explore and prototype index implementations for a deeper understanding.

Transcripts

So indexes make your database lookups faster, and we typically create indexes on secondary attributes. Things become really interesting when your database is sharded and partitioned, for example when you have a large volume of data and one node is not able to handle the load. What do you do? You shard the database: you partition the data and place it across multiple data nodes. Now let's take a practical example to understand how indexing works in a distributed data store. Say you have a blogs database holding a large number of blogs; say you're building a Medium-like application in which tons and tons of blogs are published. Given the large volume of data, one node will not be able to handle the load, which means we have to create multiple partitions of the data and place them across multiple shards. For that, we need to pick a partition key on the basis of which we split the data. Let's say we pick author ID as the partition key. Given a blog object and an author ID, we determine which of the three nodes is most capable of handling it using, say, a hash function: we take the author ID, pass it through the hash function, we know which database node that key resides on, and we go to that node and place the data. This is the classic way to do hash-based partitioning; you could also go for range-based partitioning or consistent hashing, so pick your favorite implementation.

But things become really interesting when we are looking for something specific. Let's say I want to fire this query: given a user ID, get all the blogs of that particular user. The flow is really easy. Given a user ID, I figure out which data shard holds the data for it, because my partitioning key is author ID. The user ID is the author ID, so I pass it through the hash function, and whatever it spits out, I go to that node and fire the query (select * from the table where author_id equals this), get all the blogs listed for that user, and send them back. It's a pretty straightforward query that works like a charm. This worked really well because we were querying on the partitioning key itself.

Now let's take another example. Say we are not looking for the blogs of a particular user but for something more. Every blog has a category, a topic the blog belongs to: say MySQL, Nginx, Go, Python, whatnot. What we want to query is all the blogs that belong to a particular category, say the MySQL category. Let me take a concrete example and say I have two data shards in which four blog items are distributed, two on each. User U1 wrote a blog with ID 1 belonging to category MySQL, with some title and some body; U1 passed through the hash function spits out one, so I put it on shard 1. Similarly, U1 wrote another blog, with ID 9, on the Go topic, with some title and body; it also resides on shard 1, because the user ID is U1. Now say U3 wrote two blogs, one on Nginx and one on MySQL; they go to shard 2, because U3 passed through the hash function spits out two.

Given this, the naive solution for getting all the blogs tagged with a particular category, say MySQL, has a problem. We can clearly see that the blogs tagged MySQL are not on one node; they are on both shard 1 and shard 2. So what happens when the request hits the database proxy? It needs to fan out the request to all the nodes, fire the query on each node, get the responses, merge them, and send the result back to the user. For every query like this, "get all the blogs tagged with a particular category", I have to fan out my request across all the database shards, combine the results, and send them back to the user. Even while explaining it, this felt really slow, and it is really slow, and there are a bunch of risks involved. Risk number one: what if one of the shards is overburdened? You made the request, it went to both shards in parallel, but one shard is slow, which means that although one shard responded quickly, you have to wait for the second shard to respond before you can emit the response to the user. So if one shard is slow, it affects the user experience. Worse, what if one shard is dead? Either you wait until the timeout happens or you send an incomplete result; that is another risk. Third, given that you might be paginating on this, you are firing the query on both data shards and getting the results, so a huge amount of data is transferred, and it eventually gets filtered out or paginated before it is sent to the user. So this is also expensive.

Given this, there has to be a better way to index the data, which is where we get introduced to the concept of global secondary indexes. So what do we do? Given that our query is "for a particular category, give me all the blogs that belong to it", what we need to do is maintain a separate index, a global index, for the secondary attribute category somewhere. This is what most databases abstract away for you; for example, DynamoDB calls it a global secondary index, and it is like a secondary table that you have. That's the whole idea. You create a global secondary index which holds your index but is partitioned by the secondary attribute you want to query on, in this case the category attribute. So we create an index which can itself be sharded internally (that's not a problem), but it is partitioned by the category attribute. All the posts that belong to, say, MySQL pass through the hash function and spit out the same value, so all MySQL posts go to shard 3 and all Go posts go to shard 4, for example. Here I've drawn these as separate shards, but that does not necessarily mean a separate set of machines; the index could co-reside with your existing data nodes, it totally depends on the implementation. For simplicity I've drawn it as a separate cluster, but you might not need to do that, because the database abstracts these things out for you: it can decide to co-locate the index with the data on the data nodes and just store it as a separate B+ tree on disk, or however it wants to do it. The idea is that the logical separation is very clear: where your global secondary index resides and where your data resides. It's a logical separation, not a physical separation.

Okay, now, given that we have the data stored this way, if we are looking for "given a category, give me all the blogs that belong to it", I would directly fire the query to this global secondary index, because this data is already partitioned by the category I'm looking for. So if I want to look for all the blogs that belong to the MySQL category, I can just fire the query: select * from blog_category_gsi (the global secondary index) where category = 'mysql'. When the request goes out, I can fire it at exactly the node where MySQL is known to live: passing the category through the hash, I know the shard ID, I go there, fire the query, get the blog IDs, get the data from the data shards, and respond back. That's how simple it becomes. So what we did is, from the data shards that we had, we created a global secondary index on the attribute we wanted to query on, and on this index we fire the query "select * from blog_category_gsi where category = 'mysql'". Here again I wrote a SQL query; it depends on the database and what it exposes, I've just written it as SQL to make it readable. Now, if you look carefully, because the global secondary index is actually partitioned by the category, the query only needs to go to one instance, get the blog IDs, then go to the data shards, read the actual objects, and send them back to the user. You don't need to go and query multiple shards for this. You literally fire one query, get the IDs, go to the place where the blog details live, combine the result, and send it back. It makes your life really easy and your query really efficient.

Now, this is where you have multiple implementations. First, your global secondary index can store just a reference to the primary key that you have. For example, if I'm creating a global secondary index on category, I can choose to store the category and the row ID, that is, the blog ID. Or I can choose to store the entire blog object there. When you have multiple choices you evaluate both of them, and a database can choose to implement it either way. Let's say we just store the primary key in the index. When the request comes to the DB proxy, the proxy will first go to the index shards and get the blog IDs, then go to the corresponding data shards for the objects that you want, get the blog details, and send them back to the user. So there is no unnecessary fetching of data from the data shards; you are only fetching the data that you require, and that's really nice, that's really efficient. Second, if we choose to store all the attributes in the global secondary index for a particular row, which means the entire document is re-partitioned and indexed there, then you have not just the blog ID but the entire blog object. In that case, the request comes in, you go to the index, get the data, and immediately send it back to the user. There is no need to look up the data shard, because your entire document resides in the global secondary index itself. Both of these options are available; if you choose DynamoDB, for example, you can set that as a configuration when you are creating a global secondary index.

Okay, now here we see a classic trade-off. If we just store the primary-key reference, you have to do one lookup on the index shards and then, for the corresponding primary keys you received, go to the data shards, read those documents, and send them back to the user; so you are doing multiple lookups. But if you store all the attributes in the GSI, which means the entire document is in the GSI, you are bloating up the index size, but you are saving a lookup. It's a classic space-versus-time trade-off that you may want to make here. Another challenge that comes in is that you need to keep the GSI in sync when the main data is manipulated. This means that any update that modifies a particular document has to update the index as well, and this needs to be done synchronously, because most databases do offer strong consistency with indexes. That becomes another problem: if you have a large number of GSIs, your updates and your writes take a hit and become really expensive, because now you have to update not just the main data shard but also the index shards that you have. This is why global secondary indexes are expensive to manage and maintain, and why a lot of databases actually limit the maximum number of GSIs you can create. They don't allow you to create any number of GSIs you want; they restrict the number, and typically 5 to 7 is the sweet spot there. But you could build a database tomorrow that allows more GSIs than that, for example by making the index updates eventually consistent if you want to. So this is global secondary indexes. DynamoDB has a very famous implementation, but pick any distributed database in the world and it will have a flavor of GSIs somewhere in its internal implementation, because that is what makes these queries efficient: by co-locating the data for an attribute in one place, you are making your queries efficient. It's a very standard practice out there.

Now, what is the opposite of a global secondary index? A local secondary index. What if we want to query "give me all the blogs for a particular category from a particular author", and this is the only type of query we would have? Hypothetically assume that we would never fire "give me all the blogs for a particular category" on its own; we would always be querying "for a particular category and a particular user, give me all the blogs". In that case, given that our query actually contains the partition key, we can create a local secondary index rather than a global secondary index. If your partition key is always going to be part of your query, you do not need to create a global secondary index; you can create a local secondary index, and this local secondary index is localized to a particular shard. So given that shard 1 has all the documents of user U1, which is all the blogs of user U1, you can create a local index out of them, on a local B+ tree, and this index is good enough to answer your query "for a particular user and a particular category, give me all the blogs". It would be answered from a single node; there is no need to fan out the request and gather the responses back. That is the advantage you get. So depending on your query pattern, depending on the query clauses you will be firing, you need to decide whether you need a local secondary index or a global secondary index for this. By default, any and every distributed database in the world has a flavor of this, either explicitly exposed to us, the consumers of the database, or implicitly implemented by the database. So depending on the database you are picking, go through the documentation and figure it out.

If you look carefully, the local secondary index, being local, makes it easy to ensure strong consistency, because your writes go to the same instance where your index is placed. So you can have a very strongly consistent implementation here, and the response will always come from a single node, with no need to fan out. But you are limited by the local shard: if you have, let's say, a local secondary index on category, you will never be able to fire an efficient query like "given a category, give me all the blogs"; you would always be firing "given a category and the user, give me all the blogs". That is its limitation. But if your query is always going to be with respect to a partition key, a local secondary index gives you a really good boost rather than creating a global secondary index for that.

So this is what I wanted to cover as part of indexes. This is a very fundamental concept of any distributed database in the world; they are either explicitly exposing it or implicitly managing it. Either way, pick up any database, go through its internals (it's a fascinating domain), and try to implement this yourself. It's a really easy piece to implement, so if you find time, go ahead and prototype this thing; it's quite fun, to be honest. And if you're interested in going deeper into how an index is created using a B+ tree, I already have a video on it; I'll link it in the card and in the description, so feel free to check that out. And yeah, this is all I wanted to cover in this one. I hope you found it interesting, I hope you found it amazing. That's it for this one, I'll see you in the next one. Thanks.


Related Tags
Database Sharding, Indexing, Distributed Systems, Data Optimization, Global Secondary Index, Data Partitioning, Hash Functions, Data Consistency, Query Efficiency, DynamoDB, Data Management