How databases scale writes: The power of the log ✍️🗒️

Gaurav Sen
10 Apr 2020 · 17:22

Summary

TLDR: In this informative video, Gaurav Sen (GKCS) explores how database write operations can be optimized using log-structured merge trees (LSM trees). The script explains why the B+ trees used in traditional databases can be inefficient for scaling writes, and introduces LSM trees, which prioritize fast writes through a log (an append-only, linked-list-like structure) while maintaining quick read times through sorted string tables. The video delves into the mechanics of LSM trees, including compaction and the use of bloom filters to speed up search queries, offering a comprehensive look at balancing read and write performance in database management.

Takeaways

  • 🌐 The video discusses optimizing database operations by reducing unnecessary data exchanges and I/O calls to improve response times.
  • 📚 Traditional databases use B+ trees for efficient insertion and search operations, which are both O(log n) in complexity.
  • 🔧 To scale databases, the video suggests condensing multiple data operations into a single query to minimize acknowledgments and headers.
  • 🔑 The trade-off for condensing data is the need for additional memory on the server to temporarily store the queries before flushing them to the database.
  • ✂️ Write operations can be optimized using linked lists, which offer O(1) constant time complexity for appending data.
  • 🔍 However, linked lists are inefficient for read operations, which require a sequential search, leading to O(n) time complexity.
  • 📈 The Log-Structured Merge Tree (LSM Tree) is introduced as a hybrid data structure that combines the fast write benefits of a log with the fast read benefits of a sorted array.
  • 🔄 LSM Trees work by periodically merging sorted arrays in the background, reducing the number of required read operations and improving efficiency.
  • 📊 The video explains the concept of 'compaction' in LSM Trees, where smaller sorted arrays are merged into larger ones to optimize read performance.
  • 🛠️ Bloom filters are suggested as a method to speed up search queries by reducing the number of chunks that need to be checked, albeit with a chance of false positives.
  • 🔑 The script concludes with the importance of balancing fast writes with efficient reads in database optimization, highlighting the LSM Tree as a practical solution.

Q & A

  • What is the main topic of the video script?

    -The main topic of the video script is optimizing database operations by discussing the use of data structures like B+ trees, linked lists, and log-structured merge trees for efficient write and read operations.

  • Why are B+ trees preferred in traditional databases?

    -B+ trees are preferred in traditional databases because they provide good insertion and search times, both of which are of the order log n, making them efficient for handling SQL commands like insert and select.

  • What is the main idea behind condensing data queries into a single query?

    -The main idea behind condensing data queries into a single query is to reduce unnecessary data exchange, such as acknowledgments and headers, and to decrease I/O calls, which in turn can free up resources and reduce request-response times.

  • What is the time complexity of write operations in a linked list?

    -The time complexity of write operations in a linked list is O(1), which means it is constant time because appending data to the end of the list is a straightforward operation.

  • What is the disadvantage of using a log for database storage in terms of read operations?

    -The disadvantage of using a log for database storage is that read operations are slow, as they require a sequential search through the entire log, which is an O(n) operation where n is the number of records in the log.

  • What is a log-structured merge tree (LSM tree) and how does it help with database optimization?

    -A log-structured merge tree (LSM tree) is a data structure that combines the fast write capabilities of a linked list with the fast read capabilities of sorted arrays or trees. It optimizes databases by allowing fast writes through sequential log appends and fast reads through sorted string tables that are created and merged in the background.

  • What is the purpose of merging sorted arrays in the background of an LSM tree?

    -The purpose of merging sorted arrays in the background of an LSM tree is to improve read performance by reducing the number of chunks that need to be searched during a query, thus decreasing the overall time complexity for read operations.

  • What is compaction in the context of LSM trees?

    -Compaction in the context of LSM trees is the process of taking multiple sorted string tables and merging them into a single, larger sorted array to facilitate faster search queries.

  • What is a bloom filter and how can it be used to speed up read queries in a database?

    -A bloom filter is a probabilistic data structure that can be used to test whether an element is a member of a set. In the context of databases, it can speed up read queries by quickly determining the presence of a record in a chunk, reducing the need for full sequential searches and thus speeding up the overall read operation, albeit with a possibility of false positives.

  • How does the script suggest managing the trade-off between fast writes and fast reads in database optimization?

    -The script suggests using a hybrid approach where writes are fast due to the use of a linked list-like structure, while reads are optimized by merging sorted arrays in the background to form sorted string tables. Additionally, bloom filters can be used to further speed up read queries by reducing unnecessary searches.
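
To tie the answers above together, here is a minimal sketch of the write path in Python. It is illustrative only, not code from the video: the class name LSMStore, the flush_threshold parameter, and the list-of-chunks layout are assumptions. Records are appended to an in-memory log (the fast, linked-list-like part), and once a threshold is reached the batch is sorted once and persisted as an immutable sorted chunk.

```python
class LSMStore:
    """Toy LSM-style store: O(1) appends in memory, periodic flush to sorted chunks."""

    def __init__(self, flush_threshold=6):
        self.memtable = []            # append-only in-memory log: constant-time writes
        self.chunks = []              # persisted sorted chunks (sorted string tables)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # Fast path: just append, like writing to the end of a log.
        self.memtable.append((key, value))
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One batched persist for the whole buffer: sort once, store as an immutable chunk.
        self.chunks.append(sorted(self.memtable))
        self.memtable = []
```

A read would check the memtable first and then binary-search the chunks, newest first; that part is sketched under "Sorted String Table" below.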

Outlines

00:00

📚 Database Optimization with B+ Trees and Logs

In this paragraph, the speaker discusses the challenges of scaling databases and introduces the B+ tree data structure, which is crucial for efficient SQL command execution. The B+ tree is likened to a binary search tree but with multiple paths from each node, offering good insertion and search times. The speaker then explores the idea of reducing unnecessary data exchange and I/O calls to improve database performance. The concept of condensing data queries into a single operation is proposed to enhance efficiency, although this comes with the trade-off of requiring additional memory. The paragraph concludes with a teaser about the fastest data structure for write operations, hinting at the linked list's advantages.

05:03

🔄 Leveraging Linked Lists and Sorted Arrays for Optimal Performance

The speaker elaborates on the use of linked lists for fast write operations and the subsequent need for a data structure that supports quick read operations. The concept of a log-structured merge tree (LSM tree) is introduced as a hybrid approach, combining the benefits of fast writes from a linked list and efficient reads from sorted arrays. The mechanics of how data is appended to a log and then sorted before being persisted in the database are explained. The speaker also addresses the inefficiency of constantly sorting large datasets and introduces the idea of maintaining sorted chunks to optimize read operations. The paragraph ends with a discussion on the trade-offs between read and write speeds and the strategy for merging sorted arrays to improve performance.

10:05

🔍 Enhancing Database Read Efficiency with Merging and Bloom Filters

This paragraph delves into the process of merging sorted arrays to form larger chunks, which reduces the time complexity of search queries. The speaker explains how the database manages chunks of varying sizes and decides when to merge them, drawing parallels to the merge step of merge sort. The concept of a bloom filter is introduced as a way to speed up search queries by quickly ruling out chunks that cannot contain a record; sizing the filter to keep false positives low in turn keeps read times down. The speaker suggests that creating a bloom filter for each chunk helps avoid unnecessary reads and improves the overall efficiency of the database system. The arithmetic behind this trade-off is reproduced below.
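
The chunk arithmetic the speaker walks through can be reproduced directly. The snippet below is an illustration of that calculation, not code from the video: it compares binary-searching three unmerged chunks of six records against searching one compacted chunk of twelve plus the remaining chunk of six.

```python
import math

CHUNK = 6
# Three separate chunks of 6 records: worst case, binary-search each one.
unmerged = 3 * math.ceil(math.log2(CHUNK))                                  # 3 * 3 = 9 comparisons
# After compacting two chunks into one of size 12: search it, then the remaining chunk of 6.
compacted = math.ceil(math.log2(2 * CHUNK)) + math.ceil(math.log2(CHUNK))  # 4 + 3 = 7 comparisons

print(unmerged, compacted)  # 9 7 -> compaction reduces read work, at the cost of a background merge
```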

15:06

🌐 Conclusion on LSM Trees and Future Optimization Techniques

In the final paragraph, the speaker wraps up the discussion on log-structured merge trees, summarizing the key points about using linked lists for fast writes and sorted arrays for efficient reads. The process of compaction, which merges multiple sorted string tables into a single array, is highlighted as a critical component of LSM trees. The speaker also hints at further optimization techniques, such as the use of bloom filters, to enhance search query performance. The paragraph concludes with an invitation for feedback and suggestions from the audience, emphasizing the ongoing nature of database optimization and the potential for further improvements.

Keywords

💡Database Optimization

Database optimization refers to the process of improving the performance and efficiency of a database system. In the context of the video, it is about finding ways to scale the database to handle a large volume of data and queries without sacrificing speed or reliability. The script discusses optimizing write operations and reducing unnecessary data exchanges like acknowledgments.

💡B+ Tree

A B+ tree is a type of data structure that allows for efficient insertion and search operations, which are both of order log n. The video explains that traditional databases use B+ trees because they provide multiple paths from the root to the leaves, making them ideal for managing large datasets with fast access times.

💡I/O Calls

I/O calls, or Input/Output calls, are operations that manage the transfer of data between the storage system and the central processing unit. The script mentions reducing I/O calls as a way to free up resources and decrease request-response times, which is crucial for database performance.

💡Condensing Data

Condensing data in the script refers to the process of combining multiple data queries into a single query to reduce the number of I/O operations. This approach aims to increase bandwidth efficiency and reduce the workload on the database by minimizing the data sent and the number of acknowledgments received.
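
As a rough illustration of this batching idea (a sketch, not the video's code; send_batch is a hypothetical stand-in for a single I/O call to the database):

```python
class WriteBatcher:
    """Buffers queries in memory and flushes them to the database in one I/O call."""

    def __init__(self, send_batch, max_queries=100):
        self.send_batch = send_batch    # hypothetical: one round trip, one acknowledgment
        self.max_queries = max_queries  # bound on the extra server memory used
        self.pending = []

    def add(self, query):
        self.pending.append(query)
        if len(self.pending) >= self.max_queries:
            self.flush()

    def flush(self):
        if self.pending:
            self.send_batch(self.pending)  # one request and one acknowledgment for the batch
            self.pending = []
```

The trade-off is exactly the one described above: max_queries controls how much memory the server gives up before the buffer is flushed.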

💡Linked List

A linked list is a basic data structure consisting of nodes that are connected via pointers. The video discusses the use of linked lists for their efficiency in write operations, as appending data to the end of a list is a constant time operation, which is beneficial for databases that require fast writes.
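
For intuition, an append-only log file behaves the same way: each write is a constant-time append at the end, with no seeking or re-sorting (a minimal sketch; the file name and record format are illustrative):

```python
def append_record(record: str, path: str = "write.log") -> None:
    # O(1) with respect to existing data: open in append mode and add one line at the end.
    with open(path, "a") as log:
        log.write(record + "\n")

append_record("key=31,name=Jane")  # example record, mirroring the video's sample data
```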

💡Log Structured Merge Tree (LSM Tree)

The Log Structured Merge Tree is a data structure that combines the fast write capabilities of a log with the fast read capabilities of a sorted array. The video explains that LSM trees are used to optimize write operations in databases by first writing data to a log and then merging it into sorted arrays for efficient searching.

💡Compaction

Compaction in the context of LSM trees is the process of merging multiple sorted string tables into a single, larger sorted array. This is done to improve read performance by reducing the number of areas that need to be searched during a query, as explained in the script.
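
A compaction step is essentially the merge step of merge sort applied to two sorted string tables. Below is a minimal two-pointer sketch; the sample keys are loosely based on the video's example and are otherwise arbitrary.

```python
def compact(table_a, table_b):
    """Merge two sorted string tables into one larger sorted table (merge-sort style)."""
    merged, i, j = [], 0, 0
    while i < len(table_a) and j < len(table_b):
        # Keep taking the smaller of the two heads, as in the video's walkthrough.
        if table_a[i] <= table_b[j]:
            merged.append(table_a[i])
            i += 1
        else:
            merged.append(table_b[j])
            j += 1
    merged.extend(table_a[i:])   # one side is exhausted; copy the rest of the other
    merged.extend(table_b[j:])
    return merged

# Two sorted chunks of six records become one sorted chunk of twelve.
merged = compact([12, 17, 19, 23, 31, 47], [1, 26, 36, 52, 60, 75])
```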

💡Sorted String Table

A sorted string table is a sorted array of records within an LSM tree. The script mentions that these tables are created in the background to facilitate fast read operations, as they allow for binary search rather than sequential search.
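
The read path over sorted string tables is a binary search per chunk, checked newest first. This sketch assumes the same chunk layout as the LSMStore example earlier (lists of (key, value) pairs sorted by key); a real SSTable would keep an index rather than rebuilding the key list on every lookup.

```python
import bisect

def read(chunks, key):
    """Binary-search each sorted chunk, newest first; return the value or None."""
    for chunk in reversed(chunks):            # newer chunks hold fresher data
        keys = [k for k, _ in chunk]          # for brevity only; real tables keep an index
        i = bisect.bisect_left(keys, key)     # O(log n) lookup within the chunk
        if i < len(keys) and keys[i] == key:
            return chunk[i][1]
    return None
```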

💡Bloom Filter

A bloom filter is a probabilistic data structure used to test whether an element is a member of a set. The video suggests using bloom filters to speed up search queries in databases by quickly determining the presence or absence of a record, thus reducing unnecessary read operations.
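
A minimal bloom filter sketch follows; the bit-array size, the number of hash functions, and the use of seeded SHA-256 hashes are assumptions for illustration, not the video's parameters.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions from seeded hashes of the item.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means "maybe present" (false positives possible).
        return all(self.bits[pos] for pos in self._positions(item))
```

Attaching one filter per chunk lets the read path skip any chunk whose filter says "definitely absent"; using a larger bit array for larger chunks lowers the false-positive rate, as the video suggests.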

💡False Positives

In the context of bloom filters, false positives occur when the filter incorrectly indicates that an element is present in the set. The script explains that while bloom filters can speed up queries, they may sometimes return a positive result even when the record is not actually in the database, which is a trade-off for the increased search speed.

Highlights

Optimizing database writes by reducing unnecessary data exchange and I/O calls for faster response times.

Introduction to the B+ tree data structure, which offers good insertion and search times due to its multiple paths per node.

The concept of condensing data into a single block for efficient bandwidth use and reduced I/O operations.

Trade-off between using additional memory for condensed data queries and the benefits of fewer I/O operations.

The linked list data structure's advantage in write operations, since appending to it takes constant time.

The Log-Structured Merge Tree (LSM Tree) as a hybrid approach combining the benefits of fast writes and sorted arrays for efficient reads.

The disadvantages of using a log for read operations, which require sequential search and are slow compared to tree-based structures.

The importance of maintaining fast writes and reducing I/O operations while addressing the slow read issue.

The use of sorted arrays to improve search times from O(n) to O(log n) after converting a linked list.

Mechanics of persisting data in sorted order to the database to facilitate efficient querying.

The process of merging sorted chunks in the background to maintain a balance between write and read performance.

The concept of compaction in LSM Trees, which merges sorted string tables to optimize read queries.

The management of memory and chunks to prevent slowing down write operations while optimizing read times.

The use of bloom filters to speed up search queries by reducing the number of chunks that need to be searched.

The strategy of merging sorted arrays in the background to form larger sorted arrays for faster searches.

The decision-making process for when to merge sorted arrays in LSM Trees, similar to merge sort.

The impact of varying block sizes on read operations and the use of bloom filters to speed up queries across blocks.

The introduction to log-structured merge trees and their role in optimizing both write and read operations in databases.

Transcripts

[00:00] Hey everyone, this is GKCS. Today we'll be talking about a way in which we can optimize writes in our database. You have your client, which sends information to your server; the server has to write to the database because it needs to maintain some state. How do you scale this? You can think of your database as a data structure. To speed up queries, the traditional database uses a data structure called a B+ tree. This is like a binary search tree without the "binary" in it: there are multiple paths that each node can follow. The B+ tree is preferred because it gives you good insertion and search times; both of them are O(log n). So whenever you fire a SQL command, an insert or a select, that maps to an insertion or a search operation in the B+ tree, and the underlying data structure is manipulated. Each of these operations requires an acknowledgment from the database which says, "Yes, I have successfully executed the request."

[00:50] To scale our database, we want to reduce the unnecessary exchange of data, which is acknowledgments and headers, and we also want to reduce the I/O calls, which will help us free up some resources and reduce the request-response times. If we can condense all of this data into one block, send it in one shot to the database, and get one acknowledgment, that's a good idea. That would be a more efficient use of your bandwidth, because you don't have any extra data that you are sending around, and you're also going to be doing less work because there'll be one I/O call and one I/O response. Condense data queries into a single query: that's our first idea. The server has to give up some memory to condense this bunch of data queries, so that requires additional memory; this is one of the drawbacks. The reason we did this, which is one of the advantages, is that we need fewer I/O operations. If you have a lot of write operations, what is the fastest data structure which can help you do write operations quickly?

[01:57] The linked list. It's the most basic data structure if you think of it: whenever there's a write in a linked list, you just go to the end and append something to it, and that's exactly what we are trying to optimize here. So the write operations in a linked list are O(1), constant time, because all you need to do is take the linked list, go to the end, and add a node. We have a well-known data structure which follows the philosophy of a linked list, called a log, and you can see where the "log" in log-structured merge tree is coming from. We are going to be using a log to persist data, because it's really fast when it comes to write operations.

[02:42] What is the disadvantage of a log? Read operations are really slow. In a log, if you need to find something, you just can't jump to one point; you have to search through the entire log sequentially, and if you think of your database as a log, that's a huge amount of data which you'll need to read for every query that a person sends. So the advantage is that you have fast writes, but the clear disadvantage is that you have slow reads. We have two clear advantages which we want to keep, because fast writes and fewer I/O operations sound really good, but the drawbacks are something we want to cut down on. The additional memory you can't really help much, because if you need to condense the queries and keep them somewhere, they have to be somewhere; so the additional memory is fine, it's not something we can optimize much. All you can do is constrain the maximum number of blocks that you'll keep in memory before you flush to the database, so this can be managed. What we really don't want to do is slow down our read operations. If you're thinking about Facebook, where there are a lot of people posting things, the only thing worse than a slow post is a slow read operation, where you want to know what your friends are doing and it takes 30 seconds to actually get that news feed onto your browser. So slow reads are something we definitely want to avoid, and what can we do about that? Well, here is some good news.

[04:06] The first thing is that slow reads depend on what you are reading from. You don't want to be reading from this linked list, because it gives you, by definition, a sequential search, which is O(n); that's pretty bad. Can you improve on this? No, not as a linked list you can't. But if you convert your data to a far more sensible data structure, something like a tree or a sorted list, a sorted array, that makes your searching really fast. So if, instead of the linked list, you use something like a sorted array, then your search time becomes O(log n). But this seems counterintuitive, because you already had a B+ tree which was giving you log n search time and log n insertion time. So what should we take, a B+ tree or a linked list?

[05:07] A linked list is great at writing but terrible at reading, and the reason we have moved on to it is because B+ trees give us poor write performance. That's why we have the magic combination of using a linked list with a sorted array, to get great write speeds and great read speeds. If we had this data in a sorted array, it would take us log n time, which is really fast; this takes less time compared to O(n), where the sequential search is really slow. So can we convert our linked list to a sorted array? We can, but we don't want to do it in memory, because that defeats the purpose: the whole point of having a linked list was to do fast insertions, and in a sorted array insertions are really slow. So you won't do it here; you will instead do it somewhere over here, in the database. Whatever information you get from the client, you sort it and then you persist it, so that the read operations are super fast.

[06:09] Let's talk about the mechanics of how this is going to work. Firstly, we mentioned that we need some sort of a log over here, which is being appended to by the server. Every time a new record comes in, the latest record, say Jane at 31, is appended, and you keep appending to this log. After a certain point, which is a threshold on the number of records you can keep in memory, you persist this to the database in one shot. What's happening in the DB is that this data is going to be sorted before it is persisted, or rather, it's going to be persisted in a sorted fashion. You can see that the numbers here are jumbled up, while here it's 12, 17, 19, 23, 31, 47, in a nice sorted order. What you can do on top of this is a binary search, which allows you to query this data in an efficient manner.

[06:58] But what if you have some more data coming in? If, after this append is complete, you want to append new stuff to your database, you don't just take it and immediately push it to the database. If we do the same thing as we did over here, now that we have these records and you want to persist them, all you need to do is send this information to the DB. When you want to persist this in the DB, the DB can do two things. One is it can take these six records and merge them with these six records, so it'll create a sorted file of size twelve. But that's maybe not the smartest thing to do, because if you think of it, this is going to happen all the time: after 10,000 records in the DB, you send in six more records, and it needs to sort ten thousand and six records for the additional six records that you sent. So the increment of six records is actually forcing the entire database to be sorted, and flushing is nothing compared to sorting a huge bunch of numbers; this has time complexity n log n. So how do we optimize this?

[08:03] One of the things you can do is, instead of merging arrays all the time, keep them in sorted chunks. So your database is going to be represented by chunks: this is a chunk of size six, and this chunk, after being sorted, will also be of size six. Now, whenever there's a read query on your database, all you need to do is search through this chunk, do a binary search on it, and if you don't find it, do a binary search on the next chunk; there are just two chunks you have to go through. This is better than slowing down your write operations tremendously. Your write operations are reasonably fast even now, but your read operations are slightly slower. If you have n insertions, the number of chunks you'll have is n divided by six. If you draw the graph for this, it's after all a linear graph, so this is again O(n); n by six is the graph, literally. So your read times are extremely slow even now. Imagine something like Facebook, where you have billions of records in the database; one billion divided by six is still very, very slow.

[09:06] So how do you optimize this? Well, you can use some sort of a hybrid approach, which is taking some records and merging them with the other records as long as the sort time is not too high. When you have six records here and six records over here, you merge them together to get one sorted array, and this uses the standard merge-sort technique. I take the first two elements over here; because the arrays are sorted, I know they're the smallest in their respective arrays. So 12 and 1 are the smallest elements in these two respective arrays. Which one is smaller? 1. So I take 1, which is Sam, and pump it out over here. Now my pointer is still at 12 over here, while my pointer over here has moved on to 36. 12 is smaller than 36, so I take Tim; the pointer changes to John, who is at position 17. 17 is smaller than 36, so I take John. This is the basic merging process when you're taking two sorted arrays and you want to merge them into a single sorted array: you move the two pointers and keep taking the smaller element. So the next one will be Iris, who is 19, and then I push the pointer forward, so it has moved here; these are now gone. Then there's 23, which is again smaller than 36, so that record comes out, and then Jane. Finally, after Jane, when the pointer is at 47 with Mac, 47 is greater than 36, so you will pull out Larry, and so on and so forth. You get this nice sorted array of size 12.

[10:30] So now what's happening is that every time you are getting a chunk, you are deciding whether or not you should merge this chunk with the existing chunks. In the background, what we are going to do is take chunks and merge them together to make larger sorted arrays, so that when there is a search query, this large sorted array can help us reduce the overall time complexity.

[11:15] To clarify things: if you get three chunks in total, that means you have 18 records in total broken into three chunks, and the time required to search in one chunk is log base 2 of 6, which is approximately 3. If you have to search all the chunks for a particular record, say I'm searching for record number 8: I search here, I don't find it; I search here, I don't find it; and I find it in the end. So the total number of operations I'll need to do is log base 2 of 6, three times, which is in total around 9 operations. If instead, in the background, I was able to merge two sorted arrays into a sorted array of size 12, then the total number of operations I have to do for these two chunks is just 4, and the number of operations I have to do here is 3; 3 plus 4 gives you 7 operations. So there's a clear saving here, but obviously you need to merge two sorted arrays and then persist them, so that has some overhead. Let's see what is the best approach.

[12:17] If you get another sorted chunk of size six, what I'll do is take these two chunks and merge them together, and have two sorted chunks of size 12. Can I do better? Yes: I can take those two sorted chunks of size 12 and merge them into a single sorted chunk of size 24. Log base 2 of 24 is approximately 5, and instead, if I had all four chunks unmerged, I would have 3 times 4, which is 12 operations. So you can see there's a clear saving: 5 instead of 12 if you merge the arrays together. So when do we decide to merge arrays? It's very similar to how merge sort works: if you have two blocks of size 6, you convert them into a block of size 12; if you have two blocks of size 12, you convert them into a block of size 24; and so on, all the way up to your maximum size, which is n, the number of insertions; that's the maximum possible size you can have. What happens with this strategy is that you have a large number of blocks of varying size: for example, this one is of size six, this one is of size 12, and so on and so forth; you'll have some blocks of 24 also, which have been merged after some time. So your read operation is spread across these blocks, and to speed up your read operation you can try some clever hacks.

[13:35] The idea you can apply here is something called a bloom filter; there's a video I made on that, which you can check out. The basic idea of a bloom filter is quite simple. Say you have a book with a lot of words in it, and you want to find the word "cat" in that book. One of the things you can do is search the entire book for the word "cat", or the book can provide you with a particular data structure called a bloom filter. Say I have 26 positions in this bloom filter: one for A, one for B, one for C, D, and so on and so forth. If there exists any word in the book starting with the letter C, I mark this position as true, I mark it as 1; so the position of C will be marked as 1. So if the word "cat" exists in this book, this position is definitely going to be marked as true. Think about some edge cases: what if I have the word "corona" in this book, will this letter be marked? Yes. But what if I don't have the word "cat" and I just have the word "corona" in this book, will this letter be marked? Yes. That's called a false positive. So in bloom filters you can have false positives, but you cannot have false negatives: if there is no word with C, there's no way that this letter is going to be marked. That's the general idea of a bloom filter.

[14:57] You can think of your book being the database and your search query being the bloom filter query. If you are looking for whether this record exists in this chunk, all you need to do is create a bloom filter on top of this chunk, which tells you whether a record exists in this list or not. This speeds up your querying, but you can get false positives. The rate of false positives can be managed by increasing the number of bits: instead of saying whether there's a word starting with C, I can say, are there any words starting with "CA"? So I'm taking combinations now. Then "CO" is going to be marked, but "CA" is not going to be marked, so this is going to be a proper negative, and I can speed up queries that way. I can use that concept to create bloom filters here and speed up my search queries. Also, any time I merge two arrays, I'm going to require a separate bloom filter for the merged one. A good idea is to create a bloom filter of larger size, because there's more information here and the chance of a false positive is higher; therefore I need to increase the information in the bloom filter to reduce the chance of a false positive. This way, when you are reducing the false positives, you are indirectly reducing the read times as well; you don't have any useless reading to do, because this is accurate.

because this is accurate there's a lot

play16:13

more to log structured merge trees but

play16:15

this is a reasonable introduction to the

play16:18

concept what you want to do is you want

play16:19

to speed up writes for that you use the

play16:21

data structure which is a linked list

play16:22

but you can't read well on linked lists

play16:25

so you in the background merge sorted

play16:28

arrays which can be binary searched to

play16:30

get fast read operations so your rights

play16:32

are fast because you're flushing rights

play16:35

in in a sequential fashion but your

play16:37

reads also fast because you are reading

play16:39

from sorted tables which have been made

play16:41

in the background that's the general

play16:42

idea for lsdm some of the important

play16:44

terms you can take from here are this

play16:46

sorted set of Records is called a sorted

play16:50

string table yeah the reason being that

play16:52

it is a sorted string table the other

play16:54

thing that we did off of taking these

play16:57

two arrays and merging them together is

play16:59

called compaction so compaction is

play17:02

taking lots of sorted string tables and

play17:04

using the merge process condensing them

play17:07

into a single array so that we can do

play17:09

fast search queries if you have any

play17:11

doubts or suggestions on this you can

play17:13

leave them in the comments below if you

play17:14

like the video then make sure to hit the

play17:16

like button and if you want

play17:17

notifications for further videos like

play17:18

this hit the subscribe button I'll see

play17:20

you next time


Related tags
Database Optimization, Data Structures, B+ Tree, Log-Structured Merge Trees, Write Efficiency, Read Speed, I/O Operations, Memory Management, Search Queries, Bloom Filters