What is DATABASE SHARDING?

Gaurav Sen
7 Aug 201808:55

Summary

TLDRThe script explores database optimization techniques, emphasizing the limitations of traditional indexing and introducing sharding as a solution for handling large datasets. It explains sharding with the pizza analogy, dividing data across multiple servers based on a key attribute, and discusses horizontal partitioning. The script also touches on challenges like cross-shard joins and inflexible shard sizes, proposing solutions like consistent hashing and hierarchical sharding. It highlights the importance of consistency and availability in databases and suggests using indexing on shards to improve performance. Finally, it mentions master-slave architecture for fault tolerance.

Takeaways

  • πŸ” Query optimization is essential for handling large databases, and traditional methods like indexing may not suffice for massive datasets.
  • πŸ“š The concept of sharding is introduced as a solution for distributing large datasets across multiple servers, similar to sharing pizza slices among friends.
  • πŸ“ˆ Sharding involves horizontal partitioning of data using a key attribute, which helps in managing and scaling databases effectively.
  • πŸ”‘ Horizontal partitioning is differentiated from vertical partitioning, which uses columns to partition data.
  • πŸ›‘οΈ Consistency and availability are key attributes of a database; consistency ensures that data is synchronized across the system, while availability ensures the system is always running.
  • πŸ”‘ The choice of sharding key is crucial and can be based on attributes like user ID or location, depending on the application's needs.
  • 🚫 Sharding presents challenges, such as the complexity of performing joins across different shards, which can be computationally expensive.
  • πŸ”„ The inflexibility of shard sizes is a limitation, but techniques like consistent hashing and hierarchical sharding can help address this issue.
  • πŸ“ˆ Hierarchical sharding allows for dynamic breaking down of large shards into smaller ones, increasing flexibility in database management.
  • πŸ” Creating indexes on shards can improve query performance, especially when queries involve attributes different from the sharding key.
  • πŸ›‘οΈ Master-slave architecture can provide fault tolerance in sharding setups by allowing multiple slaves to replicate data from a master server.
  • πŸš€ While sharding offers benefits in read and write performance, it is a complex solution and should be considered carefully, with simpler solutions like indexing or NoSQL databases explored first.

Q & A

  • What is the main topic discussed in the video script?

    -The main topic discussed in the video script is database optimization, specifically focusing on the concept of sharding as a method to handle large volumes of data.

  • What is an SQ optimizer mentioned in the script?

    -An SQ optimizer is a tool or technique used to optimize SQL queries, though the script suggests that it might be considered 'old school' for handling very large datasets.

  • Why might indexing be insufficient for a database with a lot of data?

    -Indexing might be insufficient for a database with a lot of data because it can help speed up query times but does not address the fundamental issue of data distribution and scalability that comes with large datasets.

  • What is the analogy used in the script to explain sharding?

    -The analogy used in the script to explain sharding is a pizza that is too large for one person to eat alone, so it is broken into slices and shared among friends.

  • What is horizontal partitioning and how is it related to sharding?

    -Horizontal partitioning is the process of dividing a database into parts, each part stored on a different server. Sharding is a specific implementation of horizontal partitioning where data is distributed across multiple servers based on a key attribute.

  • What are the key attributes of a database mentioned in the script?

    -The key attributes of a database mentioned in the script are consistency, which ensures that data is accurately stored and retrieved, and availability, which refers to the database's uptime and reliability.

  • Why is consistency considered more important than availability in the context of databases?

    -Consistency is considered more important than availability because it ensures the accuracy and reliability of the data. While uptime is important, having accurate data is crucial for the integrity of the database.

  • What is the problem with joins across shards as mentioned in the script?

    -The problem with joins across shards is that they can be inefficient because the query must access data from multiple shards, which may involve network communication and can significantly slow down the process.

  • What is the issue with the inflexibility of shards in sharding?

    -The issue with the inflexibility of shards is that once the data is partitioned, it is difficult to change the number of shards without significant reorganization of the data.

  • What is consistent hashing and how does it address the inflexibility of shards?

    -Consistent hashing is an algorithm that allows for a dynamic number of shards by distributing data across a virtual ring of hashes. It helps address the inflexibility of shards by allowing the system to add or remove shards without significant disruption.

  • What is hierarchical sharding and how does it help with the inflexibility problem?

    -Hierarchical sharding is a technique where a shard, which has too much data, is further divided into smaller pieces or 'mini-shards'. This approach helps overcome the inflexibility problem by allowing the system to dynamically adjust the number of data pieces as needed.

  • What is the role of a master-slave architecture in the context of sharding?

    -In the context of sharding, a master-slave architecture is used to provide redundancy and fault tolerance. The master holds the most current data and handles write requests, while the slaves replicate the master's data and handle read requests.

  • Why is indexing important on shards?

    -Indexing is important on shards because it can significantly improve the performance of queries that are not based on the sharding key. By indexing other attributes, the database can quickly locate and retrieve data within a shard.

  • What challenges does sharding present when it comes to practical application?

    -Sharding presents challenges in practical application due to the difficulty in maintaining consistency across distributed data. It requires careful planning and consideration of data access patterns and query performance.

Outlines

00:00

πŸ• Sharding for Database Scalability

The first paragraph discusses database optimization and introduces the concept of sharding as a solution for handling large datasets. It compares sharding to sharing a pizza with friends, where the pizza represents the database and the slices represent partitions of data distributed across different servers. The explanation covers horizontal partitioning using a key attribute, contrasting it with vertical partitioning. It also touches on the importance of database consistency and availability, and how sharding can be used to improve performance by reducing the dataset size for each query. The paragraph concludes with potential issues related to sharding, such as the complexity of joins across shards and the inflexibility of shard sizes.

05:05

πŸ”„ Overcoming Sharding Challenges with Consistent Hashing and Hierarchical Sharding

The second paragraph delves into strategies for addressing the challenges associated with sharding. It introduces consistent hashing as an algorithm to manage data distribution across a dynamic number of shards, mentioning memcached as a database that uses this technique. The paragraph explains how hierarchical sharding can be used to break down large shards into smaller, more manageable pieces, thus overcoming inflexibility. It also discusses the benefits of creating indexes on shards for improved query performance and touches on the use of master-slave architecture to ensure high availability and fault tolerance in the event of a shard failure. The summary concludes with advice for those considering sharding, suggesting the use of ready-made solutions like indexing and NoSQL databases before implementing sharding.

Mindmap

Keywords

πŸ’‘Database

A database is a structured collection of data, often stored electronically in a computer system. In the context of the video, the database is the subject of optimization and scaling discussions, highlighting the need for efficient data management as the volume of data grows.

πŸ’‘Indexing

Indexing in databases refers to the creation of data structures that improve the speed of data retrieval operations. The script mentions indexing as a method to optimize queries, suggesting that it's a fundamental technique for enhancing database performance.

πŸ’‘NoSQL

NoSQL stands for 'Not Only SQL' and refers to a category of database management systems that provide a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. The script briefly mentions NoSQL as an alternative to traditional SQL databases for handling large volumes of data.

πŸ’‘Sharding

Sharding is a method of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. In the video, sharding is used as an analogy with a pizza being split into slices to explain how data can be distributed across multiple servers.

πŸ’‘Horizontal Partitioning

Horizontal partitioning, also known as sharding, is the process of splitting a database into parts that are distributed across multiple servers. The script uses this term to describe how data can be divided based on a key attribute, allowing for scalability and improved performance.

πŸ’‘Vertical Partitioning

Vertical partitioning is the process of dividing larger tables into smaller ones based on the columns, rather than the rows. The script contrasts this with horizontal partitioning, noting that vertical partitioning is less about scalability and more about managing table size.

πŸ’‘Consistency

Consistency in databases refers to the property of ensuring that all transactions are processed reliably and that the database accurately reflects all transactions. The script emphasizes the importance of consistency, stating that it is a key attribute of any database system.

πŸ’‘Availability

Availability in the context of databases is the measure of the time a system is operational and accessible to users. The script discusses the trade-off between consistency and availability, noting that while both are important, consistency often takes precedence.

πŸ’‘Joins Across Shards

Joins across shards refer to the process of combining data from different shards, which can be computationally expensive. The script identifies this as a potential problem with sharding, as it requires data to be pulled from multiple sources and then merged.

πŸ’‘Consistent Hashing

Consistent hashing is an algorithm that provides a way to distribute data across a set of servers in a way that minimizes reorganization when servers are added or removed. The script suggests consistent hashing as a solution to the inflexibility of shard sizes, allowing for a dynamic number of shards.

πŸ’‘Hierarchical Sharding

Hierarchical sharding is a technique where a shard is further divided into smaller pieces or 'mini-shards'. The script describes this as a method to overcome the inflexibility of traditional sharding by allowing for dynamic breaking down of large shards into smaller, more manageable pieces.

πŸ’‘Master-Slave Architecture

Master-slave architecture is a type of data architecture where one server (the master) is the primary source of data, and other servers (the slaves) replicate the master's data. The script mentions this architecture as a way to provide fault tolerance in the event of a shard failure, with slaves taking over if the master goes down.

Highlights

Query optimization using an SQL optimizer is discussed as an old school approach.

Indexing is introduced as a basic method for database performance improvement.

The concept of NoSQL databases is introduced but dismissed for the given scenario.

Sharding is introduced as a solution for handling large datasets by partitioning data.

A pizza analogy is used to explain the concept of sharding.

Horizontal partitioning is described as a method of data distribution across servers.

Vertical partitioning is mentioned in contrast to horizontal partitioning.

Sharding is distinguished from traditional servers by focusing on database servers.

Consistency and availability are highlighted as key attributes of a database.

Sharding can be based on attributes like user ID or location in applications.

The problem of joins across shards is discussed as a challenge in sharding.

The inflexibility of shard sizes is identified as an issue.

Consistent hashing is suggested as a solution for dynamic scalability in sharding.

Hierarchical sharding is introduced to address the inflexibility of shard sizes.

Indexing on shards is recommended for improving query performance.

Master-slave architecture is presented as a solution for handling shard failures.

The practical application of sharding is acknowledged as challenging due to consistency issues.

The importance of using ready-made solutions like indexing or NoSQL databases is suggested before sharding.

Transcripts

play00:00

So.

play00:06

So how do you query this database?

play00:14

So I would optimize the queries using an SQ optimizer. Well,

play00:17

let's say we have a lot of data. So optimizing queries is so, you know,

play00:22

old school.

play00:25

Just so.

play00:28

I could make an index on the table. Alright. Indexing is, indexing is cute,

play00:33

but we are looking for something which is serious. Right, we got lots of data.

play00:39

So can we use a NoSQL? No, we are not gonna learn audio m s.

play00:42

Now for the final time, what do you think we should do? So shorting,

play00:45

I'll use shorting. Hmm.

play00:50

Okay.

play00:53

Hired. What is sharding?

play00:56

Let's say you have pizza and you can't have the entire thing by yourself.

play01:00

So you break it into slices and call your friends.

play01:03

Over eight friends.

play01:06

Now each of these friends is going to get one slice of pizza.

play01:09

What you have done effectively is partitioned the pizza according to each

play01:14

friend's share.

play01:16

Just like that we can have servers which are going to be taking the load of the

play01:20

requests, which are, which are being sent into it. So if there's a server here,

play01:25

now how do you get that to the pizza model?

play01:28

User id number zero is going to start here. 100 starts here, 200.

play01:34

What you have effectively done is taken all the server requests that you had and

play01:38

mapped them onto a pizza such that each of these slices is going to be

play01:43

served by one server, in this case, server id.

play01:48

Number six.

play01:49

The key thing to notice here is we couldn't eat the entire pizza by ourselves.

play01:54

We needed friends to finish the pizza. To handle the pizza effectively.

play01:59

And when you're getting friends along,

play02:00

you're effectively taking the range of the pizza and breaking into pieces.

play02:06

When you're doing that, you are partitioning the pizza.

play02:10

This kind of partitioning,

play02:11

which uses some sort of a key to break the data into pieces

play02:16

and allocate that to different servers, is called horizontal partitioning.

play02:22

Horizontal partitioning depends on one key,

play02:25

which is an attribute of the data that you're storing to partition data.

play02:28

You can contrast this with vertical partitioning.

play02:30

There's a link in the description below,

play02:32

which uses columns to partition data effectively.

play02:35

But we are focusing on the horizontal partitioning bit and specifically we are

play02:40

focusing on one concept, which is sharding. Now,

play02:44

we mentioned that sharding is taking one attribute in the data and partitioning

play02:48

the data such that each server gets one chunk. But what I mean by servers,

play02:53

the servers here are database servers.

play02:56

Can contrast this with what we have been talking till now about normal servers.

play03:00

Normal servers are application servers.

play03:03

They're platform servers which deal with data,

play03:05

but they try to be as stateless as possible to keep things decoupled and

play03:10

really nice and clean. This is going to be dealing with the meat of the data,

play03:15

alright? And we can't afford to have any goof ups over here.

play03:19

Consistency is important.

play03:21

This is one of the key attributes of any database that whatever data

play03:26

you persist in it is what you can read out of it later on.

play03:29

And there is some sort of synchronization that if a person makes an update,

play03:33

the new request is going to read that update. Okay?

play03:37

So that is consistency. Also what we look at is availability,

play03:42

meaning that the database should not crash and stay down.

play03:46

You don't want that, you want your application to be running all the time,

play03:49

but consistency trumps availability. When it comes to data,

play03:52

in most cases there are more things to think about.

play03:56

What should you shard your data on? In our case, we have used user id,

play04:01

but in applications like Lindo, which use location,

play04:04

you could shard on the location. And then if a person says,

play04:08

find me all the users in C x,

play04:11

then X may fall in this shard and all you need to do is just read through this

play04:16

shard, which is what this database, database server number seven can do for you,

play04:21

right? That shard is going to be smaller in size,

play04:23

it's also going to be easier to maintain,

play04:25

probably going to give you faster performance. Everything good about sharding?

play04:29

And the first problem that you have to take into consideration is joins

play04:34

across shard. If these are across shards,

play04:38

what's going to happen is the query needs to go to two different shards.

play04:41

They need to pull out their data, then join the data across the network.

play04:46

And this is going to be extremely expensive. So one of the problems here

play04:52

joins.

play04:53

The second point comes when you look at the pizza and you realize that this is

play04:57

completely inflexible. The shards are inflexible.

play04:59

You can't have more pizza slices or less pizza slices. It's already done.

play05:04

But we want our database servers to be flexible in number.

play05:07

So one of the really good algorithms for this is consistent hashing.

play05:10

You should have a look at that. There's one database which actually uses this,

play05:15

and that is mem cached,

play05:21

right? This doesn't really implement consistent hashing.

play05:24

You can use an application logic about the database mem cashed

play05:30

to get your work done. So it's not really a problem. It might be a problem,

play05:35

but you can't have dynamic number of shots.

play05:42

Now to overcome this problem, what we do is take a shot,

play05:46

which has too much data in it and then dynamically break into pieces.

play05:51

So this pizza slice is like a pizza for us. Yeah,

play05:55

when we magnify it enough, it's going to be a really large slice,

play05:59

and then we break it into smaller pieces.

play06:02

So there's going to be some sort of a manager for every particular shot,

play06:07

which is going to map the requests to the correct mini slice,

play06:12

so to speak, in the pizza slice, single pizza slice.

play06:17

Using this technique, which is hierarchical sharding,

play06:20

we can get rid of the inflexibility over here. So point number two

play06:27

is no longer a big problem. Now,

play06:31

one of the smart things to do here is to create an index on these shards.

play06:37

Assuming your query requires that this index could be on a completely different

play06:42

attribute compared to the user id.

play06:45

And one of the good examples of this is find me all the people in New York who

play06:49

have age greater than 50. So if these are the city IDs,

play06:54

then New York is going to land, let's say here, and then you can index on age.

play06:59

So you'll find all users

play07:02

in New York within a given range of age.

play07:07

So all of your queries are fast.

play07:09

So that's the most important thing about sharding.

play07:12

Your read performance goes up and your right performance goes up because all of

play07:16

your queries fall on one particular point. But what happens if a shard fails?

play07:21

Let's say there's some sort of electricity issue over there. In that case,

play07:26

you could have something like a master slave architecture.

play07:29

The master slave architecture is a very common architecture.

play07:31

What happens in this is that you have multiple slaves which are copying the

play07:35

master. Whenever there's a right request, it's always on the master.

play07:39

The master is the most updated copy,

play07:41

while the slaves continuously pull the master and read from it.

play07:44

What then happens is if there's a read request,

play07:46

it can be distributed across slaves. While if there's a right request,

play07:51

it always goes to the master. In case the master fails,

play07:54

the slaves choose one master amongst themselves, right?

play07:57

And so there's good single point of failure tolerance over here.

play08:02

Conceptually, it's quite easy. You just take your data,

play08:05

break into pieces break into ranges essentially,

play08:08

and then persist in different places.

play08:12

But when it comes to practical application,

play08:14

this is quite tough because this guy consistency is

play08:19

difficult to do.

play08:20

And if you're just starting out with your system and you you're thinking about

play08:24

charting, I suggest that you take into consideration.

play08:26

Although mechanisms like indexing, like using NoSQL databases,

play08:30

which internally actually use these kind of concepts,

play08:34

but to use those ready-made solutions or to use well-known

play08:39

solutions like indexing is probably the way to go before you go for sharding,

play08:43

a database even more difficult than sharding is to hit the like and the

play08:48

subscribe button at the same time. If you're able to do that,

play08:51

then you'll get notifications for further videos and I'll catch you next time.

Rate This
β˜…
β˜…
β˜…
β˜…
β˜…

5.0 / 5 (0 votes)

Related Tags
Database OptimizationScalabilityShardingConsistencyAvailabilityHorizontal PartitioningData ManagementIndexingNoSQLConsistent Hashing