Database Design Tips | Choosing the Best Database in a System Design Interview

codeKarle
6 Jun 202023:27

Summary

TLDRThis video script delves into the critical role of database selection in system design, emphasizing its impact on scalability and performance. It outlines common use cases, such as caching with Redis, file storage with Amazon S3, text search with Elastic Search or Solr, and time series data with InfluxDB. The script guides through choosing between SQL and NoSQL databases based on data structure, query patterns, and scale, highlighting the importance of ACID properties for transactional systems and suggesting combinations of databases to meet diverse requirements.

Takeaways

  • 🔍 The choice of database is crucial for system scalability and is influenced by non-functional requirements (NFRs).
  • 📚 Databases do not impact functional requirements but can affect performance based on query patterns and data structures.
  • 📈 The suitability of a database depends on factors such as data structure, query patterns, and the scale of data to be handled.
  • 💾 Caching solutions like Redis, Memcached, etcd, and Hazelcast are essential for reducing database load and improving latency.
  • 🖼️ For file storage, particularly for images and videos, Blob Storage solutions like Amazon S3 are commonly used.
  • 🌐 Content Delivery Networks (CDNs) complement Blob Storage by distributing content geographically for faster access.
  • 🔎 Text Search Engines like Elastic Search and Solr, built on Apache Lucene, are used for implementing search capabilities with support for fuzzy search.
  • 📊 Time Series Databases such as InfluxDB and openTSDB are optimized for sequential data writes and time-range queries, ideal for metrics tracking.
  • 📊 Data Warehouses like Hadoop are used for offline reporting and analytics by aggregating data from various transactional systems.
  • 🔑 The decision between relational and non-relational databases often hinges on the structure of the data and the need for ACID transactions.
  • 🔄 For ever-increasing data with a finite number of query types, columnar databases like Cassandra and HBase are recommended.

Q & A

  • What is the primary factor that determines the choice of database in a system design?

    -The primary factor that determines the choice of database in a system design is the non-functional requirements (NFRs) such as query patterns, data structure, and scale to handle.

  • How do databases impact the functional requirements of a system?

    -Databases generally do not impact the functional requirements of a system. Any database can be used to satisfy the functional requirements.

  • What are the three main factors that influence the choice of database?

    -The three main factors that influence the choice of database are the structure of the data, the query pattern, and the amount of scale that needs to be handled.

  • Why is caching important in system design?

    -Caching is important in system design to reduce the number of times a database is queried, to improve response times, and to handle high-latency remote calls by storing responses locally.

  • What are some common caching solutions mentioned in the script?

    -Some common caching solutions mentioned in the script are Redis, Memcached, etcd, and Hazelcast.

  • How does caching work in terms of key-value stores?

    -In caching, the key is typically the query parameter or request parameter, and the value is the response expected from the system. This key-value pair is stored in key-value stores.

  • What is Blob Storage and why is it used?

    -Blob Storage is used to store large binary objects like images and videos. It is not a database but a data store where files are served as they are, without querying.

  • Why is Amazon S3 a popular choice for Blob Storage?

    -Amazon S3 is a popular choice for Blob Storage because it is cost-effective and widely used by many companies, making it a reliable and efficient solution for storing images and videos.

  • What is the purpose of a Content Delivery Network (CDN) in the context of Blob Storage?

    -A CDN is used to distribute the same image or video geographically across various locations, allowing users to access the content faster by querying servers closer to their location.

  • What is a Text Search Engine and why is it used?

    -A Text Search Engine is used to provide text searching capabilities on textual data, such as product titles and descriptions. It supports fuzzy search, which helps in handling misspellings and provides a better user experience.

  • What are some common implementations of Text Search Engines?

    -Some common implementations of Text Search Engines are Elastic Search and Solr, both of which are built on top of Apache Lucene.

  • Why are Time Series Databases used and what are some examples?

    -Time Series Databases are used to store and manage time-stamped data, such as application metrics. They are optimized for sequential updates and bulk read queries. Examples include InfluxDB and openTSDB.

  • What is the role of a Data Warehouse in system design?

    -A Data Warehouse is used for storing large amounts of data from various transactional systems and providing querying capabilities for offline reporting and analytics.

  • How does the choice between a relational and non-relational database depend on the data structure?

    -If the data is structured and can be easily modeled in tables with rows and columns, a relational database is suitable. For unstructured data or data with complex queries, a non-relational database like a document DB is more appropriate.

  • What are some scenarios where a combination of databases is used?

    -In real-world scenarios like an e-commerce platform, a combination of databases is often used. For example, an RDBMS can be used for inventory management, while Cassandra can be used for storing historical order data. Document DBs can be used for complex querying needs.

  • What are some common providers of relational databases?

    -Some common providers of relational databases include MySQL, Oracle, SQL Server, and Postgres.

  • What are the characteristics of ever-increasing data and why is a columnar DB like Cassandra suitable for this?

    -Ever-increasing data refers to data that grows at a rate that is more than linear, such as location pings from drivers in a service like Uber. Columnar DBs like Cassandra are suitable because they handle large volumes of data and support high write and read throughput.

Outlines

00:00

🗂️ Database Choice in System Design

The paragraph discusses the importance of database selection in system design, particularly for scaling. It emphasizes that while databases don't affect functional requirements, they significantly impact non-functional requirements (NFRs) such as query patterns, data structure, and scalability. The choice of database is influenced by data structure, query patterns, and scale. The video will cover common use cases and potential database solutions, including caching, file storage, text search engines, and time series databases. It also touches on SQL-NoSQL comparisons and the combination of databases for solving real-world problems.

05:01

🔍 Caching and File Storage Solutions

This paragraph focuses on caching solutions and their importance in system design to reduce database load and latency. It explains the concept of caching with keys and values, where keys are query parameters and values are the expected responses. Popular caching solutions like Redis, Memcached, etcd, and Hazelcast are mentioned, with Redis highlighted as a battle-tested and widely used solution. The paragraph also addresses file storage options, particularly for images and videos, recommending Amazon S3 for Blob Storage and Content Delivery Networks (CDN) for geographical distribution.

10:08

📚 Text Search Engines and Time Series Databases

The paragraph introduces text search engines, necessary for capabilities like searching product titles and descriptions, and mentions Elastic Search and Solr as common implementations built on Apache Lucene. It also covers fuzzy search, which allows for misspelled queries to return relevant results based on edit distance. Time series databases are then discussed as an extension of relational databases, optimized for sequential data updates and bulk read queries over time ranges, with InfluxDB and openTSDB as examples.

15:11

📊 Data Warehousing and Analytical Storage

This section delves into data warehousing for analytical purposes, such as transactional analytics in companies like Amazon or Uber. It contrasts transactional systems with data warehouses, which are used for offline reporting and require large databases to accommodate extensive data dumping and querying. Hadoop is suggested as a suitable solution for such use cases. The paragraph also explores the decision-making process for choosing between relational and non-relational databases based on data structure, transactional guarantees, and the nature of queries.

20:14

📈 Deciding Database Types for Varying Data Needs

The paragraph guides through the decision-making process for selecting the appropriate database type based on the data's structure and query requirements. It outlines scenarios for using relational databases when structured data and ACID transactions are needed, like in payment or inventory systems. For unstructured data with complex queries, document databases like MongoDB or Couchbase are recommended. The paragraph also discusses columnar databases for ever-increasing data with finite query types, citing Cassandra and HBase as stable options.

🛒 Combining Databases for Complex E-commerce Systems

The final paragraph illustrates the use of multiple databases in complex systems, such as an e-commerce platform like Amazon. It suggests using a combination of RDBMS for recent orders and Cassandra for permanent storage, leveraging the strengths of both for inventory management and data scalability. The paragraph also discusses using document databases for complex querying needs, like finding users who bought specific products, and emphasizes the importance of combining different databases to meet both functional and non-functional requirements in real-world scenarios.

Mindmap

Keywords

💡Database

A database is a structured collection of data, managed by a Database Management System (DBMS), which allows for the storage, retrieval, and management of data. In the context of the video, the choice of database is crucial for system design interviews, as it impacts scalability and performance based on non-functional requirements. The script discusses various types of databases and their suitability for different use cases.

💡Non-functional requirements (NFRs)

Non-functional requirements are aspects of a system that define constraints on the system's behavior, such as performance, scalability, and security. The script emphasizes that the choice of database can significantly influence how well a design meets these NFRs, as different databases are optimized for different operational characteristics.

💡Caching Solutions

Caching solutions are used to temporarily store data in order to improve performance by reducing the need to repeatedly access a slower data source. The script mentions caching as a common use case in system design, where technologies like Redis, Memcached, etcd, and Hazelcast can be employed to store frequently accessed data quickly.

💡File Storage Options

File storage options refer to the systems and services used to store and manage files, such as images and videos. The script discusses the use of Blob Storage, like Amazon S3, for storing large files and the integration of Content Delivery Networks (CDN) to distribute these files geographically for faster access.

💡Text Search Engine

A text search engine is a system that enables full-text search capabilities, allowing users to search for content within a text. The script highlights Elastic Search and Solr as common implementations of text search engines, built on Apache Lucene, which are used for providing search functionality in applications like Amazon or Netflix.

💡Fuzzy Search

Fuzzy search is a feature of search engines that allows for misspelled or near matches to be returned in search results. The script explains the concept with an example of a user searching for 'AIRPROT' instead of 'AIRPORT', and how a text search engine can recognize and correct such errors to provide a better user experience.

💡Time Series Database

A time series database is optimized for storing and managing time-stamped data, typically used for applications that require tracking metrics over time. The script mentions InfluxDB and openTSDB as examples of time series databases, which are designed to handle sequential data writes and time-range queries efficiently.

💡Data Warehouse

A data warehouse is a large, centralized repository of data designed for query and analysis rather than for transaction processing. The script discusses the use of data warehouses like Hadoop for storing large volumes of data from transactional systems and providing reporting and analytics capabilities.

💡Relational Database

A relational database is a type of database that stores data in tables with pre-defined columns and rows, allowing for complex queries and relationships between data. The script explains that relational databases are suitable for structured data and situations requiring ACID (Atomicity, Consistency, Isolation, Durability) properties, such as in payment systems.

💡NoSQL

NoSQL refers to non-relational, distributed, and open-source databases that are designed to handle large volumes of structured, semi-structured, or unstructured data with flexible schema. The script discusses NoSQL databases like MongoDB and Couchbase for scenarios where the data is not structured or when complex queries on varied data types are needed.

💡Columnar DB

A columnar database is a type of database that stores data tables by column rather than by row, which is beneficial for analytical queries on large datasets. The script recommends using columnar databases like Cassandra or HBase for handling ever-increasing data with a finite number of query types, such as tracking location pings for Uber drivers.

Highlights

The choice of database significantly impacts the scalability of a system design, particularly in relation to non-functional requirements.

Databases do not generally affect functional requirements, allowing any database to be used to meet these needs.

The structure of data, query patterns, and scale requirements are key factors in choosing a database.

Caching solutions like Redis, Memcached, etcd, and Hazelcast are essential for reducing database load and improving latency.

Redis is a popular and stable caching solution often used in system designs.

File Storage Options, such as Amazon S3, are crucial for storing images and videos in systems like Amazon or Netflix.

Content Delivery Networks (CDNs) complement Blob Storages like Amazon S3 by geographically distributing content for faster access.

Text Search Engines like Elastic Search and Solr, built on Apache Lucene, are vital for implementing search functionalities in systems.

Fuzzy search capabilities in search engines allow for user-friendly search results even with misspelled queries.

Time Series Databases, such as InfluxDB and openTSDB, are optimized for sequential data updates and bulk read queries in metrics monitoring systems.

Data Warehouses, like Hadoop, are used for offline reporting and analytics by aggregating data from various transactional systems.

The decision between relational and non-relational databases often hinges on the data structure and the need for transactional guarantees.

Relational databases like MySQL, Oracle, and SQL Server are suitable for structured data that requires ACID properties.

Document databases, such as MongoDB and Couchbase, are ideal for unstructured data and complex queries.

Columnar databases like Cassandra and HBase are best for handling ever-increasing data with a limited number of query types.

In real-world scenarios, a combination of databases is often necessary to meet both functional and non-functional requirements.

The transcript provides a cheat sheet for database selection based on different use cases, which can be a helpful guide in system design interviews.

Transcripts

play00:02

Hi everyone. Let's talk about databases now. So in your system design interview

play00:07

how good your design is and how well it can scale depends very much on the

play00:12

choice of database that you've used. Now databases generally do not impact your

play00:17

functional requirements, so whatever your functional requirement is you can

play00:21

satisfy that by using any database that you want. But normally the

play00:26

non-functional requirements are the ones which are impacted by the choice of

play00:29

database. So let's just say you have certain query patterns, or certain kind

play00:34

of a data structure or certain kind of a scale to handle, there are a different

play00:37

set of databases optimized for different kinds of such things, right. So based on

play00:42

the choice of database it would impact how well your design can scale up to the

play00:47

requirements that are given as part of your non-functional requirements (NFRs) so in

play00:51

this video what we'll do is we'll go over some very common use cases that

play00:54

normally you might come across in your System Design Interview and then look at

play00:59

some of the potential solutions or possible set of databases that you can

play01:02

use to handle those scenarios.

play01:04

Now normally the choice of database depends

play01:07

on a couple of factors. The very first thing is the structure of the data that

play01:11

you have. Whether it's a very structured data or a totally non structured data

play01:16

that is one of the factors. The other factor that impacts the choice of

play01:20

database is the query pattern that you have. And the next obvious one is the

play01:24

amount of scale that you need to handle. So all these three factors would make a

play01:29

difference in to the choice of database that you want. Now in this video we'll

play01:33

look at some common obvious ones in the beginning and then towards the end we'll

play01:37

look at the SQL-NoSQL comparison and a combination of certain kinds of

play01:41

databases that can help you solve a real-world problem.

play01:44

Let's talk about

play01:45

Caching Solutions. So whichever system you are designing in an interview, you would

play01:50

definitely have to use some caching solution there. Now there are a lot of

play01:54

use cases for caching. Let's say you are querying a database and you do not want

play01:58

to query the database a lot of times, you could cache the value in a cache, right.

play02:02

Alternatively, if you are making a remote call to a different service and that is

play02:07

having a high enough latency, you might want to cache the response of that

play02:11

system locally at your end in a caching solution

play02:16

And there could be a lot of use-cases for caching. Now generally

play02:19

the way caching works is you have a key and you have a value. So key

play02:23

normally is whatever your where clause is in the query or whatever your query param

play02:28

or request params are when  you're making an API call 

play02:31

and value is basically the response that

play02:33

you are expecting from the other system, right. So all of these kind of values

play02:37

would be stored in normally key value stores. Now very common used solutions

play02:42

are "Redis". You could also use "Memcached", you could use "etcd". "Hazelcast" is

play02:48

picking up a lot these days. So you could use any of them. Normally in the

play02:51

videos that I make I tend to use Redis because that is a very battle-tested

play02:55

solution and it is used by all the big companies in the world and it is fairly

play02:59

stable. So in whatever the System Design you have, where you have to use a caching

play03:03

solution, you could use a Redis. I'll just note it down so that you can refer it

play03:09

Redis is the solution for caching. Now again this does not mean that you cannot

play03:15

use anything else, you could use any similar solution and that would work just fine.

play03:19

Next let's look at some File Storage Options. So let's just say you

play03:23

are designing a system like Amazon where you are having various products that you

play03:28

are selling. Now the sellers would be uploading product images, maybe products

play03:32

videos, right. You need a data store to store those

play03:34

images and videos. Similarly you could be potentially building a system like

play03:39

Netflix which has videos all together. And you need a storage to support videos

play03:43

So wherever you have a image/video kind of a thing, there we'll use something

play03:48

called as a Blob Storage. Now these are not really databases. Databases are

play03:53

fundamentally meant for things where you can query on. Now a file is not

play03:57

something which you normally query on. You will just serve it as it is, right.

play04:01

So then is when Blob Storages come in. Now there are a lot of providers for that

play04:06

one of the most common ones and the and 

play04:08

a fairly cheaper (Cost Effective)  one is Amazon S3 and it

play04:12

is a fairly good system used by a lot of company, so whenever you have a

play04:16

requirement of storing images/videos and anything of that sort, you could use

play04:19

Amazon S3 as a data store. Now along with S3 you might want to use something

play04:24

called as a Content Delivery Network which is often called as a CDN.

play04:27

Now CDN is generally used for distributing the same image

play04:31

geographically in a lot of locations. So let's take a simple example that you

play04:35

have a product image that is stored in Amazon S3 as a primary data source and

play04:40

there are a lot of people coming from throughout the globe who are accessing

play04:43

that product. So you might want to distribute that image into various

play04:47

servers across the globe so that individual people can query them in a

play04:51

much faster way as compared to querying an S3 which is probably located in a

play04:55

couple of locations. So for all Blob kind of content you would then be

play05:00

using something like a for all blob you would be using something like the S3

play05:04

plus a content delivery network and that should be fairly fine on all your

play05:10

interview questions. Now the next very common use case that you might encounter

play05:14

is for example if you're building a product like Amazon and you need to

play05:18

provide text searching capabilities on various products. So the seller has uploaded

play05:23

a product with a particular title some description and you want users to search

play05:28

that, right. Now that search thing would be provided on a text of title and on

play05:33

the text of description. Very similar use cases would be when you have a product

play05:37

like Netflix to build and you want  to... give the option for the users to

play05:42

search or movie name, movie titles, genres, maybe cast and crew names and all of that.

play05:47

Or alternatively you might be designing something like an uber or

play05:50

Google Maps kind of a thing where you want to provide text searching

play05:54

capability with support for fuzzy search. So for all of these kind of use cases

play05:59

you would be using something called as a Text Search Engine. Now a very common

play06:03

implementation of a Text Search Engine is provided by Elastic Search and Solr,

play06:08

and both of them are built on top of something called an Apache Lucene

play06:12

now Lucene fundamentally provides these text searching capabilities and that is

play06:16

then being used by both of these products Elastic Search (ES) and Solr. So the next

play06:21

use case and quickly write it down is your text search which is Elastic Search

play06:27

and Solr. Now one more thing that they support is something called as a Fuzzy Search.

play06:33

So what is that is.. I will quickly go over that. Let's say if you are searching

play06:37

for the word Airport and let's say is the user typed in

play06:41

"AIRPROT" with the wrong spelling, okay. this O and R are interchanged right. Now, If a

play06:48

user searches for this and you do not return back any result, then that's a bad

play06:52

user experience, right. So you need your database to be able to figure out that

play06:56

the user did  

play06:57

not really meant this thing [AIRPROT], the user actually meant AIRPORT, right?

play07:01

Now how does that database identify, so this word can be converted into the

play07:07

correct spelling of airport by changing two characters, right.

play07:12

R needs to be converted into O and O needs to be converted into R, right? so this is at

play07:18

an edit distance of two. So you can provide a level of fuzziness that your

play07:25

search engine needs to support. This has a fuzziness factor of two, which is the

play07:30

Edit Distance. Normally there are a lot of other factors also that come in but this

play07:35

is roughly how fuzzy searching is implemented in most of the solutions.

play07:38

So wherever you have any search capabilities there you use either Elastic Search

play07:43

or Solr. Now one very important thing about both

play07:46

of these are, these are not databases. These are search engines. So the

play07:51

difference between a Search Engine and a Database is whenever you write something

play07:54

in a Database, Database gives you a guarantee that that data wouldn't be

play07:58

lost. Now both of these data stores don't

play08:01

give you any such guarantee. They claim that they are giving a good enough

play08:05

availability and a redundancy and all of that but potentially data could be lost

play08:10

so you should not keep any of these as your primary source of truth

play08:14

Your primary data store should be somewhere else and it you could load that data

play08:18

into either of these systems to provide the searching capabilities and these two

play08:22

are very efficient at search. Next, let's look at what do you do if you want to

play08:27

store some metrics kind of a data. So let's say if you are building a system

play08:30

like Graphite, Grafana or Prometheus which is basically an application

play08:34

metrics tracking system. So let's say if the use case that you are given is a lot of

play08:40

applications are pushing metrics related to their throughput, their CPU

play08:44

utilizations, their latencies, and all of that. And you want to build a system

play08:47

to support that. Then is when comes something called as a

play08:51

Time Series Database. Now think of Time Series Database as an extension of

play08:56

relational databases but with not all the functionalities and certain

play09:00

additional functionalities. So regular relational databases that you have would

play09:05

have the ability to update a lot of records right or they would also give

play09:09

you the ability to very random records but whenever you are building a metrics

play09:14

monitoring kind of a system you would never do random updates. You would always

play09:17

do sequential update in append-only mode. So if you have put an entry at time T1

play09:22

the next entry would be a time T 2 which is greater than T1. The next entry would

play09:26

be at time T3 which is greater than T1 and T2 right. So it's an append only

play09:30

write mode. Also the read queries that you do, they are kind of bulk read

play09:35

queries with the time range, right. You query for last few minutes of data or

play09:38

few hours of data or few days of data, right. But you don't do a random read or

play09:43

a random update. Now time series databases are optimized for this kind of

play09:47

a query pattern and input pattern. So there are a lot of time series databases

play09:52

InfluxDB is one of them, openTSDB  is one of them so you could use

play09:56

either of them if you have that kind of a use case. I will quickly write this

play10:00

thing also. For time series, you use something called as an openTSDB, it stands for

play10:08

Open Time Series Database. The next use case is when you have lot of information and you

play10:14

want to store all of that information of a company in a certain kind of a data

play10:20

store for various kind of analytics requirements. So let's just take an

play10:23

example of something like an Amazon or Uber or any system design where you want

play10:28

to provide analytics on all the transactions. You might want to provide

play10:32

analytics like how many orders I'm having, what geographies are giving me

play10:35

what revenues, which is the most sought after item, stuff like that. So where you

play10:40

want to do analytics on the data of the whole company, there you need something

play10:44

like a Data Warehouse. That basically is a large database in which you can dump

play10:51

all of that data and provide various querying capabilities on top of the data

play10:55

to serve a lot of reports. Now these are generally not used for transactional

play11:00

systems, these are generally  used for offline reporting.

play11:02

So if you have that kind of use case then you can use something like a Hadoop

play11:06

which can very well sit in for that purpose, where you put in a lot of data

play11:11

from various transactional systems and then build a lot of systems that can

play11:14

provide reporting on top of that data.

play11:16

Now let's look at slightly tricky

play11:19

scenarios where you might want to choose 

play11:21

between a relational and  a non-relational database.

play11:23

So the very first thing that helps you decide what kind of a database

play11:27

do you want to use, is the structure of the data. So if you have a very

play11:31

structured information then possibly a relational database makes sense. Now what

play11:36

is a structured information? it would be an information that you can easily model

play11:40

in form of tables and tables would have rows and columns of information.

play11:44

For example if you want to store a user information something like a user

play11:47

profile on any social network it would have name, email address, city, phone number,

play11:52

and bunch of very standard information that each user will have

play11:54

That would be a structured information. So if we try to make a flowchart of how

play12:00

to decide which database to use this is how it would look like.

play12:03

The very first choice that we need to make is whether we have a structured data or an

play12:08

unstructured data. Now let's say if we have a structured data the next question

play12:12

that we need to ask ourselves is, do we need any atomicity or transactional

play12:16

guarantees from the database or not? So, let's just say that you're building a

play12:20

payment system which supports a feature like somebody can transact money from

play12:24

their account to somebody else's account, right. Now fundamentally at the very core

play12:28

of it, it will have two queries, one of them will reduce the amount from the

play12:32

person's account and the other query would add the amount into the

play12:36

beneficiary's account, right. Basically reducing from account A and adding into

play12:40

account B. Now your database should be able to provide you certain guarantees

play12:44

which wrap both these query into transactional boundaries, saying either

play12:49

both of them would execute or both of them would not execute. But it should

play12:53

never happen that amount has been debited from account A but not credited

play12:58

in into account B or credited into account A but not debited from account B.

play13:02

Something of that sort should not happen. Also it should provide you some

play13:06

consistency saying if you have done a transaction the next call that you made

play13:10

to fetch the account balances it should reflect the amount. It should not be that

play13:14

sometimes it is reflecting the correct amount

play13:16

and sometimes it is not  reflecting the correct amount. 

play13:18

If you have that kind of

play13:20

requirement where in your building of payment system or an inventory

play13:23

management system where you have the count of number of products that you

play13:26

have while people are buying them and the count needs to reduce, for all of

play13:30

those kind of scenarios, that would fall into that yes you need an atomicity

play13:35

you need consistent, you need transactions, then basically you need to

play13:38

use a relational database, okay. Now there are multiple providers of relational

play13:44

database you could use any of them. Some of the very common ones are MySQL,

play13:47

Oracle, SQL Server, Postgres and there are a lot of them. I've just mentioned

play13:52

the common ones it does not mean that you cannot use any other database of

play13:56

your choice. Feel free to use any database which provides you ACID Guarantees.

play14:01

Now let's say you have relational data but you do not need

play14:05

ACID Guarantee. Let's say you are just storing user information which does

play14:09

not have any use case of these atomicity requirements. You could still choose

play14:13

to use a relational database or you could choose to use a non relational

play14:17

database, it wouldn't make too much of a difference, because normally you would be

play14:20

easily be able to map a structured data into a NoSQL model, so either of

play14:25

these scenarios would be fine. Now let's say you do not have structured

play14:30

data - so what do you do? So there are a bunch of scenarios in which your use case

play14:34

might fit in. So maybe you are trying to build a catalogue kind of a system for an

play14:39

e-commerce platform like Amazon which has information of all the items that

play14:43

are available on that platform, right. So let's just say if you are building that

play14:48

catalogue for Amazon and if we take certain examples, so let's say there is

play14:52

the item like shirt. Now each item would have certain attributes. A shirt would

play14:57

have an attribute like a size: could be large, a color: could be red, something of

play15:00

that sort. If you have a refrigerator, it would have a volume like

play15:06

200 litres/400 litres or whatever. It would also have a power saving mode like

play15:11

3-star/5-star/what not. So those are the attributes of a refrigerator, right.

play15:15

Similarly, something like a milk, would have quantity and an expiry date, right.

play15:21

Now normally when you are on an e-commerce platform you not only need to

play15:25

see these attributes if it is just about seeing them you could

play15:28

kind of dump it as a Json and store it in any databases. But normally you would

play15:33

also want to query that. Now querying on a JSON or random attributes

play15:38

is a bit tricky on the relational databases, but there are certain kinds of

play15:41

databases that are optimized for those kind of queries. So these are the

play15:46

databases where you have a lot of data not just data in terms of volume but in

play15:50

terms of structure. So if you have lot of attributes that can come in and a wide

play15:54

variety of queries that can come in, then you fall into this category, and if that

play15:59

is the case then you need to use something called as a document DB.

play16:01

Now there are a lot of providers of document  DBs, MongoDB, Couchbase are some of them.

play16:07

Earlier we looked at Elastic Search and Solr for text searching those are

play16:11

also special cases of Document Database. Now let's just say your your data is not

play16:17

relational and you do not have complex queries, you have a couple of

play16:22

straightforward queries you could still use a document DB if it doesn't fall

play16:26

into the third category. Now what is the third category? Let's say if you have an

play16:30

ever-increasing data. What do I mean by ever-increasing data? So let us take an

play16:34

example of Uber. So all the drivers of Uber are continuously sending location pings.

play16:39

so let's stay there are some number of drivers and they kind of translate

play16:44

into X number of location records per day. So there would be X number of

play16:50

records inputted per day. But this X would not be a constant, it would be a

play16:55

growing number why because the number of drivers of uber are increasing day by

play16:59

day right so this data would become probably X one day one. Additively 1.1X

play17:04

on day two, 1.2x on date three, so on and so forth.

play17:08

So it would not be increasing in a linear fashion it would be increasing but in a

play17:12

more than a linear fashion. So that is what I am calling an ever increasing data.

play17:18

Plus if you have finite number of type of queries. So let's if you want to track

play17:23

location pings of drivers the most important query that you will do is find

play17:26

all the locations of a driver whose driver ID is something right. So if you

play17:32

have less number of type of  queries (maybe high volume) but  

play17:35

a large amount of data then these kind of

play17:37

databases would be the best choice for you. This is something called as a

play17:41

columnar DB or a Column oriented DB and Cassandra

play17:45

and HBase are the most used and most stable options out there for these

play17:50

kind of scenarios. Now there would be a lot of other alternate as well but these

play17:55

are the most stable ones and have been battle tested again for a lot of years I

play17:59

would recommend to use either of these two in such kind of scenario. Now in my

play18:03

videos I generally prefer using Cassandra over HBase, the only reason being

play18:07

Cassandra is not very heavy to deploy HBase generally tends to have a lot of

play18:11

components that it comes with, but performance wise both of them are

play18:15

roughly similar. Each of them have their pros and cons but in a design interview

play18:19

it wouldn't matter which one do you use. Now, what if you don't have any of these what

play18:25

if you don't need ACID, you don't need these wide variety of data types and

play18:30

query types and you don't have an inner ever-increasing data? Then you can use

play18:35

any data base of your choice. Then it would basically be a low scale system

play18:39

having a very small number of queries on a very small number of attributes on a

play18:44

very small size of a data set. You would generally not get these kind of

play18:47

requirements in a system design interview but if you get you could use

play18:52

any of these unless it's something that we've already talked about which are one

play18:55

of the special cases in some other scenario. So if it's not that you

play18:59

could use any of these data bases and that should be fine. But normally when

play19:04

you are in a system design you would not get a design that you will be satisfied

play19:07

by just one of these databases. So let's look at slightly more trickier

play19:12

real-world scenarios. Now let's take an example of us building an e-commerce

play19:16

platform something like an Amazon. Now when you are managing inventory on that

play19:21

side you want to make sure that you are not

play19:23

over selling items. So let's say there is just one quantity of a particular item

play19:27

left and there are 10 users who want to buy that you want to have this ACID

play19:31

properties there to make sure that only one of the users should be able to

play19:35

commit the transaction other users should not be able to commit the

play19:38

transaction, right. You can put constraints and all of that on there.

play19:42

It would make sense to use a RDBMS database for the inventory management

play19:47

system of a system like Amazon or maybe an order management system right

play19:52

But if you look at the data of Amazon it is an ever interesting data. Why? because

play19:56

the number of orders are additive each day new orders are coming in they cannot

play20:02

purge the data due to lot of legal reasons plus the number of orders are

play20:06

also increasing so it naturally fits into this model right with you

play20:10

where I am recommending together Cassandra. So what you could use is a

play20:14

combination of both of these databases. You could use MySQL or any other

play20:19

RDBMS alternate for storing data about the orders it has just placed and not

play20:24

yet deliver to the customer. Once the item is delivered to the customer you

play20:29

can then remove from this RDBMS, and put it into Cassandra as a permanent store.

play20:34

I walked through this implementation in much more detail in

play20:36

the Amazon system Design, so I would recommend to have a look on how a

play20:40

combination of database can become a very power choice there.

play20:43

Now let's look at another example again taking an example of Amazon, let's say if you want

play20:49

to build a reporting kind of a thing which lets you query something like

play20:54

get me all the users who have bought sugar in last five days

play20:58

now sugar is not just a product. There are a lot of sellers selling different

play21:04

sugar alternates of different companies maybe, of different qualities maybe, right.

play21:09

so sugar would then be like a lot of item IDs right and on top of those a lot

play21:14

of item IDs there must be a lot of orders. Now again I am saying that

play21:18

orders would either be in Cassandra or in this (RDBMS) right but if you are doing

play21:24

random queries where some you might want to query on who bought sugar, who bought

play21:29

TV, who bought TV of certain quality, who bought a fridge of certain quality, that

play21:34

kind of logically goes into a model where I was recommending to use a Document DB.

play21:38

now what you could do is you could store the querying part over

play21:42

here (Document DB). You could basically say  that I'll store a subset of order information

play21:47

into a MongoDB which says that user ID so and so, had an order ID so-and-so, on

play21:54

some particular date, which has these ten item ID in the certain quantity, right

play21:59

On this database you could run a query which will return you a list of users

play22:03

and list of order right and then you could take those Order IDs and query

play22:08

on both of these systems (RDBMS+Cassandra) right  so here we are using all the three systems in

play22:12

combination to provide various querying capabilities like with user bought what

play22:18

kind of use case, right. So in any real world scenario you would have to use a

play22:23

combination of such databases to fulfill the functional and non-functional

play22:27

requirements that you have. Now all of these being said this is just an

play22:31

indicative cheat sheet of what database fits well in what kind of a scenario.

play22:35

when you are in an interview you could use this to kind of get done with it but

play22:41

if you are actually using it in a real-world system I would recommend to

play22:44

read a bit more about it to choose which out of these few do you want to

play22:48

actually use, right. Also there are a lot of other databases that I have not talked

play22:52

about which are basically examples of RDBMS's or example of columnar

play22:56

databases. It does not mean that these are the only options. There are a lot of

play23:00

other options out there you could explore them and use those as well.

play23:03

so yeah I think

play23:05

this should be a good enough place for a database choice given a certain use-case.

Rate This

5.0 / 5 (0 votes)

相关标签
Database DesignSystem ScalingSQL DatabasesNoSQL OptionsCaching SolutionsData StructuresQuery PatternsData StorageElastic SearchCassandraRedis
您是否需要英文摘要?