Types of Databases | Criteria to choose the best database in the System Design Interview

Think Software
30 Sept 202235:16

Summary

TLDRThis video script offers a comprehensive guide on selecting the right database for distributed services. It emphasizes the importance of considering various criteria such as ease of learning, product maturity, data model, scalability, and cost. The script discusses different types of databases, including relational, NoSQL, document, column family, graph, time series, and NewSQL databases, highlighting their pros and cons. It also provides scenarios to illustrate when each type is most suitable, ultimately guiding viewers on making informed decisions based on their service's specific requirements.

Takeaways

  • 📚 Choosing the right database for a distributed service is crucial as it involves trade-offs and depends on specific service requirements.
  • 🔍 The decision is not as simple as choosing a relational database for structured data or a NoSQL database for semi-structured or unstructured data.
  • 📈 Multiple criteria must be considered when selecting a database, including learning curve, product maturity, technical support, data model/schema, scalability, and cost.
  • 🤔 Sometimes a choice must be made between two databases, neither of which is fully appropriate, and the more suitable option must be selected based on the given requirements.
  • 🛠️ Different database types have their pros and cons, and the script discusses relational, key-value, document, column family, graph, time series, and NewSQL databases.
  • 🔑 Key-value stores are simple NoSQL databases suitable for scenarios with primary key access and where scalability and flexible schema are needed.
  • 📄 Document databases are a superset of key-value stores, allowing for a flexible schema and the ability to create secondary indexes on document attributes.
  • 🗂️ Column family databases are optimized for accessing data based on columns and are suitable for scenarios requiring aggregation queries on large datasets.
  • 🌐 Graph databases excel in scenarios involving complex relationships and multi-hop queries, making them ideal for fraud detection and recommendation systems.
  • ⏱️ Time series databases are specialized for storing and retrieving timestamped data, which is useful for IoT and financial systems.
  • 🔄 NewSQL or distributed SQL databases combine the ACID properties of relational databases with the scalability and sharding support needed for large-scale systems.

Q & A

  • What is the primary consideration when selecting a database for a distributed service?

    -The primary consideration is choosing the most appropriate database based on the requirements of the service, as migrating databases later can be costly and risky.

  • Why is it not as simple as choosing a relational database for structured data or a NoSQL database for semi-structured or unstructured data?

    -Because the decision is not binary and depends on multiple criteria that need to be weighed in parallel, such as the specific needs and trade-offs of the service.

  • What are some of the criteria used to select a database?

    -Criteria include ease of learning curve, database product maturity and technical support, data model or schema, scalability, QD and data access pattern, cost, and ACID or BASE requirements.

  • What does ACID stand for in the context of database transactions?

    -ACID stands for Atomicity, Consistency, Isolation, and Durability, which are properties of database transactions intended to guarantee data validity.

  • What is the difference between ACID and BASE systems?

    -ACID systems prioritize consistency and guarantee data validity through a set of transaction properties, while BASE systems prioritize availability and are eventually consistent, often used in distributed databases.

  • Why are relational databases considered intuitive and straightforward for data representation?

    -Relational databases represent data in the form of tables with rows as records and columns as attributes, making it easy to establish relationships between different data points.

  • What are some limitations of relational databases?

    -Limitations include difficulty in scaling horizontally, manual sharding, rigid schema that is hard to change, and not optimized for queries that require graph traversal.

  • What scenarios are key-value data stores a good fit for?

    -Key-value data stores are suitable for scenarios where all data access is through a primary key, scalability is important, and there is no need for complex queries like joins.

  • What are some examples of NoSQL databases?

    -Examples of NoSQL databases include MongoDB, CouchDB, Azure Cosmos DB, Apache Cassandra, and Amazon DynamoDB.

  • Why are document databases considered a superset of key-value databases?

    -Document databases are considered a superset of key-value databases because they allow the database to examine the internal data of the value, unlike key-value stores where the value is opaque.

  • What scenarios are column family databases or column-oriented databases suitable for?

    -Column family databases are suitable for scenarios where you only need to query a subset of columns, and column-oriented databases are efficient for analytic scenarios due to better compression and faster access to a subset of columns.

  • What are some use cases for graph databases?

    -Use cases for graph databases include fraud detection in payment systems, recommendation systems in social networks, and any scenario where complex multi-hop queries are required.

  • Why are time series databases optimized for storing and serving time series data?

    -Time series databases are optimized for storing and serving time series data because they allow fast insertion and retrieval of large amounts of timestamped data, supporting complex analysis on this type of data.

  • What is the difference between NewSQL or distributed SQL databases and traditional relational databases?

    -NewSQL or distributed SQL databases are relational databases with implicit support for sharding and scalability, making them suitable for systems that require ACID compliance and handle large volumes of data and high throughput.

  • How should one decide whether to use a relational database for semi-structured data?

    -One should decide based on whether the data can be accessed using an ID and stored in a BLOB or CLOB field within a relational database table, considering factors like performance and the need for complex queries.

  • What are the implications of updating a count or similar attribute in a database system?

    -Updating a count or similar attribute can change the system from read-heavy to write-heavy, affecting the choice of database and its ability to handle the increased write load efficiently.

Outlines

00:00

🔍 Selecting a Database for Distributed Services

The video begins by addressing the complexity of choosing a database for distributed services, given the multitude of technologies available. It emphasizes the importance of making informed trade-off decisions, as each database is optimized for specific scenarios. The role of software architects and senior engineers in selecting an appropriate database is highlighted, considering the costly and risky nature of migrating databases later in the development process. The speaker outlines various criteria for database selection, including the type of data (structured, semi-structured, or unstructured) and the required consistency model (ACID or eventual consistency). The paragraph stresses the need to weigh different database options against multiple criteria and service requirements before making a decision.

05:03

📈 Criteria for Database Selection

This paragraph delves into the specific criteria used to select a database, starting with the ease of learning curve, where the familiarity and adaptability of developers to a database play a significant role. The maturity of the database product and the availability of technical support are also crucial, with examples like Oracle database highlighting the importance of community size and language support. The data model or schema, which could range from structured SQL to unstructured data like media files, is another key factor. The paragraph also touches on scalability, the balance between read and write operations (QD), and cost considerations. It discusses the necessity of choosing between ACID compliance and eventual consistency based on system requirements, introducing the CAP theorem and the trade-offs between consistency, availability, and partition tolerance.

10:06

🗃️ Types of Databases and Their Use Cases

The speaker provides an overview of different database types, starting with relational databases like Oracle, MySQL, and Microsoft SQL Server, which are based on the relational model and are intuitive for data representation in tables. The limitations of relational databases, such as vertical scaling and the complexity of sharding, are discussed. The paragraph then introduces key-value data stores, which are simple NoSQL databases similar to hash tables, and are ideal for scenarios with primary key access and flexible database schemas. Examples of key-value stores like Redis, Memcached, and Amazon DynamoDB are given, along with scenarios where they are suitable, such as session information storage and shopping cart management.

15:09

📄 Document Databases and Their Applications

The paragraph discusses document databases, where the basic unit of storage is a document, often in JSON or XML format. Unlike relational databases, documents can have varying schemas, and secondary indexes can be created on document attributes. Document databases are beneficial for content management systems, e-commerce applications with flexible schemas, and any application requiring the storage of user-generated content. Examples of document databases include MongoDB, CouchDB, and Azure Cosmos DB, which are suitable for scenarios where a flexible data model is necessary and where data can be self-describing.

20:10

📊 Column Family and Column-Oriented Databases

The speaker explains column family databases, also known as wide-column databases or BigTable databases, which store data in columns rather than rows, allowing for efficient access when querying a subset of columns. Column-oriented databases are a subset that stores all data column-wise, offering benefits in data compression and access efficiency. The main difference between document and column family databases lies in the query model and the ability to address individual columns for updates. Suitable scenarios for column family databases include systems requiring aggregation queries on large datasets. Examples include Apache HBase, Cassandra, and Google's BigTable.

25:13

🌐 Graph and Time Series Databases

This paragraph introduces graph databases, which represent data as a network of nodes and edges, ideal for scenarios requiring traversal of complex relationships and multi-hop queries. The speaker contrasts this with the use of relational databases or document stores for simpler, single-hop relationships. Graph databases are particularly useful in fraud detection, recommendation systems, and social networks. The paragraph also mentions time series databases, optimized for storing and querying timestamped data, which is common in IoT and financial markets. These databases enable fast insertion and retrieval for complex time series analysis.

30:15

🚀 NewSQL and Distributed SQL Databases

The final paragraph of the script discusses NewSQL or distributed SQL databases, which can be seen as relational databases with built-in support for sharding and scalability. These databases are particularly suitable for systems where ACID compliance is mandatory and data volume or throughput is too large for a single machine. Examples include financial systems and large-scale billing services. The speaker also revisits the scenario of designing a URL shortening service, discussing the database selection process based on data structure, query load, and scalability requirements. The importance of understanding the system's needs and choosing a database that can accommodate growth and performance is emphasized.

🔑 Summary of Database Selection Criteria

The video concludes with a summary of the key points discussed regarding database selection. It clarifies that a simplified decision tree based on data structure or ACID properties is not sufficient for choosing the right database. Instead, a more complex matrix or table is needed to determine the most suitable database for a given scenario. The speaker advises starting with a relational database and considering other types like key-value, document, column-family, graph, and time series databases based on specific requirements. The importance of understanding the nature of data, whether structured, semi-structured, or unstructured, and the system's need for scalability, performance, and consistency is reiterated.

Mindmap

Keywords

💡Distributed Service

A distributed service refers to a system where components are spread across different networks and computers but work together as a cohesive unit to provide a service. In the context of the video, selecting a database for a distributed service is crucial as it impacts scalability, performance, and the ability to handle a growing amount of data and traffic. The video discusses the complexities involved in choosing the right database to support such services.

💡Database Trade-offs

Database trade-offs are decisions that involve weighing the pros and cons of different database technologies based on their features and the specific requirements of a service. The video emphasizes that with numerous database products available, each optimized for certain scenarios, it's essential for software architects to make informed decisions considering these trade-offs to align with the service's needs.

💡ACID Properties

ACID stands for Atomicity, Consistency, Isolation, and Durability, which are a set of properties that guarantee that database transactions are processed reliably. The video explains that choosing a database may depend on whether the system requires ACID compliance to ensure data validity despite errors or if it can operate with eventual consistency, a concept often used in NoSQL databases.

💡Relational Database

A relational database is a type of database that stores data tables in rows and columns, with relationships between the tables established through keys. The script mentions that relational databases like Oracle, MySQL, and Microsoft SQL Server are based on the relational model and are intuitive for representing data in tables. They are considered for scenarios where structured data and relationships are predominant.

💡NoSQL Database

NoSQL databases are non-relational databases that are designed to handle a wide variety of data models, including key-value, document, columnar, and graph formats. The video script discusses that NoSQL databases like MongoDB, Cassandra, and Amazon DynamoDB may be chosen over relational databases when dealing with semi-structured or unstructured data, or when scalability and flexibility are more critical than ACID compliance.

💡Scalability

Scalability refers to the ability of a database to handle growth in data volume and traffic. The script highlights that when selecting a database, it's important to consider its capacity for scaling horizontally (across multiple machines) or vertically (upgrading a single machine), as this will affect the database's performance under increased load.

💡Data Model/Schema

A data model or schema defines the structure of data in a database, dictating how data is organized, stored, and accessed. The video explains that the nature of the data, whether structured, semi-structured, or unstructured, plays a significant role in choosing the appropriate database, as different databases offer varying levels of support for different data models.

💡Consistency

In the context of databases, consistency refers to the degree to which the database maintains accurate and reliable data at all times. The script contrasts the strong consistency guarantees of ACID-compliant databases with the eventual consistency model often used in BASE (Basically Available, Soft state, Eventual consistency) systems, which prioritize availability over strict consistency.

💡CAP Theorem

The CAP theorem is a concept in distributed systems that states that it is impossible for a distributed system to simultaneously provide all three guarantees: Consistency, Availability, and Partition tolerance. The video script uses the CAP theorem to illustrate the trade-offs between consistency and availability, particularly when discussing NoSQL databases.

💡NewSQL Database

NewSQL is a modern class of database管理系统 that aims to combine the scalability and distributed architecture of NoSQL with the ACID guarantees of traditional SQL databases. The script mentions NewSQL or distributed SQL databases as suitable for systems that require both the robust transactional integrity of ACID and the horizontal scalability to handle large volumes of data and high throughput.

💡Graph Database

A graph database is a type of NoSQL database that stores data in the form of nodes, edges, and properties to represent and traverse complex relationships. The video script explains that graph databases are particularly well-suited for scenarios where the data involves multiple types of relationships and queries require traversing these relationships, such as in social networks or recommendation systems.

💡Time Series Database

A time series database is optimized for storing and managing time-stamped data, which is often used for handling data from financial markets, weather systems, or IoT devices. The script mentions that time series databases enable fast insertion and retrieval of large volumes of timestamped data, which is essential for time-sensitive applications.

Highlights

Importance of selecting an appropriate database for distributed services due to the high cost and risk of migration.

Trade-off decisions in choosing a database, as each product is optimized for specific scenarios.

Criteria for database selection includes learning curve, product maturity, technical support, and data model/schema.

Scalability as a key criterion, assessing how well a database can scale to support increased load and throughput.

Understanding read and write access patterns to prioritize database selection.

Cost considerations in database selection, including licensing, maintenance, downtime, and hiring experienced personnel.

The necessity of choosing between ACID compliance and eventual consistency based on system requirements.

Overview of different types of databases: relational, key-value, document, column-family, graph, and time-series databases.

Advantages and limitations of relational databases, including their intuitive relational model and vertical scaling limitations.

Key-value stores' suitability for scenarios with primary key access and flexible schema requirements.

Document databases' flexibility in handling different schemas and the ability to create secondary indexes.

Column-family databases' efficiency in accessing and storing data based on columns, beneficial for aggregation queries.

Graph databases' optimization for storing and querying data with complex relationships and multi-hop queries.

Time-series databases' specialization in handling timestamped data for fast insertion and retrieval.

NewSQL or distributed SQL databases as a blend of relational databases with built-in support for sharding and scalability.

Practical scenario analysis for choosing the right database type in services like URL shortening and social networking.

The video's call-to-action for viewers to engage with the content and subscribe for more informative videos.

Transcripts

play00:00

hello everyone welcome to my YouTube channel  in this video we are going to discuss how to  

play00:05

select a database for your distributed service  nowadays you can find a large number of database  

play00:10

Technologies having such a large number of product and  technologies means having more trade-off decisions  

play00:15

to make given that each database product is  optimized for certain scenarios it rests on the  

play00:22

software Architects and Senior software Engineers  to pick appropriate database with these trade-offs  

play00:28

in mind as it relates to the requirements of the  service choosing the most appropriate database  

play00:34

based on the requirements of the service is one  of the most important architectural and system  

play00:40

design decision because later if you find that the  selected database is not appropriate for all your  

play00:47

requirements then migrating within the database is  very costly and risky process and it could take  

play00:53

a lot of time choosing an appropriate database is  not based on a simple decision tree where you can  

play00:59

just decide simply that if we have structured  data then we will use relational database if  

play01:04

we have a semi-structured or unstructured data we  will use maybe something like nosql and also it's  

play01:09

not as simple as saying that since we need ACID  requirements we will choose relational database  

play01:15

and if we don't need acid requirements we can  live with eventual consistency for example then  

play01:20

let's choose a nosql database there are multiple  different criteria that we have to actually go  

play01:27

through in parallel and we have to decide after  going through all of those criterias and we have  

play01:34

to weigh different choices of databases that  we have that how those databases will fulfill  

play01:39

all those criteria or requirements that we  have and after that we have to come up with  

play01:44

an appropriate database for our service you also  need to understand that sometimes you may be in  

play01:50

a situation where you will have to choose between  two databases and none of them is appropriate for  

play01:57

your requirements but in that case you have to see  which of the database among these two databases is  

play02:03

more appropriate as compared to the other for  your requirements before discussing different  

play02:08

types of databases now let's discuss what are  different criterias that we use to select a  

play02:12

database based on the requirements of our service  the very first criteria is ease of learning curve  

play02:18

this criteria refers to the ease with which new  developers Architects dbas and other users of  

play02:26

the database can learn and adopt the use of  a particular database most developers tend  

play02:33

to select the database which they have already  worked on in the past and have some experience  

play02:38

of or if they think they can easily learn that  database easily choosing a database based on past  

play02:45

experience is okay if it fulfills all the other  different criterias that we are going to discussed  

play02:49

now and also it fulfills all the different  important requirements of your service the second  

play02:55

criteria that we use to select a database is the  database product maturity and technical support  

play03:01

this characteristic refers to how mature the  database is what is the size of the database  

play03:06

community and how easy it is to hire someone with  that database experience it also sometimes refers  

play03:13

to how many and which programming languages the  database support we can take musician database  

play03:18

like Oracle database for example this database  has been here for more than five decades and  

play03:24

we can easily find a lot of documentation on  the internet about it there's also a lot of  

play03:29

different communities on Internet that we can find  on Facebook or other communities where people have  

play03:34

worked on the listening databases and so it's  very easy to get some help from the community  

play03:39

if we need in a relational database the third  criteria that we use to select a database is  

play03:45

based on our data model or schema of our data the  data model or schema is one of the most important  

play03:52

criteria for choosing a database this signifies  that our database is student based on how well it  

play03:58

can provide support to the software developers  to model and store the data of the service  

play04:04

data can be structured like SQL schema or  semi-structured like a Json or XML object  

play04:10

or it could be totally unstructured data light for  example some media files some images some videos  

play04:16

audios Etc the data could be relationship data  for example in in case of Twitter system design  

play04:23

we have a follower and follow a relationship or  it could be a friendship relationship in case of  

play04:29

let's suppose Facebook choosing a right database  for the light model is also a very important Key  

play04:34

System design decision the fourth criteria for  choosing database is scalability this criteria  

play04:40

refers to your degree and the ease with which  a database can be scaled to support increased  

play04:45

load and throughput while minimizing the data  access latency it means that while choosing  

play04:51

database we have to see whether the database  can be scaled horizontally vertically or both  

play04:57

this also relates to the volume of data that needs  to be stored in our service sometimes the volume  

play05:02

of data is small enough to be stored in a single  machine but then other times the volume data could  

play05:08

be so huge that it cannot be stored in a single  machine another important criteria to select a  

play05:12

database is the QD and data access pattern this  criteria refers to whether a system is read heavy  

play05:18

or write heavy and so the database is selected to  prioritize either read override or right over read  

play05:25

if needed this is not a binary choice but rather  it is more of a scale towards which direction the  

play05:32

database optimizes then another very important  criteria for choosing a database is the cost  

play05:38

sometimes you see two different databases with  almost similar capabilities in this case it  

play05:44

depends on the overall cost associated with  using each database based on which you select  

play05:49

one of the two databases this cost comprises  of several expenses for example the cost of  

play05:55

buying a database license maybe it's a one-time  license or maybe it's a subscription then there's  

play06:02

a periodic maintenance that needs to be performed  on the database using the periodic maintenance is  

play06:07

their downtime that a service need to incur or  not also what would be the cost to hire people  

play06:13

who have experience with such a database another  important criteria to select database is based on  

play06:18

whether your system need asset requirement or your  system is a base system the S8 where a stands for  

play06:25

atomicity C stands for consistency I stand for  isolation and d stand for durability is a set  

play06:33

of properties of a database transaction intend  to guarantee data validity despite errors power  

play06:40

failures and other mishaps in the context of a  database the sequence of operations that satisfies  

play06:47

the asset properties are called transactions for  example transfer of funds from One bank account to  

play06:54

another bank account this would involve multiple  sequence operations for example debating money  

play07:00

from one account and crediting money to another  account and all of these operations need to be  

play07:07

done within a single transaction to make sure that  your system State remains valid on the other hand  

play07:13

the base systems are those systems which do not  meet the acid properties it does not mean the base  

play07:20

systems lack all the different asset properties  there could be base systems that fulfill some  

play07:26

criterias of acid properties for example a  database will be considered base if it fulfills  

play07:31

properties like atomicity and isolation similarly  there could be a base database system that fulfill  

play07:38

durability and isolation but not consistency  however one thing is guaranteed for base systems  

play07:45

that they always prefer availability over strong  consistency in a cap theorem the base systems are  

play07:51

always eventually consistent as the name suggests  so now the choice of database depends on whether  

play07:57

the system require an asset compliant database or  not ideally every system wants asset properties  

play08:05

however the asset property is though very  desirable are not always possible to achieve due  

play08:11

to other non-functional requirements imposed by  the system which affect the choice of a database  

play08:17

we just discussed what are different criterias to  select a database now we will go through different  

play08:21

common types of databases that are available right  now and we will see what are their pros and cons  

play08:26

before going through an example in which we  will see how we will choose a database in a  

play08:30

particular example or scenario the very first  type of databases which are available since more  

play08:36

than five decades are relational databases  some examples of relational databases are  

play08:41

Oracle database MySQL Microsoft SQL Server Etc  relational databases are based on relational  

play08:49

model which is very intuitive and straightforward  way of representing data in the form of tables  

play08:55

in relational databases each row in a table is  a record with a unique ID called the primary  

play09:02

key The Columns of the table holds different  attributes of the data and each record usually  

play09:07

has a value for each attribute making it easy to  establish relationships between different data  

play09:13

points relational database allow flexible data  modeling the relational databases even allow a  

play09:18

modeling of key value data and graph data which  we will discuss later in our other different  

play09:22

types of databases however it does not mean that  a Json database can be used in all the scenarios  

play09:28

it depends on several factors to decide whether a  relational database is a good fit for a particular  

play09:33

scenario or not relational databases are generally  vertically scaled using large machines this is one  

play09:39

of the limitation of the relational databases  because sometimes the data in the system could  

play09:44

be so huge that it cannot be stored in a single  machine there are other scenarios as well where  

play09:49

even if the data can be stored in a single machine  the number of read and write queries that comes  

play09:54

to a database are so large that they cannot be  handled by a single machine in that case the  

play10:00

only option that we have is to Shard a relational  database but usually relational databases does not  

play10:05

support implicit sharding the sharding needs to be  done manually in that case and also usually when  

play10:12

we Shard relational database then we store the  data in the sharded relation database in a shared  

play10:18

nothing architecture which means that all the data  for a particular tenant we store in a single shot  

play10:23

we does not distribute the data for a single  tenant in all the different shards that we  

play10:28

have created for our relational database  another limitation of relation database is  

play10:33

that they have usually a rigid schema changing  the schema of oscillational data in a relational  

play10:38

database require changing the table construct  in the relation database thus changing this K  

play10:44

minus additional database is usually a painful  time consuming and thought provoking process  

play10:50

that is why in such cases we have to look at  other different database types such as nosql  

play10:55

databases that we are going to discuss now  before we go into discussion about different  

play11:00

types of nosql databases I have a question  for you guys let me know in the comment below  

play11:05

whether the evolution databases can be distributed  or not and if yes how and if not why not now  

play11:13

another type of a very common data store that  is used in different Services is a key value  

play11:18

data store a key value data store is the simplest  form of a nosql database it is similar to a hash  

play11:26

table data structure and it is primarily used  in scenarios where all access to a database  

play11:31

is through a primary key the key value to test  scores are a possible Choice when all the access  

play11:37

your data is through a primary key the relational  database constructs like joins or where or order  

play11:43

by are not required as they are not supported by  the key value data store when the scalability is  

play11:49

an important requirement of the system and since  key will address store allows easier scaling and  

play11:54

higher throughput the key value data stores can  easily be shared using consistent hashing we  

play12:00

also use key value data store in scenarios where  database schema is flexible and whatever we store  

play12:05

in the blob value is OPEC to the database and  only understood by the application it doesn't  

play12:11

mean that if the key value data store is a nosql  data store it does not support asset requirements  

play12:18

there are different keyword just so that offer  asset requirements but usually most of the time  

play12:25

they offer those asset requirements only for a  single key operation that is Operation which only  

play12:30

touches a single key also we need to understand  that the consistency in a key value does store is  

play12:35

a bit different than the consistency in ovulation  database and the consistency in a key value data  

play12:40

store is associated with the replication of a key  value data to all the nodes or replicas in a shard  

play12:47

usually the key value data stores are good fit  for following scenarios the first scenario is  

play12:53

it's about storing session information generally  every web session is unique and is to assign  

play12:59

a unique session ID applications that store the  session information on the disk or in a relational  

play13:05

database will greatly benefit from moving to a key  value data store almost every user has a unique  

play13:11

user ID and then some attributes like username  and other information all this information can  

play13:18

be stored in a key value store in order to get  access to that information we only just need  

play13:22

a unique user ID and that's it another scenario  where key value stores are good fit is a shopping  

play13:29

cart scenario e-commerce websites usually have  a shopping cart tied to a specific user in that  

play13:35

case in order to clearly see if the shopping cart  associated with a user we can use key value data  

play13:40

store to store the shopping cart information and  we can access the shopping cart using a unique ID  

play13:46

of the shopping cart that could be associated with  the user ID as well at the same time we also need  

play13:52

to understand different scenarios via the key  values data stores are not a good fit such as  

play13:57

if the application need to have relationships  between different entities and objects in the  

play14:02

data then the key value data store is not a good  fit another scenario where the keyword data store  

play14:07

are not a good fit are the scenarios where we  have to query or scan a large amount of data  

play14:13

Key weather gestures are usually optimized to  queue the data based on a primary key and now if  

play14:18

we need to run a security in which we are scanning  a large amount of data keyword resistors on in  

play14:24

that case not a good fit for such scenarios also  the key elevator store does not support searching  

play14:30

based on the value that is if you want to search  some data based on particular value in a data we  

play14:37

cannot search using a key value data store some  examples of keyword data stores include radius  

play14:44

man cache Oracle coherence Amazon Dynamo DB  Etc now the second type of a nosql database  

play14:50

is a document database the basic unit of storage  in the document database is the document itself  

play14:55

these documents are usually self-describing  hierarchical three data structures usually  

play15:01

stored in the form of either XML Json or Json Etc  nowadays the most popular format used is Json the  

play15:09

documents is stored in a document database could  be similar to each other but they do not need to  

play15:14

have reflected the same format the document  database are like key value databases which  

play15:19

is store documents in the value part of the  key value data store however we can consider  

play15:24

document databases as a super set of key value  data stores where the value is not opaque to the  

play15:31

database and database can examine the value  content which is not the case in a key value  

play15:35

data store now the question is how the document  database is different from relational database  

play15:41

we need to understand that in a relational  database all the records need to have same schema  

play15:46

and they need to have same set of attributes if  you are storing them in a single table while in  

play15:51

case of a document database different documents  can have different schema also as compared to the  

play15:57

key value test show there is one difference which  is basically in case of a document data store we  

play16:02

can create second indexes on different attributes  of the document which is not possible in case of  

play16:08

a key value data store the document databases  are good fit for the following scenarios since  

play16:14

document databases have no predefined schema and  usually understand Json documents they work well  

play16:19

in content Management systems or applications  for publishing websites managing user comments  

play16:25

user registration profiles and other web facing  documents e-commerce applications often need  

play16:31

to have flexible schema for products and orders  as well as ability to resolve their data models  

play16:36

without expensive database refracting or database  migration document database are suitable for such  

play16:42

scenarios the examples of document databases  include mongodb couchdb and Azure Cosmos DB  

play16:49

by the way I have added detailed chapter in my  course the distributed systems I interview Bible  

play16:55

that you can find the link here in which I have  discussed all the different types of databases  

play17:00

via pros and cons and different scenarios  where we can use them for more information  

play17:05

nowadays another type of databases which we  use in some specific scenarios which we call  

play17:11

column family databases these databases are  sometimes also known as white column databases  

play17:17

or a big table databases these usually have  those with varying number of columns where  

play17:24

each column is a name value appear the column  family databases allows access to data based  

play17:31

on columns rather than buy those and even  sometimes they store the data by columns  

play17:36

there is a class of databases which we call column  now databases or column oriented databases this is  

play17:42

a subset of column family databases in which we  store all the data in column wise as compared to  

play17:47

row wise in case of other types of data stores  when a column family data store is also column  

play17:53

oriented data store then the benefits include more  efficient access to data when we are queuing only  

play17:59

one or few columns that is a subset of columns and  there's also more options for data compression in  

play18:05

case of a column oriented databases however the  column oriented databases are less efficient for  

play18:12

inserting new records the main difference between  a column family database and a document database  

play18:18

is in the query model they can both store similar  data structures that is you can actually install a  

play18:24

Json in a column family database as well but the  document stores usually give you query by value  

play18:29

capability whereas column family databases  usually do not another difference between a  

play18:36

document database and a column family database is  that in the document database you usually insert  

play18:41

or update the whole document which is typically  a Json whereas in column family databases like  

play18:47

Cassandra you can address individual columns  and insert and update them individually  

play18:52

so the columnar databases are good fit for  scenarios where you only need to query a subset  

play18:58

of the columns or very few columns also we use  columnar databases in scenarios where you need  

play19:04

to run aggregation queries against a large amount  of similar data on the other hand the columnar  

play19:10

databases are not suitable for scenarios such  as online transaction processing scenarios or in  

play19:17

scenarios where you have to run queries against a  small set of those some examples of column family  

play19:22

stores are Apache Edge base Cassandra and Google's  big table here I have two questions for you  

play19:30

the first question is that usually column oriented  databases are more suitable for analytic scenarios  

play19:37

why is this the case and the second question is  the column oriented databases usually provide  

play19:43

better compression why is this the case let me  know in the comments below what you guys think  

play19:49

another type of database which is getting popular  day by day is graph database the graph databases  

play19:57

represent data as graph which is nothing but  a network of nodes and edges between them  

play20:03

the edges between the nodes actually represent  relationships between those nodes in a graph  

play20:09

database the storage is optimized for storing the  graph and also is optimized for graph diversal we  

play20:17

can store a graph like data structure in a  relation database as well and also in some  

play20:23

nosql database like documentdb Etc but those  databases are not optimized for learning the  

play20:30

queries that require diversing the graph and  so the graph databases are more suitable for  

play20:36

such scenarios where we have qds which require  traversing the graph as we will see later now  

play20:42

let's discuss once again let me know in the  comments below what do you think whether we  

play20:46

can use graph database in such scenario or not  so in my one of my previous video in Twitter  

play20:50

system design which you can find the link here in  one of the max surveys we defined a user relation  

play20:57

table since in our design we are only required  to query a single hop relationship that is find  

play21:03

all the followers of a particular user and  also the relationship is of a single type  

play21:09

such data could be stored in a relational database  another option is to store this data in a nosql  

play21:16

key value database which provides a non-unique  secondary index on the following user ID  

play21:23

if storing such data in a graph database all  the possible will be an Overkill usage of  

play21:29

graph database in this case the only reason for  storing such information in the graph database  

play21:36

is that either we plan to introduce more  relationship types or we plan to introduce  

play21:41

multi-hop qds like find all the followers  of all the followers of a particular user  

play21:47

or find all the followers of our user a who are  also the followers of user B in the future so  

play21:55

that's why you need to understand that even though  you might be working with some data which require  

play22:00

graft reversing or I would say some sort of graph  traversing but using a graph database for such a  

play22:06

may not be that useful if the relationship between  all the different entities is just a single type  

play22:12

of relationship and the queries that you are  running only require a single hop relationship  

play22:17

the graph databases are usually used in following  scenarios the first scenario where the gravity  

play22:24

bases are being used a lot is flawed detection  in payment systems such as stripe with a graph  

play22:30

database you can process purchase and financial  transactions in almost real time which means  

play22:37

you can prevent fraud for example with a graph  database you can easily tag whether a particular  

play22:42

email address and credit card are related to other  Fallen charges or not you can also differentiate  

play22:49

accounts via a single email address is associated  with multiple people you can also find scenario we  

play22:55

have various people are associated with a single  IP address even though they have multiple physical  

play22:59

addresses in different accounts another usage  of graph database is in recommendation systems  

play23:04

for example in Facebook the graph database can be  used to store information about various users and  

play23:11

their friends and their interest in a social graph  this graph can be traversed to suggest friends and  

play23:17

other services to a user the other social network  and machine learning and AI scenarios where the  

play23:23

graph databases can be used which I have covered  in more detail in the course one thing to remember  

play23:28

is that the glav databases are not suitable  for scenarios when the queries don't include a  

play23:35

specific starting point for example if you need  to run frequent table scans then grav databases  

play23:40

are not suitable for such scenarios also when  you need to look up a single data just by a key  

play23:45

in those cases graph databases are not suitable  and you may be well off with using a key evaluator  

play23:52

store in that case also the graph databases are  not suitable for scenarios where you have to store  

play23:57

a large chunk of information and such as blobs  and media files Etc with every node let me know  

play24:05

in the comments below why is this the case why the  graph database is not suitable for such scenarios  

play24:11

then there is another class of databases which we  called time series databases these databases are  

play24:18

optimized for storing and serving time series  data a Time series data is some data which is  

play24:24

associated with a timestamp the time series data  is often continuous flow of data like measurement  

play24:29

form sensors in iot or intraday stock prices  a Time series database LED you store a large  

play24:37

amount of timestamped data in a format that  allows fast insertion and fast retrieval to  

play24:44

support complex analysis on time series data here  we need to understand that relational databases  

play24:50

can also be used to restore time series data but  decision database are not optimized for common  

play24:56

time series workloads we have discussed in Greater  detail how a Time series database store time sees  

play25:02

data in a format which enables quicker table of  timestamp data as compared to other traditional  

play25:07

databases in our course then lastly is a new  class of database basis which we call new SQL  

play25:13

or distributed SQL databases you can consider  new SQL databases as relational databases with  

play25:19

implicit support for sharding and scalability and  these databases are more suitable for financial  

play25:24

systems and other systems like for example billing  service in Amazon or Apple App Store where asset  

play25:32

requirement is a must but also the scale of data  is so big to be stored in a single machine all the  

play25:38

throughput required is so big to be handled by  a single machine now let's consider a scenario  

play25:44

you must have seen my previous video on designing  tiny URL service which you can find the link here  

play25:50

in that service in one of the micro service we  come up with this schema where we want to store  

play25:57

a URL mapping between the short URL and a long  URL now the question is what type of database  

play26:03

do we need to store this information so the very  first thing is that we will always start with a  

play26:09

relational database and check whether we can store  this information in a relational database or not  

play26:13

and in that case we have to first decide whether  the data is structured data is semi-structured or  

play26:21

is totally unstructured data and it's totally  possible we could say that this could be a  

play26:27

semi-structured data what we could also do is that  we can store all this information in a single blob  

play26:34

field or glob field in a relation database so our  relational database could be of the format where  

play26:40

it has only two columns one is the ID column and  the other is a blob or glob field where we store  

play26:48

the rest of the information and that could be  a unstructured information form the perspective  

play26:54

of a database and we use deletion database for  this purpose however now we have to see how much  

play26:59

data are we going to store let's say in next six  months or a year or two years which means that  

play27:05

we need to figure out what would be the growth  of data that we are expecting in the system and  

play27:10

then the second thing is what would be the query  load what be the rate of flights what would be  

play27:16

the rate of reads that would be happening and  can we use a solution database in that case or  

play27:22

not and of course since it's a tiny URL service  we know that it's a read heavy system the number  

play27:29

of lights will be way less as compared to number  of leads in the system we will only perform the  

play27:36

rights when we want to actually insert a new short  URL to Big URL mapping in this table for example  

play27:42

we could also maybe think about other things like  if we can still use resolution database so in that  

play27:48

case how to actually support scalability we could  do some other things for example we can still use  

play27:54

ovulation database in our service but we can  actually put a cache in front of our relation  

play28:00

database to support heavy read so now it all  depends on the types of requirements what are the  

play28:07

different requirements we have in our system that  can be fulfilled by a particular type of database  

play28:13

or not now let's just take an example that we  also want to introduce something like a count here  

play28:19

how do you say integer and let's suppose this  is the count of z-axis to this database and  

play28:27

suppose this is now a requirement which doesn't  even make sense basically but let's suppose you  

play28:31

want to also see how many times a particular  short URL for successed so every time now the  

play28:37

read happen and someone tries to get the big  URL using a short URL you also incrementing  

play28:43

this count initially the system was a seat heavy  system but now since with every read you are also  

play28:50

updating discount so now this system becomes a  wide heavy system basically and now you have to  

play28:56

think what would be the best database in that case  whether it's a relational database or it is a key  

play29:03

value data store that you would use or you will  use a document data store or a column valuator  

play29:09

store if you use a little database here or even  if a document database whenever we update this  

play29:16

count what we will be doing we will be updating  the whole record itself in that case it would be  

play29:22

way better if we use a column oriented database  because in a column oriented databases you see  

play29:29

access and store based on columns so in that case  whenever a URL get that you can just go and update  

play29:35

the count of it without touching the rest of the  data now let's take another example suppose we  

play29:41

have a graph of different objects and entities and  their relationships the objects or the nodes are  

play29:48

different types of nodes we'd have we have people  in the nodes we also have like things like banana  

play29:53

and we may also have cities like Berlin similarly  that the types of relationships are also different  

play29:58

we have translation shape we have a follow-up  relationship we have life relationship we have  

play30:04

visit relationship Etc in this scenario where you  know we have we have different type of entities  

play30:09

and they have different types of relationships  between them the graph databases are the most  

play30:14

suitable choices of course you can store this  information in a relation database as well as  

play30:19

the document data store as well where you can  actually install this follow likes visits all  

play30:26

different types of relationships in a relationship  column and maybe it might work with some simple  

play30:31

queries like if you want to say give me all the  users which Jason likes for example which is a  

play30:37

single hobby it may work in a in a database if you  want to run some complex ways like give me all the  

play30:44

followers of Json friends then graph databases are  best fit in order to run such type of queries so  

play30:51

let's summarize again what we have discussed in  this video so the first thing we discuss is that  

play30:55

we can't just come up with a simplified decision  tree based on whether we need a structured data  

play31:02

or semi structure or structured whether we need  acid on acid Properties or not and then based on  

play31:07

that we can just say we can just use relational  or a particular type of nosql database usually  

play31:13

in that case we have to construct a complex  Matrix actually or a table to determine which  

play31:18

database is more suitable for our scenarios people  usually think that we can't use relation databases  

play31:23

for unstructured data but that's not the case as  long as you can actually access your unrestricted  

play31:30

data using an ID you can define a table with  just an ID column and a blob or Club column  

play31:36

and you're going to store all your unstructured  data in that clock column in a position database  

play31:41

so this is a long assumption that you can't  use relational databases for unstructured data  

play31:47

you can but basically you have to look at other  things so Vision databases are I would say are  

play31:54

always the preferable choice you should always  start with a digital database because for example  

play31:58

if you are working on a product catalog service in  Amazon for example where you keep tag of all the  

play32:04

different products now different Porters will  have different attributes or baby infant milk  

play32:08

will definitely have different attributes than a  TV or some other electronic devices so now if a  

play32:14

data is semi-structured what is the best place  to store it whether it's relation database or  

play32:18

not usually here one other thing is that if the  data is unstructured data like a media file a  

play32:25

video file then usually we store them in object  storage service and we just store the URI for  

play32:32

those objects in our databases we usually don't  install big you know blobs in our database because  

play32:39

every time you update an attribute you will also  be writing that Big Blob of data as well and which  

play32:44

is usually banned for the database performance we  see the relation databases you can just short them  

play32:50

as well but then you have to share the relation  database manually and usually in that case we  

play32:55

use a shared nothing architecture to Shard those  solution databases where we put all the tenant  

play33:00

information in a single shot then we have a key  value data stores keyword response are like simple  

play33:06

hash table data structures where you have a key  and then a value value is a blob in a key value  

play33:11

data store the value is opaque to the database the  database does not look into the the values and and  

play33:18

so in that case if you need to also run queries  in which you have to look into the OPAC value in  

play33:23

a keyword test tool then the obvious choice is a  document data store because a document data store  

play33:29

is is like a superset of a key value data Store  where the database also can examine the internal  

play33:37

data of a value itself and so you can create  secondary indexes on those attributes in the value  

play33:44

part in the documentdb then we also discussed  there are certain scenarios where we only have  

play33:49

to access maybe for read or write only a small  subset of columns and for those scenarios column  

play33:54

Orient databases are a good fit basically because  you don't have to read or write the whole record  

play34:00

or the whole row and you can only access and only  update or write the columns you are interested in  

play34:06

then we see the AV also have a graph  database and there are scenarios via  

play34:11

the glove databases are the best choices  where these are the cases where you know  

play34:16

you are working with data which have multiple  different types of relationships among them  

play34:21

and also you want to run queue this which  require multiple hops not a single hop but  

play34:26

multiple Hops and so in that case the graph  databases are more suitable choices to be used  

play34:31

if you want to know more about these things please  look into my course there's a chapter where you  

play34:36

will find more details about it in which I have  discussed in detail how to select a database in  

play34:41

the end again thanks a lot if you are still here  with me and I will really appreciate if you find  

play34:45

this video useful then please do like this video  and also do subscribe to this channel if you  

play34:51

haven't subscribed yet I will be adding more  videos in future soon thank you and take care

Rate This

5.0 / 5 (0 votes)

Related Tags
Database SelectionDistributed SystemsSoftware ArchitectureRelational DatabasesNoSQL DatabasesData ModelingScalabilityACID PropertiesCAP TheoremSystem Design