Types of Databases: Relational vs. Columnar vs. Document vs. Graph vs. Vector vs. Key-value & more
Summary
TLDRThis script offers an insightful overview of various database types, their use cases, and examples of both open-source and cloud-managed options. It explains relational databases for structured data, columnar databases for big data analytics, document databases for flexible schema, graph databases for complex relationships, time-series databases for chronological data, vector databases for AI and ML, and key-value stores for rapid data retrieval. The video script emphasizes the importance of selecting the right database for optimal performance and cost-effectiveness.
Takeaways
- 🗃️ Relational databases organize data in tables with rows and columns, using primary and foreign keys to link data across tables.
- 🔍 SQL is the query language used for relational databases, allowing for complex joins and data retrieval.
- 🔄 ACID transactions in relational databases ensure that all changes are treated as a single operation, maintaining data integrity.
- 📊 Columnar databases are optimized for big data scenarios, reading data by column rather than row to improve query performance.
- 📚 Document databases, like MongoDB, store data in JSON-like documents, offering flexible schemas and ease of development.
- 📈 Graph databases excel at representing and querying complex relationships between data points, using nodes and edges.
- 📈 Vector databases store data as high-dimensional vectors, enabling similarity searches in AI and ML applications.
- 🔑 Key-value databases, such as etcd, store data in key-value pairs, providing fast data access and horizontal scalability.
- ⏱ Time series databases are specialized for handling time-stamped data, with efficient storage and retrieval for analytics.
- 🛠️ Different database types serve different purposes, and choosing the right one can significantly impact application performance and cost.
- 🔬 The script provides a comprehensive overview of various database types, their use cases, and examples of open-source and cloud-managed databases.
Q & A
What is a relational database and how was it originally developed?
-A relational database is a collection of spreadsheet files that help businesses organize, manage, and relate data. It was originally developed by IBM in the 1970s.
How are data organized in a relational database model?
-In the relational database model, data is organized into tables that store information as columns (attributes) and rows (records or tuples). Columns specify data types, and each record contains values corresponding to those data types.
What is a primary key in a relational database?
-A primary key in a relational database is a unique identifier for each row, ensuring that each record can be distinctly recognized within the table.
Can you explain the concept of 'foreign key' in relational databases?
-A foreign key in a relational database refers to the primary key in another table, allowing rows from different tables to be linked together.
What is an ACID transaction in the context of relational databases?
-ACID transactions in relational databases mean that all changes to data are performed as if they are a single operation. If at least one task fails, the whole transaction is rolled back to maintain data integrity.
What are some examples of relational databases?
-Examples of relational databases include MySQL, PostgreSQL, MariaDB, Microsoft SQL Server, and Oracle Database.
Why were columnar databases developed and how do they differ from traditional databases?
-Columnar databases were developed to improve query performance with big data. They differ from traditional databases by reading data from top to bottom and only reading the columns that are needed for a query, which can be more efficient with large datasets.
What is a document database and how does it differ from a relational database?
-A document database is used to store and query data in JSON-like documents. It differs from a relational database by allowing for flexible, semi-structured, and hierarchical data storage without a predefined schema.
How does a graph database represent data and relationships?
-A graph database represents data as nodes, which are records, and relationships as edges connecting these nodes. Relationships can have a direction and properties associated with them, allowing for complex queries about connections between data points.
What is a vector database and how does it enable unique search experiences?
-A vector database stores data as high-dimensional vectors, where each vector represents an asset with a certain number of dimensions. It enables unique search experiences by finding similar assets through vector search methods, such as image or document similarity searches.
What are some advantages of key-value databases?
-Key-value databases offer advantages such as scalability, as they can distribute data across servers; ease of use, as they follow the object-oriented paradigm; and performance, as they do not require resource-intensive table joins like relational databases.
What is a time series database and what type of data is it optimized for?
-A time series database is optimized for time-stamped or time series data, which includes measurements or events tracked and monitored over time, such as server metrics, sensor data, or financial trades.
Outlines
🗃️ Database Selection and Relational Databases
This paragraph introduces the importance of selecting the right database for an application's performance and outlines the use cases for different database types. It focuses on relational databases, which are structured collections of data in table format, with each table having a primary key and the ability to link tables through foreign keys. The paragraph explains how relational databases use SQL for querying and support ACID transactions for reliable data operations. Examples of popular relational databases like MySQL, PostgreSQL, and Oracle are provided, along with cloud-managed options such as Amazon Aurora.
📊 Columnar Databases for Big Data Performance
The second paragraph delves into columnar databases, which are optimized for handling large datasets by storing data column-wise, allowing for faster querying of specific columns. It contrasts the traditional row-wise data retrieval with the columnar approach, demonstrating how the latter can significantly speed up data retrieval for large datasets. The paragraph also touches on the use of algorithms in columnar databases to simplify data retrieval and mentions Apache Cassandra as a widely adopted open-source columnar database.
📄 Document Databases for Flexible Data Storage
This section discusses document databases, which are ideal for storing and managing semi-structured data with varying attributes, such as e-commerce products. Document databases allow for the storage of each product with its attributes in a single document, enabling easy updates and flexible schema design. The paragraph explains the JSON-like structure of document databases and their operations, such as create, read, update, and delete. It highlights the ease of development with document databases, their flexible schema, and performance at scale, mentioning MongoDB as a popular open-source document database.
🔍 Exploring Graph and Vector Databases for Complex Relationships and AI
The fourth paragraph introduces graph databases, which excel at representing and querying complex relationships between data points, such as social connections or academic lineage. It contrasts graph databases with relational databases, emphasizing the former's ability to handle complex queries involving distant relationships. The paragraph also introduces vector databases, which store data as high-dimensional vectors for applications like image or document similarity searches. It mentions the use of vector search methods in AI and machine learning scenarios and suggests ArangoDB as an open-source graph database and milvus for vector databases.
🔑 Key-Value Databases for Fast Data Access
This section covers key-value databases, which store data as pairs where keys are unique identifiers and values can be of any type. It highlights the scalability, ease of use, and performance advantages of key-value databases, which do not require complex joins like relational databases. The paragraph discusses use cases such as session management, shopping cart storage, metadata storage, and caching, and recommends etcd as a reliable key-value database used by Kubernetes for cluster data storage.
🕒 Time Series Databases for Sequential Data Tracking
The final paragraph focuses on time series databases, designed for handling time-stamped data such as server metrics, sensor readings, and market trades. It explains how these databases optimize data retrieval and provide a cost-effective storage solution for sequential data. The paragraph mentions InfluxDB as a general-purpose metrics database and Prometheus for collecting time series data from infrastructure, emphasizing their importance in analytics and monitoring.
Mindmap
Keywords
💡Relational Database
💡Primary Key
💡Foreign Key
💡SQL
💡ACID Transactions
💡Columnar Database
💡Document Database
💡JSON
💡Graph Database
💡Key-Value Database
💡Time Series Database
Highlights
Performance of applications often depends on the selected database, emphasizing the importance of choosing the right database for specific tasks.
Relational databases, developed by IBM in the 1970s, are structured as tables with columns and rows, and use primary and foreign keys for data linkage.
SQL is used for querying relational databases, and ACID transactions ensure data integrity by rolling back changes if any part of a transaction fails.
Examples of relational databases include MySQL, PostgreSQL, and Oracle Database, with cloud-managed options like Amazon Aurora and Azure SQL.
Columnar databases improve query performance with big data by reading data column-wise rather than row-wise, reducing the amount of data processed.
Apache Cassandra is a popular open-source columnar database used for highly aggregated analysis in large enterprises.
Document databases, such as MongoDB, are ideal for e-commerce applications with varying product attributes, offering flexibility and ease of development.
JSON documents in document databases allow for flexible, semi-structured data storage with key-value pairs, arrays, and nested objects.
Graph databases excel at querying complex relationships and are useful for data science communities to test inferences, with ArangoDB as a suggested open-source option.
Vector databases store data as high-dimensional vectors, enabling the search for similar assets based on encoded data, with Milvus as an example of an open-source vector database.
Key-value databases offer scalability, ease of use, and high performance, with etcd being a reliable option used by Kubernetes for storing cluster data.
Time series databases are optimized for time-stamped data, providing fast retrieval and cost-effective storage, with InfluxDB and Prometheus as notable examples.
Each database type has specific use cases and advantages, from relational databases for structured data to key-value databases for fast data access.
The choice of database can significantly impact application performance, with the wrong choice potentially leading to slow queries and high costs.
The video provides an educational resource for learning about different database types and their practical applications in various scenarios.
Understanding the unique features and use cases of each database type is crucial for developers to make informed decisions for their applications.
The video concludes with a call to action for viewers to explore further resources on relational databases and SQL joins for a deeper understanding.
Transcripts
Performance of your application frequently depends
on the database you select for a specific task. As a developer,
you may need to choose the appropriate database based on your use case and requirements.
In this video, we'll go through the most common database types used in
enterprises. I'll provide use cases when to choose a specific database,
as well as examples of open-source and cloud-managed databases.
Let's start with the relational database. Originally, it was developed by IBM in the
1970s. You can think of a relational database as a collection of spreadsheet files that help
businesses organize, manage, and relate data. In the relational database model, each 'spreadsheet'
is a table that stores information, represented as columns (attributes) and rows (records or
tuples). Now, columns specify data types, and each record (or row) contains values corresponding to
those specific data types. For example, in the customer table, you may have an integer type
for the ID column and varchar for the name and address. Also, in a relational database,
each table has a unique identifier for each row called the "primary key." Rows from different
tables can be linked using a "foreign key," which refers to the primary key in another table. Let’s
take a look at how the relational database model works in practice. Let's say we have
a website and track every single action that a user may do. A user can click on a button,
add an item to the shopping cart, and things like that. Now, in a very simplified version, we would
have a customer table and an event table. Each time a customer does something on the website,
we would store it in the event table and use a foreign key to link this event to the customer.
Also, we can use SQL to query our database. For example, you can use the following query
to get all the customers you have. Now, if you want to get all customers who are active,
meaning they have performed at least one action, you can use the following join. I have another
video explaining all the different joins as well. Another important feature provided by
most relational databases is ACID transactions. This means all changes to data are performed as
if they are a single operation. If at least one task fails, the whole transaction is rolled back.
For example, consider a transfer of funds from one bank account to another. First, you need to check
if there is enough money to transfer, then you would debit one account and credit another. So,
if at least one step fails, you want to roll back the entire transaction. Now, you would
use a relational database for structured data and ACID compliance. Structured data means that
you have rows and columns that clearly define data attributes. Some examples of relational databases
are MySQL, PostgreSQL, MariaDB, Microsoft SQL Server, and Oracle Database. There are also
some cloud-managed databases available, such as Amazon Aurora, Azure SQL, and many others.
Now, traditional databases do not scale very well. If you have a lot of data, like billions
of records, traditional databases can be slow. To improve query performance with big data,
columnar databases have been developed. Let’s go through the simple exercise, which will help
you see the power of using a columnar database for certain tasks. To illustrate this exercise,
I’m going to use a spreadsheet populated with some customer data. First of all,
let's see how a traditional database works. Before we start, there is a rule that we need to follow:
we have to read the data from left to right, starting from the beginning of each row,
and proceed from row to row, kind of like reading a book. So, let’s start. Let's ask the traditional
database to return all paying customers. Here it goes: as you read the data, notice that you're
retrieving the Name and Tier on each pass of every row, and eventually, you'll acquire all the Names
and their associated Tiers. Now, servers do this very quickly; however, what if my request involves
several billion rows? You can start to see that reading each row just to grab the Name and Tier
columns could take a while. And this is where a columnar database comes in handy. We’re going
to redo the exercise one more time, but instead of using traditional database rules, I’m going
to give you columnar database rules. This means you’re going to read the data from top to bottom,
and you’re only going to read the columns that I ask you to. Now, let's ask the columnar database
the same question: to return all paying customers. Let’s start. Notice that this method skips all
the data that isn’t related to what we’re looking for. Once we’ve finished getting that information,
we start with the next column, skipping the columns that we don’t need. But wait a second,
how does the computer know which Tier to assign to each Customer name it’s retrieving?
The way it does this is by having the Columnar Database assign a number to each row of data,
allowing it to quickly pair up the different columns it retrieves. You can see how this really
comes in handy when you start reading the Name column. All it needs to know is the number where
the Name values ended, in this case, row 10. This means I just need the Tier rows from 1 to 10. Now,
you've probably noticed how the columnar database was repeating 'basic' over and
over. This is actually another advantage of a columnar database. By using the numbering system,
columnar databases can utilize algorithms to simplify data retrieval. And you’ll find
that each columnar database system has highly sophisticated methods for achieving further
performance enhancements. So, compared to traditional databases, a columnar database
does not actually read billions of rows of data. Instead, it just reads a few long columns. This
can be a very straightforward way of addressing the large amounts of data that corporations have
to deal with.. They're not replacements for traditional databases, but they are certainly
a powerful way of performing highly aggregated analysis. I think Apache Cassandra is one of the
most popular open-source columnar databases that is widely adopted by many Fortune 500 companies.
Now, both relational and columnar databases require you to define a schema before you can
store anything. If your requirements change over time, you may need to add additional columns or
perhaps change the data type of one of them. What if you have an e-commerce application
with thousands of products? Different products usually have a different number of attributes.
Managing thousands of attributes in relational databases is inefficient and can slow down your
database. This is an ideal use case for the document database. You can store a product
with all its attributes in a single document. It simplifies management and improves the reading
speed. And if you change the attribute of one product, it won’t affect others. So,
what is a document database? Well, it’s a type of database used to store and query data in
JSON-like documents. JSON is a data format that is both human and machine-readable. Developers
can use JSON documents in their code and save them directly into the document database. The flexible,
semi-structured, and hierarchical nature of documents and document databases allows them
to evolve with the needs of applications. Let's take a look at the structure of JSON
documents. JSON represents data in three ways. The first is key-value pairs. These
pairs are recorded within curly braces. The key is a string, and the value can be any data type,
such as an integer, decimal, or boolean. For example, a simple key-value pair is {"year":
2024}. Next is the array. An array is an ordered collection of values defined within left ([) and
right (]) brackets. Items in the array are comma-separated. For example, {"fruit": ["apple",
“grapes”]}. And finally, objects. An object is a collection of key-value pairs. Essentially,
JSON documents allow developers to embed objects and create nested pairs. For example, {"address":
{"country": "USA", "state": “Texas”}}. Now, here is an example of a JSON-like document that
describes a film dataset. You can see that the JSON document holds simple values, arrays, and
objects quite flexibly. You can even have an array with JSON objects within it. Document-oriented
databases allow you to create an unlimited-level hierarchy of embedded JSON objects. It's entirely
up to you what schema you want to give to your document store. Now, let's talk about
the operations that you can perform on a document database. You can create, read, update, and delete
entire documents stored in the database. Document databases provide a query language or an API that
allows developers to run these operations. Okay, first, you can create documents in the
database. Each document has a unique identifier that serves as a key. Then, you can use the API
or query language to read document data. You can run queries using field values or keys. You can
also add indexes to the database to increase read performance. And finally, you can update existing
documents flexibly. You can either rewrite the entire document or update individual values. So,
what are the advantages of document databases? The first thing that comes to mind is the ease
of development. JSON documents map to objects—a common data type in most programming languages.
When building applications, developers can flexibly create and update documents directly
from the code. This means they spend less time creating data models beforehand. As a result,
application development becomes more rapid and efficient. Next, there's flexible schema. A
document-oriented database allows you to create multiple documents with different
fields within the same collection. This can be handy when storing unstructured data, like emails
or social media posts. However, some document databases offer schema validation, allowing you
to impose some restrictions on the structure. Another advantage is performance at scale.
Document databases offer built-in distribution capabilities. You can scale them horizontally
across multiple servers without impacting performance, which is also cost-efficient. Also,
document databases provide fault tolerance and high availability through built-in replication.
If you want to try it out, MongoDB is a great option. It's the most popular open-source document
database available, easy to start with, has SDKs for most languages, and has large community.
Have you ever seen a detective board in the movies, with pictures, news articles, and notes
connected by thumbtacks and yarn? Immediately, you can see the power of connecting the dots
in all of those relationships. Imagine taking that detective board and applying a mathematical engine
that could query its data relationships. Well, that is a graph database. I want to explain graph
databases by starting with relational databases. One of the main traits of a relational database
is the constraining nature of its relationships, which makes it ideal for processing transactions.
However, these strict constraints often make it too complex to answer questions about distant
relationships. Imagine you had all the university professor and student data ever collected. Then
imagine you wanted to know the relationship between a group of 10 students originating from
completely different universities. Well, at first glance, you would think that since the students
didn't go to the same universities, they really don't have a connection. But if we look at their
professors, we can discover that they all shared a common professor when they were students. So now
let's zoom in and describe some of the traits of a graph database. First, you have nodes,
which are essentially records. Connected to these nodes is a type of relationship, which can have a
direction and a property associated with it. So in our case, the direction points from the original
professor. The relationship type is 'student_of.' And the property is the year and semester where
they were taught. Now, querying this database isn't like your typical SQL query. Graph database
vendors often have their own query language, so this is something that the industry is still
working out. You need to be very careful with graph databases because they can infer connections
that don't actually mean anything. For example, imagine the inference you could make if all the
students in our previous example ended up dropping out of school. Does that mean the original
professor had some kind of meaningful impact on that bad outcome? Well, anything is possible,
but we have to be a little more skeptical of such conspiratorial patterns. So, graph
databases are usually a mechanism for starting questions but not necessarily answering them.
So, what we see is data science communities using graph databases to test inferences. The
discovery of these relationships, and their relevance to the organization,
often is what gets bubbled up into data warehouses. Thus closing the loop.
One open source graph database that I can suggest is ArangoDB.
Now, information comes in many forms. Some information is unstructured, like text documents,
pictures, videos, and audio, while some is structured, like application logs, tables,
and graphs. On the other side, we have vector databases that store data as high-dimensional
vectors. Each vector has a certain number of dimensions, ranging from tens to thousands,
depending on the complexity of the data. Now, we can apply some kind of transformations to the raw
data. We can encode all types of data into vectors that capture the meaning and context of the asset.
This allows us to find similar assets by searching for neighboring data points. Vector search methods
enable unique experiences, such as taking a photograph with your smartphone and searching for
similar images. You can also find documents that are similar to a given document based on their
topic and sentiment. And find products that are similar to a given product based on their features
and ratings. It just a few examples. You can try milvus which is open source vector database.
Now, let's talk about a key-value database. It stores data as a collection of key-value pairs,
where a key serves as a unique identifier. Both keys and values can be anything,
ranging from simple objects to complex compound objects. The document database
is a special type of key-value store where keys can only be strings. Also,
when querying your document store, you can read the entire value or a part of the value,
especially if the value is another JSON object. For example, you can have {"book": {"id": 1,
"price": $10}}, then query book.price, and the database will return the value 10. Key-value
databases always return the whole value, including both ID and price information.
There are a few advantages of key-value databases. The first is scalability. Most key-value databases
can scale horizontally and automatically distribute data across servers to reduce
bottlenecks at a single server. Then there's ease of use. Key-value databases follow the
object-oriented paradigm, allowing developers to map real-world objects directly to software
objects. And Performance. Unlike relational databases, key-value databases don't have
to perform resource-intensive table joins, which makes them much faster.
Now let's talk about use cases.
First is session management. A session-oriented application, such as a web application, starts
a session when a user logs in and remains active until the user logs out or the session times out.
You can use it for a shopping cart. An e-commerce website may receive billions
of orders per second during the holiday shopping season, so you need a high-performance database.
You can also use it for a metadata storage engine. Your key-value
store can act as an underlying storage layer for higher levels of data access.
And caching: You can use a key-value database to store data temporarily for faster retrieval.
If you need a reliable key-value database, you can take a look at etcd. Kubernetes
uses etcd to store all its cluster data, such as pods, deployments, and services.
And finally, let's talk about time series databases. As the name suggests, it's optimized
for time-stamped or time series data. Time series data are simply measurements or events that are
tracked, monitored, downsampled, and aggregated over time. This could include server metrics,
application performance monitoring, network data, sensor data, events, clicks, trades in a market,
and many other types of analytics data. It's used over traditional database types
because it enables fast data retrieval and is a cost-effective storage solution for that type of
data. One example of general-purpose metrics database is InfluxDB. We also frequently use
Prometheus to collect time series data from infrastructure. Prometheus is also a time
series database with additional features to query targets such as VMs and Kubernetes pods.
Each database type has its specialty: Relational for structured data and ACID compliance,
Columnar for analytics, Document for unstructured data flexibility,
Graph for complex relationships, Time-Series for time-stamped data,
Vector for AI and ML scenarios, and Key-Value for simple, fast data access.
Using the right database type can be a game-changer for performance, and the wrong one
can slow down your queries and be very expensive. If you want to learn about relational databases
and SQL joins, you can watch this video. Thank you for watching, and I’ll see you in the next lesson.
Посмотреть больше похожих видео
Which Database Model to Choose?
Learn What is Database | Types of Database | DBMS
Relational vs. Non-Relational Databases
SQL vs NoSQL in 2024 Make the Right Choice (Difference Explained)
Types of Databases | Criteria to choose the best database in the System Design Interview
How do NoSQL databases work? Simply Explained!
5.0 / 5 (0 votes)