Choosing a Database for Systems Design: All you need to know in one video
Summary
TLDR: This video offers an in-depth exploration of databases, tailored for systems design interviews. The host begins with a review of database indices, comparing LSM trees and SS tables with B trees, highlighting their respective strengths in read and write operations. The discussion then delves into replication strategies, contrasting single leader, multi-leader, and leaderless models. Various databases are dissected, including SQL and NoSQL types like MongoDB, Cassandra, and Riak, each with their use cases and trade-offs. Technologies like Memcache, Redis, and graph databases like Neo4j are also touched upon, with special mentions of time series databases for data ingestion and VoltDB and Spanner for high-performance SQL needs. The video serves as a comprehensive guide for those preparing for technical interviews or looking to broaden their database knowledge.
Takeaways
- 👋 The video is a comprehensive guide on databases, a topic the creator has been promising to cover since their first episode.
- 👚 The creator shows support for women in tech by wearing a 'Women at Amazon' shirt, encouraging women to participate in the tech industry.
- 📈 The script includes an in-depth technical discussion about databases, aiming to go more in-depth than other videos on the platform.
- 🔍 The importance of database indices is highlighted, with a comparison between LSM trees and SS tables versus B trees, emphasizing their impact on read and write speeds.
- 📚 A review of database concepts is provided for viewers who may not be familiar with the topic, including explanations of database indices and their types.
- 🗣️ The video covers different types of replication strategies in databases, such as single leader, multi-leader, and leaderless replication, and their respective pros and cons.
- 📊 The script discusses various types of databases, including SQL, NoSQL, and specific examples like MongoDB, Cassandra, Riak, HBase, Redis, and Neo4j, detailing their use cases and features.
- 🛠️ The video provides insights into when to use SQL databases, suggesting they are best for scenarios where data correctness is more critical than speed.
- 📈 The advantages of NoSQL databases like MongoDB and Cassandra are explained, particularly in scenarios requiring high write throughput and flexibility in data modeling.
- 🕒 The video also touches on specialized databases like time series databases and graph databases, highlighting their unique use cases and benefits.
- 🏢 Honorable mentions are given to NewSQL databases like VoltDB and Spanner, which offer innovative approaches to traditional SQL databases with enhanced performance.
Q & A
What is the main focus of the video?
-The video focuses on explaining different types of databases, their features, and use cases, particularly in the context of system design interviews.
Why might the presenter be wearing a 'women at Amazon' shirt?
-The presenter is showing support for women in the tech industry, possibly in relation to diversity and inclusion initiatives or events.
What are the two predominant types of database indices discussed in the video?
-The two predominant types of database indices discussed are the LSM tree and SS table combination, and the B tree.
How do LSM trees and SS tables work together in a database?
-LSM trees are in-memory balanced binary search trees that store keys and their corresponding values. When the tree grows too large, it gets flushed to an SS table on disk, which is a sorted list of keys and values. Reads first check the in-memory LSM tree, then proceed through SS tables from most recent to least recent.
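The write and read path described in this answer can be sketched in a few lines of Python. This is a toy model, not any real engine's API; the class name `ToyLSM`, the flush threshold, and the list-of-pairs "SSTables" are all illustrative stand-ins:

```python
# Toy LSM-style store: an in-memory memtable plus a stack of
# immutable, sorted "SSTables" (here just sorted lists of pairs).
FLUSH_THRESHOLD = 4  # illustrative; real engines flush at megabytes

class ToyLSM:
    def __init__(self):
        self.memtable = {}      # stand-in for the in-memory balanced tree
        self.sstables = []      # most recent table last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= FLUSH_THRESHOLD:
            # Flush: write the memtable out as a sorted, immutable table.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # 1. Check the memtable first.
        if key in self.memtable:
            return self.memtable[key]
        # 2. Then scan SSTables from most recent to least recent,
        #    so newer writes shadow older ones.
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

Note that keys are never overwritten in place: a newer `put` simply lands in a newer memtable or SSTable, and the read order makes it win.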
What is the advantage of using B trees for databases?
-B trees, which are implemented completely on disk, allow for faster reads because they enable direct access to the location of a key by following the tree structure on disk, without having to iterate through multiple SS table files.
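The "follow the tree structure" lookup can be sketched as follows. This is a minimal in-memory stand-in for on-disk pages, not a full B-tree (no splits, fixed fanout, no page layout); `Page` and the example tree are illustrative:

```python
import bisect

# Toy B-tree-style lookup: each "page" holds sorted keys and, for
# interior pages, child pointers; we descend until we hit a leaf.
class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted separator or leaf keys
        self.children = children    # None for leaf pages
        self.values = values        # leaf pages map keys -> values

def search(page, key):
    while page.children is not None:
        # Follow the pointer whose key range contains `key`.
        idx = bisect.bisect_right(page.keys, key)
        page = page.children[idx]
    if key in page.keys:
        return page.values[page.keys.index(key)]
    return None

# Example: a two-level tree. The root's separator key "d" sends
# lookups for keys >= "d" to the right leaf.
left = Page(["a", "b"], values=[1, 2])
right = Page(["d", "e"], values=[3, 4])
root = Page(["d"], children=[left, right])
```

Each step of the descent is one page read, which is why the number of disk seeks stays small even for large trees.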
What are the different types of replication strategies mentioned in the video?
-The video mentions single leader replication, multi-leader replication, and leaderless replication as the different types of replication strategies.
What is the trade-off between single leader and multi-leader replication?
-Single leader replication ensures no write conflicts but has lower write throughput because all writes go through one master node. Multi-leader replication can have higher write throughput but increases the likelihood of write conflicts.
Why are SQL databases recommended for use cases where correctness is more important than speed?
-SQL databases are recommended for correctness-focused scenarios due to their support for transactions, ACID properties, and the use of B trees, which ensure data integrity even if it slows down the database performance.
What is MongoDB's data model, and how does it differ from SQL databases?
-MongoDB uses a document-based data model, where data is stored in documents with potential nesting of documents, unlike SQL's relational and normalized data model that uses rows and joins to reference data across tables.
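The contrast between the two data models can be made concrete with plain Python data. The table and field names below are illustrative, not MongoDB's API:

```python
# Relational shape: authors and books live in separate tables and are
# joined by id (table and column names here are illustrative).
authors_table = [{"author_id": 1, "name": "Ann"}]
books_table = [
    {"book_id": 10, "author_id": 1, "title": "Ducks"},
    {"book_id": 11, "author_id": 1, "title": "Geese"},
]

def books_for(author_id):
    # The "join": scan the books table for matching foreign keys.
    return [b["title"] for b in books_table if b["author_id"] == author_id]

# Document shape: the same data denormalized into one nested document,
# so a single read fetches the author and all of their books together.
author_doc = {
    "name": "Ann",
    "books": [{"title": "Ducks"}, {"title": "Geese"}],
}
```

The document shape gives data locality (one fetch, no join) at the cost of denormalization: if book data is also embedded elsewhere, the copies can drift out of sync.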
How does Cassandra handle write conflicts in a multi-leader or leaderless replication setup?
-Cassandra handles write conflicts using a 'last write wins' strategy, which may not be ideal as it can lead to some writes being overwritten based on timestamp.
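Last write wins is simple enough to express in one function. This is a sketch of the strategy itself, not Cassandra's internal implementation:

```python
# Last-write-wins resolution: given conflicting versions of the same
# key from different replicas, keep the one with the highest timestamp.
# Writes with lower timestamps are silently discarded ("clobbered").

def last_write_wins(versions):
    # versions: list of (timestamp, value) pairs from different replicas
    return max(versions, key=lambda tv: tv[0])[1]
```

The fragility is visible here: the outcome depends entirely on timestamps, which clocks on different machines cannot be trusted to agree on.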
What is special about the storage model of Apache HBase compared to other databases?
-Apache HBase uses a column-wise storage model instead of the traditional row-wise storage, which improves data locality and read performance when accessing entire columns of data.
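The row-wise versus column-wise distinction can be shown with the same table laid out both ways (the column names are illustrative):

```python
# Row-wise vs column-wise layout for the same table. Reading one full
# column touches scattered positions in the row layout but one
# contiguous list in the column layout (better locality on disk).
rows = [
    {"id": 1, "name": "a", "thumb": "img1"},
    {"id": 2, "name": "b", "thumb": "img2"},
    {"id": 3, "name": "c", "thumb": "img3"},
]

# Column-wise: one list per column.
columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "thumb": ["img1", "img2", "img3"],
}

# Row store: reading the whole "thumb" column visits every row.
thumbs_from_rows = [r["thumb"] for r in rows]
# Column store: the column is already stored together.
thumbs_from_columns = columns["thumb"]
```

Storing a column contiguously also makes it compress well, since values in one column tend to resemble each other.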
Why are in-memory key-value stores like Memcached and Redis not considered the best database solutions?
-Memcached and Redis are not the best database solutions because they are in-memory key-value stores without persistent storage, making them more suitable for caching and frequently accessed data rather than for long-term data storage.
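The caching use case can be sketched with a minimal in-memory TTL cache. This is a toy in the spirit of how Memcached/Redis are used, not either system's actual API; real caches also bound memory and evict entries (e.g. via LRU):

```python
import time

# Minimal TTL cache: values live in memory and expire after a fixed
# number of seconds, mirroring the "frequently accessed, disposable
# data" use case described above.
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self.store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self.store[key]  # lazily expire on read
            return None
        return value
```

Because everything lives in process memory, a crash loses the data, which is exactly why this pattern suits caches rather than systems of record.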
What is the primary use case for graph databases like Neo4j?
-Graph databases like Neo4j are primarily useful for modeling and traversing complex relationships and networks, such as social networks, recommendation systems, and geographical mapping.
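The kind of traversal a graph database optimizes for can be sketched over a plain adjacency list; the names and friendship graph below are made up for illustration:

```python
from collections import deque

# Social-network traversal: find how many friendship hops separate
# two people, via breadth-first search over an adjacency list.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}

def hops_between(start, goal):
    # Returns the minimum number of friendship edges between start
    # and goal, or None if the two people are not connected.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if person == goal:
            return depth
        for friend in friends.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return None
```

In SQL, each hop would be another self-join on an edges table; a graph database stores the edges as direct pointers so multi-hop traversals stay cheap.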
What are the characteristics of time series databases that make them efficient for handling time-stamped data?
-Time series databases are efficient for time-stamped data due to their use of LSM trees and SS tables for fast ingestion, and their ability to split data into smaller indexes that can be quickly accessed and deleted as needed.
What are some of the 'NewSQL' databases mentioned in the video, and how do they differ from traditional SQL databases?
-NewSQL databases like VoltDB aim to provide the benefits of SQL databases with improved performance. VoltDB, for example, runs entirely in-memory and uses single-threaded execution to eliminate race conditions and achieve high performance, but at the cost of scalability and memory usage.
What is the unique feature of Google's Spanner database that helps achieve strong consistency?
-Google's Spanner uses GPS clocks in its data centers to assign accurate timestamps to each write operation. These timestamps allow the database to order all writes and achieve strong consistency through linearizability, without the need for extensive locking mechanisms.
Outlines
🚀 Introduction to Databases Video
The speaker welcomes viewers to a comprehensive databases video, acknowledging the long-awaited nature of the content. They express support for women in Amazon and hint at a more technical depth than usual for such videos on YouTube. The video is described as lengthy and detailed, covering the selection of databases for system design interviews, and the importance of understanding different database types and their use cases.
📚 Reviewing Database Fundamentals
This paragraph delves into the basics of database indexing, explaining the role of indexes in speeding up reads and the trade-off with write speeds. Two predominant types of indices are discussed: LSM trees and SS tables, and B trees. The speaker provides a brief overview of how LSM trees work in-memory and how SS tables function on disk, including the process of flushing, compacting, and searching. The advantages and disadvantages of LSM trees and B trees are compared in terms of read and write performance.
🔄 Exploring Database Replication
The speaker discusses the concept of database replication, which is crucial for data retention in case of hardware or software failures. Different types of replication strategies are outlined, including single leader, multi-leader, and leaderless replication. The pros and cons of single leader replication, which offers conflict-free writes but lower throughput, are contrasted with the unlimited write throughput but potential for write conflicts in multi-leader or leaderless replication.
🗃️ SQL and NoSQL Databases Overview
The paragraph provides an overview of SQL databases, emphasizing their relational and normalized data model, which can be challenging for write operations at scale due to the need for distributed transactions. The benefits of SQL databases, such as ACID guarantees and the use of B trees for efficient reading, are highlighted. The speaker then transitions into NoSQL databases, starting with MongoDB, a document database that offers flexibility and data locality but can face issues with data synchronization and conflicts.
🌐 Cassandra and Other NoSQL Databases
The speaker introduces Cassandra as a preferred NoSQL database for system design interviews due to its wide column store model, flexible data schema, and support for multi-leader and leaderless replication. Cassandra's use of LSM tree indices for faster writes and its approach to conflict resolution using last write wins are discussed. The paragraph also mentions Riak, another NoSQL database that uses CRDTs for more sophisticated conflict resolution. The speaker corrects a mistake regarding Riak's data store type and moves on to discuss Apache HBase, highlighting its column-wise storage and performance benefits for certain use cases.
🔑 In-Memory Databases and Graph Databases
This paragraph covers in-memory databases like Memcached and Redis, which offer fast access for frequently read and written data due to their key-value store nature and lack of need for indexing. The speaker then introduces Neo4j, a graph database, explaining its efficiency for graph traversal compared to SQL databases. The unique use case for graph databases in modeling relationships, such as social networks or map data, is highlighted.
🕒 Time Series Databases and Honorable Mentions
The speaker discusses time series databases, which are optimized for fast ingestion and ordered data retrieval based on timestamps. These databases use smaller LSM tree indexes to improve read efficiency and simplify the deletion of old data. The paragraph also includes honorable mentions of NewSQL databases like VoltDB and Google's Spanner, which offer innovative approaches to SQL database performance. Additionally, data warehouses are mentioned for their use in analytics.
Keywords
💡Database Index
💡LSM Tree
💡SS Table
💡B Tree
💡Replication
💡SQL Database
💡NoSQL Database
💡MongoDB
💡Cassandra
💡Riak
💡Apache HBase
💡Redis
💡Neo4j
💡Time Series Database
💡VoltDB
💡Spanner
Highlights
The video discusses the importance of choosing the right database for a given problem in a systems design interview.
It provides an in-depth technical review of database indices, including LSM trees, SS tables, and B trees.
The presenter shares insights on the trade-offs between write and read performance in different database types.
Replication strategies such as single leader, multi-leader, and leaderless replication are explained with their pros and cons.
SQL databases are highlighted for their strong consistency and transactional capabilities, best suited for data correctness over speed.
MongoDB is introduced as a document database offering flexibility and data locality but with potential consistency issues.
Cassandra is recommended for systems requiring high write throughput and flexibility in data consistency.
Riak is presented as an alternative to Cassandra, with advanced conflict resolution using CRDTs.
Apache HBase is discussed for its columnar storage model, beneficial for fast reads on a single column of data.
Memcache and Redis are mentioned as in-memory key-value stores, ideal for frequently accessed data.
Neo4j, a graph database, is showcased for its efficiency in modeling relationships and traversing nodes.
Time series databases are highlighted for their efficiency in handling large volumes of timestamped data.
VoltDB is introduced as a NewSQL database that runs entirely in-memory with a single thread for high performance.
Google's Spanner is discussed for its use of GPS clocks to achieve strong consistency across distributed databases.
The video concludes with a mention of data warehouses, typically used for analytical purposes with SQL formats.
The presenter notes that the content overlaps with earlier videos on the channel, compiled here into a single comprehensive resource.
The video is intended to be a resource for those new to the channel or needing a refresher on database selection.
Transcripts
welcome back everyone and welcome to the
channel if this is your first time uh
you lazy in the comments section
finally got me to do it and now I'm
going to make an all-encompassing
databases video I know I've literally
been promising this one since the first
episode on my channel which uh is pretty
embarrassing that I'm only finally
getting around to it right now
additionally as you can see I've got my
women at Amazon shirt on I'm definitely
supporting the crowd so if there are any
women watching this video
I stand with you anyways let's continue
let's get into this video
um all of this information is completely
redundant as far as I'm concerned uh
with the other stuff that I've made on
my channel but I've gone ahead and
compiled it so hopefully that works and
uh additionally
you know I think I'm gonna go a little
bit more technically in depth than any
other video of this kind has ever gone
on YouTube so uh you know if you
appreciate that please follow me my
accounts are dwindling and um I'm
starting to get insecure all right
everyone this is gonna be a kind of
longer PowerPoint so sit back buckle up
get some tissues and lotion and let's
get into this so we're gonna go ahead
and choose a database you're doing a
systems design interview and your
interviewer asks the famous question
what database are you going to use in
this problem if not more than one
oftentimes it is better to use more than
one so let's go ahead and talk about how
we can think about this problem
okay like I said there's your background
let's start jumping into some databases
however before we can really do any of
this jumping into databases first we
have to review some information I
wouldn't be surprised if some of the
people watching this video have not seen
my earlier ones because you know just by
virtue of being earlier on my channel
not everyone has seen them and so I am
going to do a little bit of review
I'll make sure to put timestamps in the
video if you actually know this stuff
and then you can go ahead and skip
through to the part where I'm actually
talking about database evaluations just
really quickly a database index is super
useful for speeding up reads on a
database it does slow down writes but the
general gist is that you are sorting in
a specific order by key and this makes
it the case that when you're searching
for a specific key in the database and
the value of it you can do so much
faster than had you not used an index so
in today's databases that are on disk
there are two predominantly popular
types of database indices we are going
to be talking about both the first is
the LSM tree and the SS table
combination and the next is the B tree
so let's look at LSM trees and SS tables
the general point is this we have our
LSM tree which is basically a balanced
binary search tree in memory you can do
that with something like a red black
tree it doesn't really matter for the
purposes of this video like I said it's
review all this information is on my
channel and more depth already so if
you're confused about anything just go
watch the dedicated video anyways the
point is we have this memory tree which
basically has keys and their
corresponding values
when that tree gets too big because
there are too many keys in it it gets
flushed to an SS table on disk an SS
table is effectively just a sorted list
of keys and their values on disk and
then finally when we want to go ahead
and read a key we first check for it in
the mem table which is that LSM tree and
then if it's not in there we go through
the SS tables in order from the most
recent one to the least recent one so
we're never actually overwriting any
keys we simply just write new values for
those keys and we keep creating new SS
tables however the key here is if you
know we're running out of space we can
actually go ahead and compact our SS
tables by merging them together it's
kind of like the merge sort algorithm
where you're merging two sorted lists
and then that way we can free up some
space again the final point is that
because our SS tables are sorted we can
kind of keep a sparse index of the
location of some keys in the SS table
and that allows us to do searching
really fast in O of log n time via a
binary search so the general gist of
things here are that the pro of this
approach is that we have really fast
writes because those writes are going to
an in-memory buffer in O of log n time
again it's a balanced binary search tree
so O of log n and then the biggest con
is that for reads we potentially have to
go through many SS tables to find the
value of a key we can use things like
Bloom filters to speed this up but
ultimately it's just not as good as B
trees for reading and we'll talk about
why that is
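The sparse-index trick mentioned a moment earlier can be sketched like this (toy table; a real engine would keep roughly one sparse entry per disk block):

```python
import bisect

# A sorted SSTable with a sparse index: we only remember the position
# of every Nth key, binary-search the sparse index, then scan the
# short segment between two indexed positions.
sstable = [("a", 1), ("c", 2), ("f", 3), ("h", 4), ("k", 5), ("p", 6)]
SPARSITY = 2
sparse_index = [(sstable[i][0], i) for i in range(0, len(sstable), SPARSITY)]

def lookup(key):
    keys = [k for k, _ in sparse_index]
    # Find the last indexed key <= the target, then scan from there.
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None
    start = sparse_index[pos][1]
    for k, v in sstable[start:start + SPARSITY]:
        if k == key:
            return v
    return None
```

Because the table is sorted, the binary search plus a bounded scan stays O(log n), without indexing every single key.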
so B trees on the other hand as you can
see from the right is also a binary
search tree however it's a binary search
tree that is implemented completely on
disk basically we have pages on disk
that have a range of keys and a bunch of
pointers so that you can figure out for
a specific value of a key you basically
iterate through a few of the pointers
and then go to the proper place on disk
this is really useful for actually being
able to get to the right place to both
read and write from however for writes
because writes in an LSM tree are going
to memory that is still going to be
better for those so B trees are going to
be worse for writes and faster for reads
because we don't have to iterate through
a bunch of ss table files we could just
go directly to the location of the key
by virtue of following our tree on disk
Okay so we've touched upon indices but
now let's go ahead and get into
replication so just as a quick recap
replication is what you do when you go
ahead and duplicate the data that you
have on one database node to another
database node so that in the event of
some sort of Hardware or software
failure you can go ahead and make sure
that that data is retained so there are
a few different types of replication
that we're going to quickly cover the
first being single leader replication
where you basically have one master node
that goes ahead and writes to the others
and then you can read from any of the
nodes you have multi-leader replication
and leaderless replication where writes
can go to multiple nodes but in
multi-leader replication you might have
just a couple of leaders whereas in
leaderless replication those writes
might be sent out to every single node
in the cluster and in turn the reads
might actually come from every single
node in the cluster and you may use some
sort of like majority voting process to
decide what the accurate read is
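That majority-voting process can be sketched as a simple quorum read. This is a toy illustration of the idea only; real systems like Cassandra reconcile replicas by per-write timestamps or version vectors rather than raw vote counts:

```python
from collections import Counter

# Leaderless quorum read: ask several replicas, require at least a
# majority of the cluster to respond, and take the value a majority
# of responders agree on.
def quorum_read(replica_values, total_replicas):
    quorum = total_replicas // 2 + 1
    if len(replica_values) < quorum:
        raise RuntimeError("not enough replicas responded for a quorum")
    value, votes = Counter(replica_values).most_common(1)[0]
    if votes < quorum:
        return None  # no single value has majority agreement
    return value
```

With quorum writes configured the same way, any read quorum overlaps any write quorum, which is what lets the majority value reflect the latest acknowledged write.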
okay so to just quickly sum this up in
terms of the pros and cons of the single
leader versus multiple leader approach
basically the general gist is that with
a single leader approach you have one
master and a bunch of slaves so as a
result you never have to worry about
write conflicts but on the flip side
you're going to have lower write
throughput because all of your writes
have to go through that one master node
you can try and mitigate this with
things like partitioning or sharding but
at the end of the day you know if all
the writes do have to go to one
partition
you know the throughput is going to be
limited on the exact contrary we have
leaderless or multi-leader replication
where you know we have technically
unlimited write throughput in terms of
we could be writing to different
replicas every single time however at
the same time the more replicas that we
write to the more likely that there is
to be a write conflict and you know
there are kind of smart ways that we can
try and get about avoiding this where we
have certain writes going to the same
replicas but at the same time you know
it is good to kind of be able to know
that your data is correct and you can
avoid conflicts at all times
okay we've gone over some stuff so let's
actually go ahead and get started by
talking about some legitimate database
examples so the first one that I'm going
to start with are SQL databases so the
reason I'm not breaking SQL down into
things like MySQL and postgres and
things like that is because the truth of
the matter is the feature set is
generally similar between the SQL
databases but it's more so that there
are many different types of nosql
databases so I'm going to break those
down further but just for the sake of
getting started let's go ahead and break
down the important features of the SQL
databases so the first thing is just the
actual data model itself SQL is uh you
know kind of biased towards using
relational and normalized data what that
means is that you don't basically
duplicate data in multiple places but
rather you'll have rows and then you can
use joins in order to kind of reference
pieces of data with one another
throughout the tables what this means is
that a lot of times when you're doing
writes you may have to write to multiple
different tables and if those tables are
on multiple different nodes that can be
a problem at scale the reason being
if you want to write to two different
tables and they're on two different
computers for those to either both work
or to not work we would need something
like a two-phase commit protocol and
I've spoken about two phase commit
plenty in the past but the point is if
you plan on guaranteeing that two writes
are going to occur on two different
computers and you know if one fails then
neither occurs basically saying uh you
want a transaction over multiple
computers that is very expensive
distributed transactions are pretty
unreasonable in practice and as a result
it's going to be hard to actually go
ahead and Implement that additionally
with SQL databases we have ACID
guarantees meaning that we're actually
going to be implementing transactions
which basically says every single
transaction acts as if it were
serializable right they can't go ahead
and you know kind of read each other in
the middle of a transaction you don't
have to worry about race conditions you
don't have to worry about a transaction
partially succeeding or partially
failing it's either all going to succeed
or all going to fail and then finally
we're also using B trees so B trees um
like I mentioned versus LSM trees in
particular are better for reading in
theory but worse for writing and just to
quickly go back on my point about
transactions um kind of the issue with
transactions are that even though it's
good for the correctness of the data it
also potentially slows down the entire
database due to most of these SQL
databases using an expensive two-phase
locking scheme in order to ensure data
correctness
okay so what is the actual conclusion
about when we want to be using SQL
databases well you know based on
everything I've kind of just said here
it's really good when correctness is
more important than speed right you know
if we're ever trying to deal with
transactions where we're modifying
multiple rows in a table at once we want
to make sure that all of those things
happen or don't you know something like
a bank transaction
SQL is going to be super useful because
we can't have it be the case where you
know I get a hundred dollars and then
you never lose your hundred dollars now
as the bank I've just lost a hundred
dollars so that doesn't work another
good situation would be job scheduling
if we have like a status table and we're
you know editing multiple rows of the
status table at the same time to do
updates we want to make sure that those
are relatively in sync and that all of
those rows are either being changed
should they're not but yeah I mean like
I said SQL databases are kind of the
default in this situation and I don't
think the breakdown between the specific
types are too important because at the
end of the day they're kind of all based
on the same Technologies but you know
they may have some subtle differences in
terms of feature sets
okay so let's move on to our first nosql
database which is actually going to be
mongodb so mongodb is a document
database which basically means that as
opposed to storing data in rows now you
have these documents which are basically
just lists of items but then within that
document you can have more nested
documents so now you can store a ton of
data in these Collections and you know
if you actually wanted to go ahead and
store your data that way it can provide
you with really good data locality so
just to give an example you know with
SQL for example you could have a table
of books and a table of authors and
those might be stored on different
computers and you know your author table
might have an ID that helps you to fetch
the relevant book rows right that being
said in a document data model you know
you might have one author document but
all of those book documents will
actually be stored within the author
document which is really nice because if
you're only going to be updating those
at the same time it means that you can
do that really easily you have great
data locality at the same time though
you know say you have multiple documents
that rely on kind of the same piece of
information if one piece of information
gets updated but the other doesn't then
those documents are out of sync and
that's kind of the issue with
denormalized data additionally as far as
mongodb goes in terms of Technology
under the hood I believe they do use
transactions and also B trees so think
of it as terms of like similar
performance to SQL but you do get a
little bit more flexibility in terms of
the document data model
that being said you know a lot of people
are kind of familiar with relational
databases and that data model so that
seems to be why SQL is kind of still the
norm but documents do have their own
usage so again in terms of the systems
to sign an interview I don't think
mongodb is like particularly
groundbreaking but if you just want kind
of like the SQL technology with kind of
a more advanced way of actually storing
your data then can be a little bit
useful
okay let's move on to Cassandra I would
say this is really the nosql database of
choice that you should probably be
listing in most of your systems design
interviews and there are a few good
reasons for that for starters it is a
wide column data score which is
essentially just an Excel spreadsheet
more or less except you know other than
the required sharding column and sorting
column you can basically put any other
columns in there and that can change
from row to row which is really nice in
terms of having data flexibility but
additionally there's also going to be
multi-leader and leaderless replication
and this is configurable so what that
means is you know you can choose to use
something like quorum writes but at the
same time
you can also just choose to you know
have like one node kind of be the leader
and propagate that out elsewhere or have
just a couple nodes be the leaders but
it's not necessarily going to be a
quorum read or write so you don't always
need the majority
um that's really good because compared
to single leader replication you can
have much faster writes but at the same
time it also puts into question the
correctness of these writes if you're
going to have write conflicts Cassandra
in particular lets you use last write
wins in order to resolve write conflicts
which isn't always ideal because it
means that certain writes can be
clobbered if they have the lower time
stamp keep in mind that time stamps and
distributed systems are not reliable
unless you're using something like a GPS
clock and we'll talk about that later
with spanner
but as long as that's the case it means
that certain writes may be unfairly
clobbered and then the last useful part
about Cassandra is that they are using
an LSM tree index which means there
should be faster writes albeit slower
reads okay so what's Cassandra really
useful for um well it's really great
when you want high write throughput and
you know if you don't care as much about
the consistency it's okay if the
occasional piece of data is overwritten
or lost then Cassandra is definitely
going to be the way to go so something
where this could be really useful is
something like a Facebook messenger chat
where you know maybe your sharding key
would be like the chat ID and then from
there you would have all of your
messages ordered using timestamp as the
sort key
okay let's talk about Riak I've got a
bunch of information listed here but the
general point of Riak is this it's
basically the same as Cassandra but one
weakness of Cassandra that I mentioned
was the fact that conflict resolution
was completely done using last write
wins on the other hand Riak actually has
something called crdts in order to help
you with conflict resolution where
basically crdts are these conflict-free
replicated data types that act as ways
of kind of aggregating certain
conflicting writes and then allowing you
to use more complex logic in order to
actually resolve the conflicts this is
useful if you want to make sure that
you're not just having writes clobbered
and it can allow you to perform things
like sets and counters in a multi-leader
or leaderless replication setup that are
eventually consistent which is really
great so again kind of the same use
cases as Cassandra but it's just another
nice point to touch upon that you can
actually go ahead and you know kind of
resolve those conflicts one piece of
information that I actually failed
to mention on this slide is that I
wrote that it's a wide column data store
it's actually a key value store so
that's an error on my part
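as a concrete example of the CRDT idea just described here is a sketch of a grow-only counter (G-Counter), one of the simplest conflict-free replicated data types; the node names are illustrative and this is not Riak's actual API:

```python
# G-Counter CRDT: each replica only increments its own slot, and a
# merge takes the per-replica maximum, so concurrent increments on
# different replicas combine instead of clobbering each other.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> increments seen from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is exactly what makes the merge conflict-free.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())
```

Because merges are idempotent and order-independent, replicas can exchange state in any order and still converge, which is the eventual consistency property mentioned above.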
okay
next let's talk about Apache hbase so
what are the key features of hbase well
basically this is actually a wide column
store so Sam basically you know kind of
outer interface as Cassandra however
there are some pretty subtle differences
that do make a difference in performance
so the first thing is that it is built
on top of the hdfs or Hadoop distributed
file system which means that you are
using single leader replication you
write to a single node and then from
there that data gets replicated over
other nodes so this is going to be
slower than leaderless replication but
at the same time it means we don't have
to worry about write conflicts
additionally the indexes are based off
of the LSM tree so we should have
relatively fast writes but the
interesting thing about Apache hbase is
that the actual storage itself instead
of using a row wise storage model
actually uses column wise storage which
essentially means that instead of
storing data within the same row
together you actually store data within
the same column of the table together
and as a result you can achieve much
better data locality when you're trying
to read the entirety of just one column
so that's really useful
basically this is going to be really
great for times where you know you're
trying to get a very fast read on a
column of data so for example you know
let's say we're like Tick Tock or
something or you know even PornHub here
where when you actually press on a video
it shows you a few images of that video
that would be a great place to use
Apache hbase because you can actually
store that raw image data in the table
itself and use column compression to
quickly pull all of those images from
that column whereas you know if it were
row storage we would have to read
multiple rows you'd have less data
locality and the read would be slower
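To make the row-versus-column locality point concrete, here's a toy sketch in Python, with plain lists and dicts standing in for the on-disk layout (this illustrates the idea only, not HBase's actual file format):

```python
# Row-oriented layout: each row's fields are stored together.
rows = [
    {"video_id": 1, "title": "cats", "thumbnail": b"thumb-bytes-1"},
    {"video_id": 2, "title": "dogs", "thumbnail": b"thumb-bytes-2"},
]

# Column-oriented layout: each column's values are stored contiguously,
# so scanning one column touches only that column's data.
columns = {
    "video_id": [1, 2],
    "title": ["cats", "dogs"],
    "thumbnail": [b"thumb-bytes-1", b"thumb-bytes-2"],
}

# Reading every thumbnail from row storage walks all rows, dragging the
# unrelated fields of each row through cache along the way:
thumbs_from_rows = [r["thumbnail"] for r in rows]

# From column storage it's one contiguous sequence, with better locality
# (and similar values sitting together compress well too):
thumbs_from_cols = columns["thumbnail"]

assert thumbs_from_rows == thumbs_from_cols
```

Both layouts hold the same data; the difference is purely which reads get to be sequential.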
Okay, let's next talk about Memcached and Redis. These are technically not full database solutions, and the reason for that is that they are key-value stores implemented in memory. The subtle difference between the two is that Redis, I believe, has a bit richer of a feature set: you can do things like geospatial indexing, sorted sets, and HyperLogLogs. But at the end of the day, they're mostly just key-value stores in memory, and the point is that instead of using anything like an SSTable/LSM tree combo or a B-tree, you're simply using a hash map to store your data, so you don't need an index.
The one thing to note about a hash map is that it's worse for range queries. Anyway, the point about Memcached and Redis is that they're really useful for data that's going to be written and read very frequently; they're good for your most essential data, and especially for small data sets, because storing things in memory is just a lot more expensive than storing them on disk. So this is really great for things like caches, or if you have an essential part of your application that's getting written to and read from really often. One example is actually the geospatial index in an app like Uber or Lyft, which is constantly being updated and read with all the latest positions of drivers and riders; Redis would be very useful there.
Okay, next we have Neo4j. This is a less popular one, and the reason for that is that it's actually a graph database. To quickly explain the premise behind Neo4j: if we wanted to implement a graph with a SQL database, we could do it, but it would have to use a many-to-many relationship. The issue with a many-to-many relationship is that we would basically have a table with the IDs of two nodes, and the fact that two nodes appeared together in that many-to-many table would indicate that there was an edge between them. You could do that, but it's going to be very slow. The reason is that as that table gets bigger, its index gets bigger too, and keep in mind that the time complexity of reading from an index is proportional to the log of the number of elements in it. So as the table grows, our graph database would get slower if we implemented it with a SQL table. Instead, Neo4j is actually a native graph database, which basically means you have pointers to the actual on-disk location of the node at the other end of the edge, so you can traverse edges in constant time, as opposed to having to do logarithmic work every single time you want to traverse to a node. Again, this is kind of a niche topic; I can't really think of too many systems design questions where they actually want you to use a graph database, but if it ever does come up, this is really useful for things like map data, and perhaps also for modeling out friends on social media.
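The logarithmic-versus-constant traversal point can be sketched like this (a toy model; Neo4j's actual on-disk pointer format is of course more involved, and the names here are just for illustration):

```python
import bisect

# SQL-style: edges live in a many-to-many table, and each hop is an
# index lookup into it -- O(log E) per hop, getting slower as the
# edge table grows. A sorted list stands in for the B-tree index.
edges = sorted([("alice", "bob"), ("alice", "carol"), ("bob", "dave")])

def neighbors_via_index(node):
    i = bisect.bisect_left(edges, (node, ""))  # binary search: O(log E)
    out = []
    while i < len(edges) and edges[i][0] == node:
        out.append(edges[i][1])
        i += 1
    return out

# Native-graph style: each node holds direct references to its
# neighbors, so a hop is pointer-chasing in O(1), independent of how
# large the overall graph gets.
graph = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": [], "dave": []}

def neighbors_via_pointers(node):
    return graph[node]

assert neighbors_via_index("alice") == neighbors_via_pointers("alice")
```

Both return the same neighbors; the difference only shows up in how the per-hop cost scales with total graph size, which is what matters on deep traversals like friend-of-friend queries.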
Okay, another type of database we're going to talk about is the time series database. You'll see I listed a couple of examples at the top of this slide, but generally the point is this: time series databases are really useful when you're modeling a bunch of ingestion from many different data sources but also want to keep the data ordered by timestamp. So let's say we have five sensors and we want to model one day's worth of data for all of those sensors. The reason these are useful is that they do use LSM trees and SSTables, which means you can have relatively fast ingestion, because writes go to that in-memory buffer first. However, they also put a bit of a twist on the design to make it even more efficient. For starters, as opposed to having one huge LSM index for the entire table, they split the table into many smaller indexes. The reason for this is that in a time series database you're likely only going to need to access a small chunk of data at a time, and breaking the indexes into these small per-chunk indexes lets you fit an entire index, the SSTable I mean, in memory and read from it really quickly; you can use the CPU cache much more efficiently by having all these smaller indexes. Additionally, this is also really efficient for deleting data, because a lot of the time with time series databases, as the data gets too old you want to throw it out. That's really inefficient with a typical LSM index: you wouldn't actually delete the data directly, you would put a tombstone into the LSM tree saying that a key is to be deleted, and then eventually, when those SSTable files get merged together, the key gets thrown out. By just deleting the whole chunk index itself, this is a much faster process than waiting for compaction to occur. So again, a pretty niche type of database, but if you're ever dealing with a systems design problem about ingesting a ton of metrics or logs or other data like that, a time series database is definitely a way to go for at least part of that problem.
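Here's a toy model of that time-chunking trick. This is a sketch of the general idea (the chunk size and structure are made up for illustration, not taken from any particular time series database):

```python
from collections import defaultdict

CHUNK_SECONDS = 3600  # one small index per hour of data

class ChunkedTimeSeries:
    """Toy model of time-partitioned indexes: writes go to the chunk
    covering their timestamp, queries touch only the chunks they
    overlap, and retention is just dropping whole chunks -- no
    tombstones or compaction needed."""

    def __init__(self):
        self.chunks = defaultdict(list)  # chunk_start -> [(ts, value)]

    def write(self, ts, value):
        self.chunks[ts - ts % CHUNK_SECONDS].append((ts, value))

    def query(self, start, end):
        # Only scan the handful of chunks that overlap [start, end].
        out = []
        first = start - start % CHUNK_SECONDS
        for chunk_start in range(first, end + 1, CHUNK_SECONDS):
            out.extend((t, v) for t, v in self.chunks.get(chunk_start, [])
                       if start <= t <= end)
        return out

    def expire_before(self, cutoff):
        # Dropping an entire chunk is one delete, versus writing a
        # tombstone per key and waiting for SSTable compaction.
        for chunk_start in [c for c in self.chunks
                            if c + CHUNK_SECONDS <= cutoff]:
            del self.chunks[chunk_start]

ts_db = ChunkedTimeSeries()
ts_db.write(100, 1.0)       # lands in the first hour's chunk
ts_db.write(7300, 2.0)      # lands in a later chunk
ts_db.expire_before(3600)   # the whole first chunk is dropped outright
assert ts_db.query(0, 10000) == [(7300, 2.0)]
```

The same partitioning is also what lets a recent "hot" chunk stay resident in memory while old chunks live compressed on disk.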
Okay, now for some honorable mentions that probably won't come up as much in interviews, unless you're just having a fun discussion about databases. First we have VoltDB. "NewSQL" is the category I'm talking about here, and NewSQL basically means they're SQL databases, but implemented in interesting ways so as to get better performance than a traditional SQL database. VoltDB basically takes a SQL database but runs it all in memory, and it also runs it using only a single thread of a CPU. What this means is that there quite literally can't be any race conditions, because there's only one thread running; there are no concurrent operations actually occurring. And by running everything in memory, all of these operations happen so fast that single-threaded execution is actually feasible for performance. The issue with VoltDB is obviously that it's going to be expensive, since it's running in memory, and it's only going to be good for small data sets, again because it's running in memory. But it is kind of an interesting concept. Additionally, we have Spanner,
which is actually a Google product. The premise of Spanner is that you're also using SQL, but in order to avoid having to do a ton of locking to figure out which write came after which, what it does instead is put GPS clocks in the actual data centers to assign each write a relatively accurate timestamp, and then it uses those timestamps to figure out which write came after which. Once you can order all of these writes, it means all of the database replicas are able to converge to a consistent state, and we can basically achieve strong consistency by virtue of linearizability. Spanner is also very expensive, because we're no longer using commodity hardware; we're actually putting specialized clocks in the data center, which is not an easy thing to do.
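The ordering idea can be sketched like this. Real Spanner's TrueTime is far more subtle (it exposes bounded uncertainty intervals, not single timestamps), so treat this purely as a toy showing why accurately timestamped writes let replicas converge without coordinating:

```python
import random

# Each write is stamped at commit time by an accurate clock:
writes = [
    {"ts": 17.002, "key": "balance", "value": 100},
    {"ts": 17.005, "key": "balance", "value": 250},
    {"ts": 17.001, "key": "balance", "value": 50},
]

def apply_in_timestamp_order(write_log):
    """Any replica that applies the same set of writes sorted by
    timestamp reaches the same final state, no locking required --
    provided the clocks were accurate enough to order the writes."""
    state = {}
    for w in sorted(write_log, key=lambda w: w["ts"]):
        state[w["key"]] = w["value"]
    return state

# Replicas may receive the writes in different network orders, but
# they still agree on the outcome:
replica_a = apply_in_timestamp_order(random.sample(writes, len(writes)))
replica_b = apply_in_timestamp_order(random.sample(writes, len(writes)))
assert replica_a == replica_b == {"balance": 250}
```

The whole trick, and the reason for the special hardware, is that "provided the clocks were accurate enough" clause: with ordinary drifting server clocks, the sort order could disagree with real time.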
So again, cool discussions to have, but not really something you need to know in as much depth. A last, final honorable mention that I forgot to even make a slide about is data warehouses in general. Those generally use SQL-style formats, and you would probably run some sort of batch job first to properly format the data; but if a question ever comes down to analytics, something like a data warehouse could be very useful. Alrighty guys, well, I hope you enjoyed. I'm not really one for redundancy normally, and this is all information that's on my channel; I highly encourage you to check out the full videos on these topics, because they're a lot more in depth. But at the end of the day, I guess there is some value in a compilation, and worse comes to worst, hopefully it'll attract some new people to the channel who haven't been exposed to these topics before. So anyway, have a good one; I look forward to seeing you all in the next video.