Google SWE teaches systems design | EP28: Time Series Databases
Summary
TLDR: This video explores the design and benefits of time series databases, tailored for storing data that streams in over time, like logs or sensor readings. It discusses the importance of sorted data, compound indexing, and column-oriented storage for efficient read and write operations. The script delves into unique features like hyper tables and chunking, which optimize performance, caching, and deletion of old data. The presenter emphasizes that while not every application requires time series databases, they are highly efficient for handling such data, making them a valuable tool in the right scenarios.
Takeaways
- 🕒 Time Series Databases are specialized for storing time-stamped data, such as logs from servers or sensor data.
- 🔍 They are designed for high read and write throughput, particularly for ordered time series data.
- 📈 Write operations typically target recent time intervals, and data is rarely updated once written.
- 🔗 Adjacent values in time series data are often similar, which can be leveraged for efficient data storage.
- 🗂️ A compound index using timestamp and data source ID can effectively manage data from multiple sources over time intervals.
- 📊 Reads are often focused on a single column of data and from a relatively small time interval, such as hours, days, or weeks.
- 🗑️ Deletion of time series data often involves removing old data that is no longer relevant, such as data older than six months.
- 🔑 Sorting data by timestamp and source ID creates a natural compound index, which is beneficial for performance.
- 📚 Column-oriented storage is advantageous for time series data as it simplifies aggregations and reduces disk I/O.
- 🔬 Compression techniques can significantly reduce the storage space needed for time series data due to the similarity of values.
- 🧩 The 'hyper table' concept, which is a collection of smaller 'chunk tables', optimizes performance by caching only relevant indexes in memory.
- 🚀 This chunking design also simplifies the deletion process, allowing for the quick removal of entire chunks of outdated data.
Q & A
What are time series databases and why are they used?
-Time series databases are specialized databases designed to handle time-stamped data, such as logs from servers or sensor data. They are used because they are optimized for high read and write throughput for ordered time series data, which makes them efficient for storing and querying large volumes of data that are collected over time.
What kind of data is typically stored in time series databases?
-Time series databases are used to store data that comes in a consistent stream over a time interval, such as logs from servers, sensor readings, or any data that changes over time and requires time-stamping.
Why are time series databases optimized for writes to a recent time interval?
-Writes are optimized for recent time intervals because time series data, such as sensor readings, are typically recorded once and not updated. They are usually appended towards the end of a time interval, making it efficient to write new data to the most recent part of the database.
How are time series databases optimized for reading data?
-Time series databases are optimized for reading data by using column-oriented storage, which is efficient for aggregations and reduces disk I/O. Additionally, they use a 'hyper table' concept where data is divided into smaller chunks, allowing for faster access by keeping only relevant chunks' indexes in memory.
What is a 'hyper table' in the context of time series databases?
-A 'hyper table' is a concept where all time series data on a node is represented as one huge table. However, this table is actually composed of many smaller 'chunk tables', each representing a combination of a source and a time interval. This design allows for efficient caching and deletion of data.
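To make the chunking idea concrete, here is a minimal Python sketch of a hyper table backed by per-(source, time-bucket) chunk tables. The `HyperTable` class, the one-hour `chunk_interval`, and the `(source_id, bucket_start)` key are illustrative assumptions for this example, not the schema of any particular database.

```python
from collections import defaultdict


class HyperTable:
    """Toy 'hyper table': one logical table backed by many small chunk tables,
    each keyed by (source_id, time bucket)."""

    def __init__(self, chunk_interval=3600):
        self.chunk_interval = chunk_interval    # seconds covered by each chunk
        self.chunks = defaultdict(list)         # (source_id, bucket_start) -> [(ts, value), ...]

    def _chunk_key(self, source_id, timestamp):
        bucket_start = timestamp - (timestamp % self.chunk_interval)
        return (source_id, bucket_start)

    def insert(self, source_id, timestamp, value):
        # A write lands in the chunk for (source, recent time bucket), so only
        # that chunk's index needs to be hot in memory.
        self.chunks[self._chunk_key(source_id, timestamp)].append((timestamp, value))

    def query(self, source_id, start, end):
        # A read touches only the chunks whose buckets overlap [start, end).
        rows = []
        bucket = start - (start % self.chunk_interval)
        while bucket < end:
            for ts, value in self.chunks.get((source_id, bucket), []):
                if start <= ts < end:
                    rows.append((ts, value))
            bucket += self.chunk_interval
        return rows


ht = HyperTable()
ht.insert("sensor-1", 1_700_000_005, 21.4)
print(ht.query("sensor-1", 1_700_000_000, 1_700_003_600))  # [(1700000005, 21.4)]
```

Because writes and reads over a short, recent window only ever touch a handful of chunks, a real database can keep just those chunks' indexes cached in memory.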
How do time series databases handle data that is no longer relevant, such as older time series data?
-Time series databases often handle irrelevant data by deleting it. A common practice is to set a time threshold, like six months, and delete data older than this threshold, as it is no longer needed for analysis or other purposes.
What is the benefit of using a compound index with timestamp and data source ID in time series databases?
-Using a compound index with timestamp and data source ID allows for efficient querying and organization of data from multiple sources over various time intervals. It ensures that data from the same source is grouped together and ordered by timestamp, which is beneficial for both write and read operations.
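A minimal in-memory sketch of such a compound index, assuming a sorted list stands in for the real on-disk B-tree or SSTables; the `CompoundIndex` name and its methods are hypothetical. Because rows are ordered by the `(source_id, timestamp)` tuple, one source's data over a time range is a contiguous slice.

```python
import bisect


class CompoundIndex:
    """Rows kept sorted by (source_id, timestamp): grouped by source, ordered by time."""

    def __init__(self):
        self._keys = []   # sorted list of (source_id, timestamp) tuples
        self._rows = []   # values aligned with _keys

    def insert(self, source_id, timestamp, value):
        key = (source_id, timestamp)
        pos = bisect.bisect_left(self._keys, key)
        self._keys.insert(pos, key)
        self._rows.insert(pos, value)

    def range_query(self, source_id, start_ts, end_ts):
        # Half-open range [start_ts, end_ts) for a single source: one contiguous slice.
        lo = bisect.bisect_left(self._keys, (source_id, start_ts))
        hi = bisect.bisect_left(self._keys, (source_id, end_ts))
        return list(zip(self._keys[lo:hi], self._rows[lo:hi]))


idx = CompoundIndex()
idx.insert("sensor-1", 100, 21.3)
idx.insert("sensor-2", 100, 55.0)
idx.insert("sensor-1", 160, 21.4)
print(idx.range_query("sensor-1", 100, 200))  # only sensor-1's rows, in time order
```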
Why is column-oriented storage beneficial for time series databases?
-Column-oriented storage is beneficial for time series databases because it simplifies the process of reading and aggregating data. Since time series operations often involve reading a single column of data at a time, column-oriented storage reduces the amount of disk I/O and allows for efficient data compression.
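A small sketch of why this matters, using plain Python lists as stand-in columns. Real column stores persist each column as its own (compressed) file on disk, but the access pattern is the same: a single-column aggregate reads exactly one column and skips everything else.

```python
# Row-oriented layout: every record stores all of its fields together, so an
# average over one field still touches every field of every row.
rows = [
    {"ts": 1, "temp": 21.3, "humidity": 40, "pressure": 1012},
    {"ts": 2, "temp": 21.4, "humidity": 41, "pressure": 1012},
    {"ts": 3, "temp": 21.4, "humidity": 41, "pressure": 1013},
]
avg_temp_rowwise = sum(r["temp"] for r in rows) / len(rows)

# Column-oriented layout: each field is its own contiguous array, so the same
# aggregation reads only the "temp" column (far less disk I/O in a real store).
columns = {
    "ts": [1, 2, 3],
    "temp": [21.3, 21.4, 21.4],
    "humidity": [40, 41, 41],
    "pressure": [1012, 1012, 1013],
}
avg_temp_colwise = sum(columns["temp"]) / len(columns["temp"])

print(avg_temp_rowwise, avg_temp_colwise)  # same result, very different I/O
```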
How do time series databases optimize for data compression?
-Time series databases optimize for data compression by taking advantage of the fact that many values within a column are similar over a short period. They use encoding techniques such as run-length or bitmap encoding to reduce the amount of storage space required.
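A minimal sketch of run-length encoding, one of the schemes mentioned above, applied to a column of near-constant readings. The delta-encoding helper for timestamps is an added illustration of the same principle (adjacent values barely change, so storing differences is cheap); the video names run-length and bitmap encoding but not delta encoding specifically.

```python
def run_length_encode(values):
    """Collapse runs of identical values into (value, run_length) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded


def delta_encode(timestamps):
    """Store the first timestamp plus small deltas instead of full values."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]


readings = [21, 21, 21, 22, 22, 21, 21, 21]
print(run_length_encode(readings))   # [(21, 3), (22, 2), (21, 3)]

timestamps = [1_700_000_000, 1_700_000_010, 1_700_000_020, 1_700_000_030]
print(delta_encode(timestamps))      # [1700000000, 10, 10, 10]
```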
What are some common operations performed on time series data?
-Common operations on time series data include writes to recent time intervals, reads that typically access data from the same timestamp and data source combination, and deletes that involve removing older data that is no longer needed.
How does the chunking design in time series databases improve delete operations?
-The chunking design improves delete operations by allowing for the deletion of entire chunks of data at once, rather than individual key-value pairs. This is more efficient because it avoids the need to update multiple indexes or in-memory buffers, and it simplifies the deletion process by removing entire files or chunks.
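A minimal sketch of that retention pattern. The chunk key layout `(source_id, bucket_start, bucket_end)` and the `RETENTION_SECONDS` constant are illustrative assumptions (the six-month figure just mirrors the example threshold used in the video); the point is that expiry removes whole chunks rather than individual rows.

```python
import time

RETENTION_SECONDS = 180 * 24 * 3600  # roughly six months


def drop_expired_chunks(chunks, now=None):
    """Delete whole chunks whose time bucket ended before the retention cutoff.

    `chunks` maps (source_id, bucket_start, bucket_end) -> rows. Deleting a key
    discards the chunk and its index in one step, instead of tombstoning or
    removing key-value pairs one at a time as an LSM tree or B-tree delete would.
    """
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    expired = [key for key in chunks if key[2] < cutoff]
    for key in expired:
        del chunks[key]
    return expired


now = int(time.time())
chunks = {
    ("sensor-1", 0, 3600): [(10, 21.3)],                  # ancient data
    ("sensor-1", 3600, 7200): [(3700, 21.4)],             # ancient data
    ("sensor-1", now - 3600, now): [(now - 60, 21.5)],    # recent data, kept
}
print(drop_expired_chunks(chunks))  # the two ancient chunks are dropped wholesale
```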
Outlines
🕰️ Introduction to Time Series Databases
The video begins with an introduction to time series databases, which are specialized for storing and managing time-stamped data typically generated by servers, sensors, and other data streams. The speaker clarifies that the discussion will not focus on any specific time series database but will instead highlight common design decisions and features found across various types. The purpose of these databases is to provide efficient storage and retrieval for time series data, which is characterized by high read and write throughput and ordered data entries. The video aims to cover the use cases, access patterns, and the rationale behind the design of time series databases.
📈 Time Series Data Operations and Optimization
This paragraph delves into the operations typically performed on time series data, emphasizing the importance of understanding access patterns to optimize database design. The speaker discusses the nature of time series data, which often involves writing to recent time intervals and rarely updating past entries. The data is usually similar from one timestamp to the next, which is leveraged for efficient storage. The use of a compound index with timestamp and data source ID is highlighted as a way to manage data from multiple sources efficiently. The paragraph also touches on the common practice of deleting old data and the challenges of doing so in large datasets. The speaker suggests that the design of time series databases, including the use of sorted data and column-oriented storage, is tailored to optimize both writes and reads, with compression techniques further enhancing storage efficiency.
🔄 Advanced Optimization Techniques for Time Series Databases
The speaker introduces advanced optimization techniques used in time series databases, such as the concept of a 'hyper table' composed of smaller 'chunk tables', each representing a combination of a data source and a time interval. This design allows for significant performance improvements by enabling the caching of only relevant chunk indexes in memory, thus speeding up access times. The paragraph also explains how this chunking approach simplifies the deletion of old data by allowing the removal of entire chunks rather than individual entries. This method is more efficient than traditional delete operations in databases that do not employ such a design. The speaker concludes by emphasizing the benefits of using time series databases for managing time-stamped data due to their specialized features that enhance performance and storage efficiency.
Keywords
💡Time Series Databases
💡Design Decisions
💡Write Throughput
💡Read Throughput
💡Data Access Patterns
💡Column-Oriented Storage
💡Compression
💡Chunk Tables
💡Hyper Table
💡Caching
💡Deletes
Highlights
Introduction to time series databases and their purpose.
Time series databases are tailored for storing time-stamped data like logs and sensor readings.
Use cases for time series data include consistent data streams over time intervals.
Specialized databases are beneficial for high read and write throughput of ordered time series data.
Design decisions in time series databases optimize for specific access patterns.
Writes in time series databases generally land at the end of a time interval, and existing entries are not updated.
Adjacent values in time series data are often similar, impacting data storage and retrieval.
Compound indexes using timestamp and data source ID are effective in time series databases.
Reads are typically from a specific timestamp and data source combination, focusing on one column of data.
Time series data is often deleted in bulk for data older than a certain threshold.
Optimizing writes involves sorting data by timestamp and source ID for efficient storage.
Column-oriented storage is beneficial for time series data due to frequent single-column reads.
Compression techniques like run-length or bitmap encoding reduce storage needs for similar values.
The concept of a 'hyper table' in time series databases aggregates data into manageable chunks.
Caching strategies in time series databases leverage chunking for performance improvements.
Chunking simplifies the deletion process by allowing the removal of entire data segments.
Distributed time series databases can utilize chunking for efficient data partitioning across nodes.
Conclusion emphasizing the advantages of time series databases for specific types of data.
Transcripts
alrighty i am back for another video
today we're going to be talking about
time series databases which shouldn't
take too long but um yeah just generally
the points i'm going to make in this
video i'm not going to talk about any
specific time series databases simply
just because i feel like in an interview
no interviewer is ever just going to be
like yeah so um tell me about how
timescale db works i mean that would be like
way too specific because it's not that
popular
but i did abstract
you know a couple of design decisions
away from a few different time series
databases so what i might say doesn't
necessarily apply to every single time
series database but they did seem to be
good ideas that are in use in multiple
different types of them and as a result
that's kind of what i'm going to be
covering in this video
so anyways let's get into that
alrighty so time series databases well
what are they in most applications or
most companies at one point or another
you're going to have to probably store
some sort of time series data
this happens when you have logs from a
bunch of servers sensors or any other
type of you know data that's coming in
at a consistent type of stream
over a time interval
as a result a bunch of databases have
actually been created that are
specifically tailored towards this type
of application
we'll go over kind of the use cases of
you know what you might typically be
doing when dealing with time series data
in the first place but
overall
it's obviously good to use a specialized
tool for certain types of data whenever
you can and as a result that's why these
types of databases popped up
so they're really good for high read and
write throughput for specifically
ordered time series data and we'll talk
about why they work and you know kind of
the design decisions some of the
creators have made in order to kind of
follow through with that promise
so what are some time series operations
well in order to optimize for the data
we should definitely consider the access
patterns used in order to basically make
sure that all of those access patterns
are kind of being handled the best by
our design
okay so writes generally speaking the
writes are going to go to a recent time
interval you know like you're not going
to get sensor readings that are like
seven days late you might get some that
are a few minutes late due to network
delays but generally speaking you're
just writing things once not updating
them and you're inserting them generally
towards the end of a time interval
and then additionally keep in mind the
adjacent values that you're inserting
from row to row are probably going to be
pretty similar if it's something like a
sensor reading for example
one the time stamps are going to be
pretty similar and two whatever values
you're sharing probably didn't change
that much in the you know the small unit
of time so as a result we have a bunch
of similar values next to one another
additionally
by using a compound index like timestamp
in conjunction with some sort of data
source id we can kind of express you
know all of the metrics that we're
getting from a bunch of possible
different data sources and still be able
to cover all of those time intervals
in terms of you know reads and kind of
the access patterns there
generally speaking
it's going to be from that same
timestamp data source combination so
just a tuple of those but just over one
column of data so maybe we have a sensor
that makes you know four readings per
time
generally speaking on graphs we're
probably only going to be using one of
those columns of data
additionally they're all probably from a
relatively small time interval
you know you're not looking at years of
data generally speaking it's probably
only you know hours days weeks
and then in terms of deletes the the
most common scenario is to you know take
a bunch of your older time series data
say older than six months and just start
getting rid of that you know it's it's
not always the case that people will
hold this analytics data for such a long
time
okay so how would we go about optimizing
those writes well the first thing is i
kind of mentioned that
it's very important that things are
sorted both on that timestamp value and
also on the kind of source id where the
source might be either like you know the
server producing the logs or the sensor
producing a bunch of metrics but the
point is
time series data from the same source
should be grouped together and it should
probably be ordered by the time stamp
itself so we now kind of have this tuple
that
provides like a very natural compound
index for us and so this way when we
have this we allow writes from the same
data source over similar intervals of
time to be on the same node assuming
we're doing you know say some type of
sharding that way
and you know if we're sharding over
multiple nodes or even within one node
as long as we're keeping those writes
together they should be relatively quick
in terms of optimizing reads
storing data in a column-oriented
storage type is actually going to be
great because like i said generally
speaking you want to read probably one
column of data at a time and that means
that in terms of performing aggregations
on the data
having column oriented storage makes
that really easy because all the data is
stored together in one file it reduces a
lot of disk io
additionally since all those values are
going to be so similar over you know the
duration of one column we can do a ton
of encoding on it so that might be run
length or bitmap encoding but the point
is you know whatever library a time
series database does use it can greatly
reduce the amount of storage that you
need to be using
by virtue of using this compression
okay
and then in terms of optimizing reads
further
this is something that i saw that was
kind of unique to time series data and i
haven't seen it before and so i'm going
to go into a little bit of depth on it
so
imagine that we have
as you can see in the bottom right here
we have this hyper table so basically
what a few of these databases do is they
represent all that time series data
on one node as one huge table where that
table represents both you know
time stamps and the source id
and the point is it's kind of treated as
if it were this one huge table which
might insinuate that for that entire
table there's one contiguous index right
well actually instead what these
databases have chosen to do
are abstract away all of these things
into the one huge table which they call
the hyper table but really the hyper
table is made out of all these mini
chunk tables where chunk tables are a
combination of some sort of source and
time interval tuple so as you can see on
the bottom here i have these four chunks
and each one is just kind of a
combination of a time interval plus a
you know a sensor id
and so as a result of that there are
some huge benefits to this type of
design schema first of all since most
writes are only going to actually access
a couple of these at a time because like
i said we're
kind of modifying things that are only
recent we can achieve much better
performance by caching the entire index
of only the relevant chunks in memory
for example if we had the entire hyper
table
and we didn't have you know smaller
individual indexes what we would have to
end up doing is have this entire huge b
tree or all of these ss table files and
then we would have too much index
information to potentially have in
memory at once and we would constantly
be swapping in page files from disk in
and out to memory to you know access
them and change them and as a result
that would add a ton of overhead in
terms of doing that disk to memory swap
so as a result by creating all of these
sets of smaller indexes we can only keep
the relevant small indexes in memory and
that hugely speeds up performance
additionally having this chunking design
as opposed to just one huge table really
helps by optimizing deletes
like i said
it's a very common use case that you
kind of just want to take a ton of old
data and just wipe it out because you no
longer need it anymore it's kind of past
that time threshold that it's relevant
so by breaking the main table into all
these chunks we can hugely improve the
speed of that exact operation why well
if you think about like an lsm tree which
we've spoken about a lot in the past
each delete of data is its own write
which means it has to go into that
in-memory buffer and then you eventually
add a tombstone to an ss table file
which exists there until it's compacted
and then finally that key is deleted
but if we're writing a ton of deletes
that's going to be super inefficient
instead it would just be better if we
actually just deleted the entire index
and just dropped the file as a whole so
that's what we can do here by using
these chunks you literally just delete
the chunk which is great
and then the same thing applies to b
trees if we were doing a ton of deletes
every single time we deleted a key value
pair we would have to go ahead and
traverse through that b tree table and
actually get rid of the key value or
just you know set the pointer to it to
null whereas now we can just go ahead
and delete that entire index so this is
something that becomes much faster and
it's actually a pretty common thing that
happens in time series databases
okay so i know this was a short video
but um in conclusion even though storing
time series data is not something that
you know every application has to do all
the time if you are dealing with time
series data using a time series specific
database is a really good way to go
between creating separate indexes for
each data source and time interval you
can take advantage of really good cache
performance
as well as the ability to quickly delete
things
and also you know just having a smaller
index which makes things really easy to
write to because you can just write to
the cache and then additionally the fact
that all that data is so similar
means that using column oriented storage
can greatly reduce the amount of storage
space that you're going to need
as well as making it really easy to do
aggregations over some sort of data
so
even though not every time series
database is identical these are
generally the main features of them
which allow them to handle this type of
data so well and so quickly so as a
result you know if it comes up in an
interview that you know you have some
sort of logging type of data then you
should definitely consider a time series
database like i said those chunks are
making a huge difference and you know
it's kind of unique because it's the
first time
that we really see partitioning within a
single node other than kind of using
that fixed size partition schema that
i've spoken about a bit in the past this
is actually a way of using adaptive size
chunks on a single node and then
obviously if you were to scale out your
time series database in a distributed
manner over multiple nodes then you
could probably imagine that one way or
another those chunks would be put on
different machines and you know hashed
somehow
okay guys so i hope you enjoyed the
video and i hope it was useful um i'll
figure out what to do for tomorrow's but
uh
i'll speak to you guys soon