Google SWE teaches systems design | EP28: Time Series Databases

Jordan has no life
2 May 2022 · 10:44

Summary

TL;DR: This video explores the design and benefits of time series databases, which are tailored for storing data that streams in over time, like logs or sensor readings. It discusses the importance of sorted data, compound indexing, and column-oriented storage for efficient reads and writes, and covers features like hyper tables and chunking, which improve performance, caching, and deletion of old data. The presenter emphasizes that while not every application requires a time series database, they are highly efficient for this kind of data, making them a valuable tool in the right scenarios.

Takeaways

  • 🕒 Time Series Databases are specialized for storing time-stamped data, such as logs from servers or sensor data.
  • 🔍 They are designed for high read and write throughput, particularly for ordered time series data.
  • 📈 Writes typically target recent time intervals, and data is rarely updated once written.
  • 🔗 Adjacent values in time series data are often similar, which can be leveraged for efficient data storage.
  • 🗂️ A compound index using timestamp and data source ID can effectively manage data from multiple sources over time intervals.
  • 📊 Reads are often focused on a single column of data and from a relatively small time interval, such as hours, days, or weeks.
  • 🗑️ Deletion of time series data often involves removing old data that is no longer relevant, such as data older than six months.
  • 🔑 Sorting data by source ID and then timestamp creates a natural compound index, which is beneficial for performance.
  • 📚 Column-oriented storage is advantageous for time series data as it simplifies aggregations and reduces disk I/O.
  • 🔬 Compression techniques can significantly reduce the storage space needed for time series data due to the similarity of values.
  • 🧩 The 'hyper table' concept, which is a collection of smaller 'chunk tables', optimizes performance by caching only relevant indexes in memory.
  • 🚀 This chunking design also simplifies the deletion process, allowing for the quick removal of entire chunks of outdated data.

Q & A

  • What are time series databases and why are they used?

    - Time series databases are specialized databases designed to handle time-stamped data, such as logs from servers or sensor data. They are used because they are optimized for high read and write throughput on ordered time series data, which makes them efficient for storing and querying large volumes of data collected over time.

  • What kind of data is typically stored in time series databases?

    - Time series databases are used to store data that comes in a consistent stream over a time interval, such as logs from servers, sensor readings, or any data that changes over time and requires time-stamping.

  • Why are time series databases optimized for writes to a recent time interval?

    - Writes are optimized for recent time intervals because time series data, such as sensor readings, are typically recorded once and not updated. They are usually appended towards the end of a time interval, making it efficient to write new data to the most recent part of the database.

  • How are time series databases optimized for reading data?

    - Time series databases are optimized for reading data by using column-oriented storage, which is efficient for aggregations and reduces disk I/O. Additionally, they use a 'hyper table' concept where data is divided into smaller chunks, allowing for faster access by keeping only relevant chunks' indexes in memory.

  • What is a 'hyper table' in the context of time series databases?

    - A 'hyper table' is a concept where all time series data on a node is represented as one huge table. However, this table is actually composed of many smaller 'chunk tables', each representing a combination of a source and a time interval. This design allows for efficient caching and deletion of data.
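
To make the layout concrete, here is a minimal Python sketch of the idea, not any particular database's implementation: a hypertable that routes each row to a chunk table keyed by (source ID, time bucket). The `Hypertable` class, the one-hour `chunk_interval_s`, and the method names are all illustrative assumptions.

```python
from collections import defaultdict

class Hypertable:
    """Illustrative hypertable: one logical table backed by many chunk tables.

    Each chunk holds rows for a single (source_id, time bucket) pair.
    """

    def __init__(self, chunk_interval_s=3600):
        self.chunk_interval_s = chunk_interval_s
        self.chunks = defaultdict(list)  # (source_id, bucket) -> list of rows

    def _chunk_key(self, source_id, timestamp):
        # Bucket timestamps into fixed-width intervals, e.g. one hour.
        return (source_id, int(timestamp) // self.chunk_interval_s)

    def insert(self, source_id, timestamp, value):
        # Writes for recent data touch only the newest chunk per source.
        self.chunks[self._chunk_key(source_id, timestamp)].append((timestamp, value))

    def query(self, source_id, start, end):
        # Only chunks overlapping [start, end) are examined; the rest stay cold.
        rows = []
        for bucket in range(int(start) // self.chunk_interval_s,
                            int(end) // self.chunk_interval_s + 1):
            rows.extend(r for r in self.chunks.get((source_id, bucket), [])
                        if start <= r[0] < end)
        return rows
```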

  • How do time series databases handle data that is no longer relevant, such as older time series data?

    - Time series databases typically handle stale data by deleting it in bulk. A common practice is to set a retention threshold, such as six months, and delete data older than that, since it is no longer needed for analysis or other purposes.

  • What is the benefit of using a compound index with timestamp and data source ID in time series databases?

    - Using a compound index with timestamp and data source ID allows for efficient querying and organization of data from multiple sources over various time intervals. It ensures that data from the same source is grouped together and ordered by timestamp, which is beneficial for both write and read operations.
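
As a rough illustration of why this key layout helps, here is a small Python sketch, with made-up sensor names, that keeps rows sorted by the compound key (source ID, timestamp) and uses binary search to pull one source's time range as a contiguous slice; `range_scan` is a hypothetical helper, not a real database API.

```python
import bisect

# Hypothetical sorted store: rows kept ordered by the compound key
# (source_id, timestamp), so one source's data over a time range is
# a single contiguous slice.
rows = sorted([
    ("sensor-a", 1000, 21.5),
    ("sensor-a", 1010, 21.6),
    ("sensor-b", 1000, 98.1),
    ("sensor-a", 1020, 21.6),
])

def range_scan(rows, source_id, t_start, t_end):
    """Binary-search the contiguous run for one source and time window."""
    keys = [(r[0], r[1]) for r in rows]
    lo = bisect.bisect_left(keys, (source_id, t_start))
    hi = bisect.bisect_left(keys, (source_id, t_end))
    return rows[lo:hi]

print(range_scan(rows, "sensor-a", 1000, 1015))
# [('sensor-a', 1000, 21.5), ('sensor-a', 1010, 21.6)]
```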

  • Why is column-oriented storage beneficial for time series databases?

    - Column-oriented storage is beneficial for time series databases because it simplifies the process of reading and aggregating data. Since time series operations often involve reading a single column of data at a time, column-oriented storage reduces the amount of disk I/O and allows for efficient data compression.
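
The following toy Python comparison, with invented field names, shows the difference: in a row layout, an aggregation drags every field along, while in a column layout, averaging one metric touches a single contiguous array.

```python
# Row-oriented: every read of one field drags whole rows off disk.
row_store = [
    {"ts": 1000, "temp": 21.5, "humidity": 40, "voltage": 3.3},
    {"ts": 1010, "temp": 21.6, "humidity": 41, "voltage": 3.3},
    {"ts": 1020, "temp": 21.6, "humidity": 41, "voltage": 3.2},
]

# Column-oriented: each field lives in its own contiguous array, so an
# aggregation over one metric touches only that array.
column_store = {
    "ts":       [1000, 1010, 1020],
    "temp":     [21.5, 21.6, 21.6],
    "humidity": [40, 41, 41],
    "voltage":  [3.3, 3.3, 3.2],
}

# Average temperature reads one column; the other three are never touched.
avg_temp = sum(column_store["temp"]) / len(column_store["temp"])
print(f"avg temp: {avg_temp:.2f}")  # avg temp: 21.57
```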

  • How do time series databases optimize for data compression?

    - Time series databases optimize for data compression by taking advantage of the fact that many values within a column are similar over a short period. They use encoding techniques such as run-length or bitmap encoding to reduce the amount of storage space required.
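
As a minimal sketch of one such technique, here is run-length encoding in Python over a made-up readings column; real engines combine several encodings (delta, bitmap, dictionary), but the principle is the same: long runs of similar values collapse into a few (value, count) pairs.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# Sensor readings barely change between adjacent timestamps, so runs
# of identical values are common and compress extremely well.
readings = [21.5, 21.5, 21.5, 21.5, 21.6, 21.6, 21.5, 21.5, 21.5]
print(run_length_encode(readings))  # [(21.5, 4), (21.6, 2), (21.5, 3)]
```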

  • What are some common operations performed on time series data?

    - Common operations on time series data include writes to recent time intervals, reads that typically access data from the same timestamp and data source combination, and deletes that involve removing older data that is no longer needed.

  • How does the chunking design in time series databases improve delete operations?

    - The chunking design improves delete operations by allowing for the deletion of entire chunks of data at once, rather than individual key-value pairs. This is more efficient because it avoids the need to update multiple indexes or in-memory buffers, and it simplifies the deletion process by removing entire files or chunks.
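
Here is a hedged Python sketch of retention-based deletion under the chunk layout assumed above (chunk tables keyed by source ID and hour-wide time bucket; `drop_expired_chunks` is an invented helper): expired chunks are dropped wholesale rather than tombstoning each row.

```python
import time

# Assuming the chunk layout sketched earlier: chunk tables keyed by
# (source_id, time_bucket), with hour-wide buckets.
CHUNK_INTERVAL_S = 3600

def drop_expired_chunks(chunks, retention_s, now=None):
    """Delete whole chunks past the retention window in O(#chunks),
    instead of writing one tombstone per expired row."""
    now = time.time() if now is None else now
    cutoff_bucket = int(now - retention_s) // CHUNK_INTERVAL_S
    expired = [key for key in chunks if key[1] < cutoff_bucket]
    for key in expired:
        del chunks[key]  # in a real engine: unlink the chunk's files
    return len(expired)

chunks = {("sensor-a", 0): ["old chunk"], ("sensor-a", 500_000): ["new chunk"]}
dropped = drop_expired_chunks(chunks, retention_s=180 * 24 * 3600)
print(dropped, list(chunks))  # 1 [('sensor-a', 500000)]
```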

Outlines

00:00

🕰️ Introduction to Time Series Databases

The video begins with an introduction to time series databases, which are specialized for storing and managing time-stamped data typically generated by servers, sensors, and other data streams. The speaker clarifies that the discussion will not focus on any specific time series database but will instead highlight common design decisions and features found across various types. The purpose of these databases is to provide efficient storage and retrieval for time series data, which is characterized by high read and write throughput and ordered data entries. The video aims to cover the use cases, access patterns, and the rationale behind the design of time series databases.

05:02

📈 Time Series Data Operations and Optimization

This paragraph delves into the operations typically performed on time series data, emphasizing the importance of understanding access patterns to optimize database design. The speaker discusses the nature of time series data, which often involves writing to recent time intervals and rarely updating past entries. The data is usually similar from one timestamp to the next, which is leveraged for efficient storage. The use of a compound index with timestamp and data source ID is highlighted as a way to manage data from multiple sources efficiently. The paragraph also touches on the common practice of deleting old data and the challenges of doing so in large datasets. The speaker suggests that the design of time series databases, including the use of sorted data and column-oriented storage, is tailored to optimize both writes and reads, with compression techniques further enhancing storage efficiency.

10:05

🔄 Advanced Optimization Techniques for Time Series Databases

The speaker introduces advanced optimization techniques used in time series databases, such as the concept of a 'hyper table' composed of smaller 'chunk tables', each representing a combination of a data source and a time interval. This design allows for significant performance improvements by enabling the caching of only relevant chunk indexes in memory, thus speeding up access times. The paragraph also explains how this chunking approach simplifies the deletion of old data by allowing the removal of entire chunks rather than individual entries. This method is more efficient than traditional delete operations in databases that do not employ such a design. The speaker concludes by emphasizing the benefits of using time series databases for managing time-stamped data due to their specialized features that enhance performance and storage efficiency.


Keywords

💡Time Series Databases

Time Series Databases are specialized databases designed to handle time-stamped data, which is a sequence of data points indexed in time order. They are crucial for applications that require the storage and analysis of data collected over time, such as financial data, sensor data, or server logs. In the video, the speaker discusses the general design decisions and optimizations in time series databases without focusing on any specific product, highlighting their importance in handling ordered time series data efficiently.

💡Design Decisions

Design decisions refer to the strategic choices made during the development of a system, such as a database, to optimize its performance and functionality for specific use cases. In the context of the video, the speaker abstracts design decisions from various time series databases, emphasizing how these decisions contribute to the databases' ability to manage time-stamped data effectively, including aspects like write and read operations, data storage, and deletion strategies.

💡Write Throughput

Write throughput is a measure of how much data a system can write to a storage device in a given amount of time. It is a critical performance metric for databases, especially time series databases, which often deal with high volumes of incoming data. The script mentions that time series databases are optimized for high write throughput, meaning they can handle a large number of data entries quickly and efficiently.

💡Read Throughput

Read throughput is the rate at which data can be retrieved from a database. Similar to write throughput, it is essential for time series databases, which may need to serve large amounts of historical data for analysis or reporting. The video script explains that time series databases are optimized for high read throughput, allowing for quick access to ordered time series data.

💡Data Access Patterns

Data access patterns describe the typical ways in which data is retrieved or manipulated in a database. Understanding these patterns is crucial for optimizing database performance. In the video, the speaker discusses how time series databases consider access patterns, such as the tendency to write to recent time intervals and read from specific time intervals and data sources, to structure their data storage and retrieval mechanisms effectively.

💡Column-Oriented Storage

Column-oriented storage is a database storage approach where data is stored column by column rather than row by row. This method is beneficial for time series databases because it allows for efficient storage and retrieval of large datasets where only a few columns are typically accessed at a time. The video mentions that column-oriented storage reduces disk I/O and facilitates data compression, making it ideal for time series data.

💡Compression

Compression in the context of databases refers to the process of reducing the size of stored data to save storage space and improve I/O efficiency. Time series databases often use compression techniques due to the nature of the data, which can include many similar or repeating values. The script explains that compression techniques like run-length or bitmap encoding can significantly reduce the storage requirements for time series data.

💡Chunk Tables

Chunk tables are a concept introduced in the video where time series data is divided into smaller, manageable pieces or 'chunks', each representing a combination of a source and a time interval. This design allows for efficient caching, deletion, and management of time series data. The speaker uses the term 'chunk tables' to describe this partitioning strategy within the 'hyper table' of a time series database, which enhances performance and simplifies operations.

💡Hyper Table

A hyper table, as discussed in the video, is a conceptual representation of all time series data within a database as a single, large table. However, instead of being a physical table, a hyper table is composed of multiple smaller chunk tables. This abstraction allows for optimizations in caching, deletion, and storage management, as each chunk can be handled independently while still being part of the larger dataset.

💡Caching

Caching is the process of storing frequently accessed data in a faster storage medium, such as RAM, to improve performance. In the context of time series databases, caching is used to keep the indexes of relevant chunk tables in memory, which speeds up data retrieval. The video script explains how the design of time series databases with chunk tables allows for efficient caching, as only the necessary chunks need to be cached.
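
As a rough sketch of this idea, and not any specific database's cache, the following Python class keeps only the most recently used chunk indexes in memory with an LRU eviction policy; `ChunkIndexCache`, its capacity, and `load_from_disk` are illustrative assumptions.

```python
from collections import OrderedDict

class ChunkIndexCache:
    """Illustrative LRU cache: only the indexes of recently touched
    chunks stay in memory; cold chunk indexes are evicted."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._cache = OrderedDict()  # chunk_key -> in-memory index

    def get(self, chunk_key, load_from_disk):
        if chunk_key in self._cache:
            self._cache.move_to_end(chunk_key)  # mark as recently used
            return self._cache[chunk_key]
        index = load_from_disk(chunk_key)  # slow path: page the index in
        self._cache[chunk_key] = index
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return index

# Since writes cluster on the newest chunks, the hot indexes stay cached
# while older chunk indexes fall out, keeping memory use bounded.
cache = ChunkIndexCache(capacity=2)
idx = cache.get(("sensor-a", 500_000), lambda k: {"loaded": k})
```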

💡Deletes

In the context of databases, deletes refer to the operation of removing data from the system. Time series databases often need to handle deletes efficiently, as they may deal with data that becomes irrelevant over time. The script discusses how the chunk table design in time series databases allows for quick deletion of old data by simply removing entire chunks, which is more efficient than deleting individual entries.

Highlights

Introduction to time series databases and their purpose.

Time series databases are tailored for storing time-stamped data like logs and sensor readings.

Use cases for time series data include consistent data streams over time intervals.

Specialized databases are beneficial for high read and write throughput of ordered time series data.

Design decisions in time series databases optimize for specific access patterns.

Writes in time series databases generally land near the end of the current time interval and are rarely updated.

Adjacent values in time series data are often similar, impacting data storage and retrieval.

Compound indexes using timestamp and data source ID are effective in time series databases.

Reads are typically from a specific timestamp and data source combination, focusing on one column of data.

Time series data is often deleted in bulk for data older than a certain threshold.

Optimizing writes involves sorting data by source ID and then timestamp for efficient storage.

Column-oriented storage is beneficial for time series data due to frequent single-column reads.

Compression techniques like run-length or bitmap encoding reduce storage needs for similar values.

The concept of a 'hyper table' in time series databases aggregates data into manageable chunks.

Caching strategies in time series databases leverage chunking for performance improvements.

Chunking simplifies the deletion process by allowing the removal of entire data segments.

Distributed time series databases can utilize chunking for efficient data partitioning across nodes.

Conclusion emphasizing the advantages of time series databases for specific types of data.

Transcripts

[00:00] Alrighty, I'm back for another video. Today we're going to be talking about time series databases, which shouldn't take too long. I'm not going to talk about any specific time series database, simply because in an interview no interviewer is ever going to say, "So, tell me about how TimescaleDB works"; that would be way too specific, because it's not that popular. But I did abstract a couple of design decisions away from a few different time series databases. What I say won't necessarily apply to every single time series database, but these did seem to be good ideas in use across multiple different types of them, and that's what I'll be covering in this video. So anyways, let's get into it.

[00:48] So, time series databases: what are they? In most applications or companies, at one point or another you're probably going to have to store some sort of time series data. This happens when you have logs from a bunch of servers, sensors, or any other data that comes in as a consistent stream over a time interval. As a result, a bunch of databases have been created that are specifically tailored to this type of application. We'll go over the typical use cases for time series data, but overall it's good to use a specialized tool for certain types of data whenever you can, and that's why these databases popped up. They're really good for high read and write throughput on specifically ordered time series data, and we'll talk about why they work and the design decisions their creators made to follow through on that promise.

[01:47] So what are some time series operations? To optimize for the data, we should consider the access patterns, to make sure our design handles all of them well. Writes, generally speaking, go to a recent time interval: you're not going to get sensor readings that are seven days late. You might get some that are a few minutes late due to network delays, but generally you're writing things once, not updating them, and you're inserting them towards the end of a time interval. Additionally, keep in mind that the adjacent values you're inserting from row to row are probably pretty similar if it's something like a sensor reading: one, the timestamps are going to be close together, and two, whatever values you're recording probably didn't change much in that small unit of time. So we end up with a bunch of similar values next to one another. Additionally, by using a compound index, like a timestamp in conjunction with some sort of data source ID, we can express all of the metrics we're getting from many different data sources and still cover all of those time intervals.

[03:00] In terms of reads and their access patterns: generally speaking, a read targets that same timestamp and data source combination, a tuple of those, but over just one column of data. Maybe we have a sensor that takes four readings at a time; on graphs we're probably only going to use one of those columns. Additionally, reads usually cover a relatively small time interval: you're not looking at years of data, generally just hours, days, or weeks. And in terms of deletes, the most common scenario is to take a bunch of your older time series data, say older than six months, and start getting rid of it; people don't always hold on to this analytics data for very long.

[03:50] Okay, so how would we go about optimizing those writes? The first thing, as I mentioned, is that it's very important that things are sorted both on the timestamp value and on the source ID, where the source might be the server producing the logs or the sensor producing the metrics. The point is that time series data from the same source should be grouped together, and it should be ordered by the timestamp itself. We now have a tuple that provides a very natural compound index, and this way writes from the same data source over similar intervals of time land on the same node, assuming we're doing some type of sharding that way. Whether we're sharding over multiple nodes or staying within one node, as long as we keep those writes together they should be relatively quick.

[04:46] In terms of optimizing reads, storing data in a column-oriented format is going to be great because, like I said, you generally want to read one column of data at a time. For aggregations over the data, column-oriented storage makes that really easy: all the data is stored together in one file, which reduces a lot of disk I/O. Additionally, since the values within one column are going to be so similar, we can do a lot of encoding on them, whether run-length or bitmap encoding. Whatever compression library a time series database uses, it can greatly reduce the amount of storage you need.

[05:32] Then, to optimize reads further, there's something I saw that seems unique to time series databases, and I haven't seen it before, so I'll go into a bit of depth on it. What a few of these databases do is represent all the time series data on one node as one huge table covering both timestamps and source IDs. That might insinuate that there's one contiguous index for the entire table. Instead, these databases abstract everything into one huge logical table, which they call the hyper table, but the hyper table is really made out of many mini chunk tables, where each chunk table is a combination of a source and time interval tuple, for example a time interval plus a sensor ID.

[06:40] There are some huge benefits to this design. First, since most writes only touch a couple of these chunks at a time, because we're mostly modifying recent data, we can achieve much better performance by caching the entire index of only the relevant chunks in memory. If we had the entire hyper table without smaller individual indexes, we'd end up with one huge B-tree or a pile of SSTable files: too much index information to keep in memory at once, so we'd constantly be swapping pages between disk and memory to access and change them, which adds a ton of overhead. By creating these sets of smaller indexes, we keep only the relevant small indexes in memory, and that hugely speeds up performance.

[07:39] Additionally, this chunking design, as opposed to one huge table, really helps optimize deletes. As I said, it's a very common use case to take a ton of old data and just wipe it out because it's past the time threshold where it's relevant. Breaking the main table into chunks hugely improves the speed of that exact operation. Why? Think about an LSM tree, which we've talked about a lot in the past: each delete is its own write, which means it goes into the in-memory buffer, and eventually a tombstone is added to an SSTable file, where it sits until compaction finally removes the key. Writing a ton of deletes that way is super inefficient; it would be better to just delete the entire index and drop the file as a whole. That's exactly what we can do with chunks: you literally just delete the chunk, which is great. The same applies to B-trees: with a ton of individual deletes, every deleted key-value pair requires traversing the tree and removing the entry or nulling its pointer, whereas now we can delete the entire index. This becomes much faster, and it's a pretty common operation in time series databases.

[08:58] Okay, I know this was a short video, but in conclusion: even though storing time series data isn't something every application has to do, if you are dealing with time series data, using a time-series-specific database is a really good way to go. By creating separate indexes for each data source and time interval, you get really good cache performance, the ability to quickly delete things, and a smaller index that's easy to write to because you can just write to the cache. Additionally, the fact that the data is so similar means column-oriented storage can greatly reduce the storage space you'll need, while making it really easy to do aggregations over the data. Even though not every time series database is identical, these are generally their main features, which let them handle this type of data so well and so quickly. So if it comes up in an interview that you have some sort of logging data, you should definitely consider a time series database. Like I said, those chunks make a huge difference, and it's interesting because it's the first time we've really seen partitioning within a single node, other than the fixed-size partitioning scheme I've talked about a bit in the past; this is a way of using adaptive-size chunks on a single node. And obviously, if you were to scale out your time series database in a distributed manner over multiple nodes, you can imagine that one way or another those chunks would be placed on different machines and hashed somehow.

[10:35] Okay guys, I hope you enjoyed the video and that it was useful. I'll figure out what to do for tomorrow's, but I'll speak to you guys soon.


Related Tags
Time Series, Databases, Data Storage, Efficiency, Sensor Data, Log Analysis, Data Streams, Indexing, Compression, Aggregation