How to Build a Streaming Database in Three Challenging Steps | Materialize

Data Council
11 May 2023 · 34:56

Summary

TL;DR: The talk introduces the concept of a streaming database, specifically Materialize, which operates as a database that anticipates and reacts to data changes on its own. It contrasts traditional databases, which act only in response to user commands, with streaming databases that can proactively work on the user's behalf. The talk covers the scalability and cloud-native aspects of streaming databases, highlighting the importance of decoupling the storage, compute, and adapter layers for performance and consistency, and introduces the concept of virtual time to coordinate operations across the components of the system. A demonstration shows the system handling concurrent queries, maintaining low latency, and scaling up or down without interrupting service, providing a seamless experience for users.

Takeaways

  • 📚 Materialize is a streaming database that allows for real-time data processing and interaction.
  • 🔄 It supports standard SQL operations, enabling users to create tables, insert data, and run queries.
  • 🚀 Streaming databases proactively work on behalf of users, anticipating needs and preparing data ahead of time.
  • 📊 The concept of 'create view' in SQL is used to establish a long-term relationship with a query, allowing the database to anticipate and prepare for future requests.
  • 🔧 Work done in response to 'create view' is an ongoing cost, but it serves as a prepayment for potential future work, improving efficiency.
  • 📈 Streaming databases offer a new dimension in deciding when work gets done, either as data is ingested or when a query is made.
  • 🌐 Scalable cloud-native streaming databases allow for the addition of more resources without disrupting existing use cases.
  • 🔗 Virtual time is a crucial concept that decouples the storage, compute, and adapter layers, allowing for independent scaling and coordination.
  • 🔄 The storage layer ensures durability by recording data changes with timestamps, providing a consistent view of updates.
  • 🧠 The compute layer processes data flows, maintaining views and indexes for low-latency access to data.
  • 🔌 The adapter layer coordinates SQL commands, providing a facade of a single, consistent streaming database experience.

Q & A

  • What is a streaming database and how does it differ from a traditional database?

    -A streaming database is a modification of a traditional database that allows it to take action on its own, anticipating user needs based on data changes. Unlike traditional databases that respond to user commands, streaming databases can proactively perform tasks, such as updating views or indexes, in response to data changes without explicit user queries.

  • What is the significance of the 'create view' and 'select' commands in SQL in the context of a streaming database?

    -In a streaming database, the 'create view' command is used to establish a long-lived relationship with a query, hinting to the database that the query will be accessed repeatedly. This allows the database to anticipate and prepare for these queries, potentially improving performance. The 'select' command, on the other hand, is used to request immediate results from the database, which can be faster due to the pre-emptive work done by 'create view'.

  • How does a streaming database optimize the trade-off between work done at data ingestion time and when a user asks a query?

    -A streaming database optimizes this trade-off by performing work at data ingestion time, which is an ongoing cost as data changes. This prepayment of work can lead to faster response times when a 'select' query is issued, as the database has already done some of the necessary processing in anticipation of the query.

  • What is the role of virtual time in a scalable, cloud-native streaming database?

    -Virtual time assigns a timestamp to every event that enters the system, such as data updates and SQL commands, and asks each layer to produce the answers that would be correct if those events happened at exactly those times. This coordinates the storage, compute, and adapter layers without forcing them to synchronize: each layer proceeds as fast as it can, while the system as a whole presents a single consistent timeline.

  • How does the storage layer in a streaming database ensure durability?

    -The storage layer ensures durability by recording data updates with timestamps and maintaining a log of these changes. It surfaces these updates consistently and durably, allowing for the reconstruction of the data at any point in time, thus providing a reliable foundation for the rest of the system.

  • What are the challenges faced by the compute layer in a streaming database?

    -The compute layer's main challenge is to process and maintain views as data changes, transforming time-varying collections into output while ensuring low latency access to data. It must also manage the trade-off between maintaining indexes for fast access and writing results back to the storage layer for other compute instances to use.

  • What is the function of the adapter layer in a streaming database?

    -The adapter layer is responsible for providing the appearance of a single, consistent streaming database. It sequences SQL commands with timestamps, ensuring that the system behaves as if all events occur in a total order. This layer also manages the consistency of the system, allowing for multiple independent operations to occur without interfering with each other.

  • How does a streaming database handle scaling and the addition of new use cases?

    -A streaming database handles scaling by letting users add new use cases and compute resources in a purely additive way, without disturbing the use cases that already exist. This is achieved through virtual time, which decouples the execution of the different layers, allowing them to operate independently and scale as needed without cross-contamination of work.

  • What are the benefits of using a streaming database for low-latency applications?

    -Streaming databases provide real-time data updates and allow for the materialization of views, which can significantly reduce query response times. This makes them ideal for low-latency applications where users require immediate and interactive access to the most current data.

  • How does a streaming database ensure data consistency across multiple users or teams working independently?

    -By using virtual time, a streaming database ensures that all users or teams working with the database see a consistent view of the data, as if all operations were executed simultaneously. This eliminates the need for each team to manage data synchronization manually, simplifying the development and maintenance of complex applications.

Outlines

00:00

📚 Introduction to Streaming Databases

The speaker introduces the concept of a streaming database, emphasizing its ability to anticipate user needs and perform actions autonomously. The Materialize database is highlighted as a streaming database that supports standard SQL operations, allowing users to create tables, insert data, and run queries. The key difference is that a streaming database can initiate work based on data changes, rather than waiting for user commands. The speaker also discusses the trade-offs between work done at data ingestion time versus when a query is issued, and the benefits of pre-emptive work in terms of performance and user experience.

05:01

🔄 Auction Data Use Case

The speaker presents a hypothetical auction data use case to illustrate the streaming database's capabilities. The scenario involves tracking active bids and outbids in real-time. The speaker explains how Materialize can create views to collect and update this information, providing a live feed of auction dynamics. The use case demonstrates the streaming database's ability to maintain low latency and high interactivity, as well as its potential for scalability and economic efficiency in data processing.

10:02

🛠️ Scalable Cloud Native Streaming Databases

The speaker delves into the architecture of a scalable cloud-native streaming database, highlighting the importance of coordination and consistency across multiple layers. The three layers discussed are the storage layer for durability, the compute layer for processing, and the adapter layer for user interaction. The concept of virtual time is introduced as a mechanism to coordinate these layers, allowing for decoupling of execution while maintaining a consistent view of data changes. The speaker emphasizes the benefits of this approach, including the ability to scale and handle multiple use cases without interference.

15:03

🔄 Data Flow and Processing

The speaker explains the data flow and processing within the streaming database, focusing on the compute layer's role in transforming time-varying collections. The layer's ability to handle CDC (Change Data Capture) streams and maintain state in memory for rapid access is discussed. The speaker also touches on the potential for writing results back to the storage layer for other compute instances to use, emphasizing the importance of determinism in ensuring correctness and performance.

20:05

🔄 Scaling and Consistency

The speaker demonstrates the scaling capabilities of the streaming database by showing how additional compute resources can be added or removed without interrupting the interactive experience. The ability to handle failures and maintain data integrity is highlighted, as well as the system's resilience to changes in compute resources. The speaker also shows how the streaming database ensures consistency across different compute instances, even when they are performing the same task, by using virtual time to synchronize data updates.

Keywords

💡Streaming Database

A streaming database is a type of database that can process and analyze data in real-time as it arrives, rather than in batches. It's designed to handle continuous data flow and is particularly useful for applications that require immediate insights, such as monitoring systems or real-time analytics. In the video, the speaker discusses Materialize as an example of a streaming database, highlighting its ability to perform SQL operations on streaming data.

💡Materialize

Materialize is a streaming database system that allows users to create tables, insert data, and run SQL queries in a manner similar to traditional databases, but with the added capability of handling streaming data. The speaker uses Materialize to illustrate the concept of a streaming database, demonstrating how it can maintain views and indexes that update in real-time as data changes.

💡SQL

SQL (Structured Query Language) is a standard language for managing and querying relational databases. In the context of the video, SQL is used to interact with the streaming database, allowing users to perform operations like creating tables, inserting data, and running select statements. The speaker emphasizes the importance of SQL in streaming databases, as it provides a familiar interface for users.

💡Distributed Database

A distributed database is a database that is spread across multiple locations or servers but appears to users as a single database. This concept is relevant in the video as it discusses the trade-offs between when work happens in a database: either as data is ingested or when a user asks a query. Distributed databases allow for scalability and can handle a large volume of data and queries efficiently.

💡Create View

In SQL, 'CREATE VIEW' is a command used to create a virtual table based on the result of a SELECT statement. In the video, the speaker explains that creating a view in a streaming database is a way to tell the database about a query of interest, which the database can then anticipate and prepare for, potentially improving performance when the view is accessed repeatedly.

💡Virtual Time

Virtual time is a concept used in database systems to manage concurrent operations and maintain consistency across different layers of the system. It assigns a timestamp to each event or operation, allowing the system to simulate the order in which events would have occurred if they were not happening concurrently. In the video, virtual time is crucial for coordinating the storage, compute, and adapter layers of the streaming database, ensuring that operations are processed in a consistent and predictable manner.

💡Decoupling

Decoupling refers to the design principle of separating components of a system to make them independent of each other, which can improve flexibility, scalability, and maintainability. In the video, decoupling is discussed in the context of the streaming database's architecture, where each layer (storage, compute, adapter) operates independently but is coordinated through virtual time, allowing for efficient scaling and management of complex operations.

💡Consistency

Consistency in a database context means that the data remains accurate and up-to-date across all components of the system. The video emphasizes the importance of consistency in streaming databases, particularly when multiple users or processes are accessing and modifying data simultaneously. The adapter layer in the streaming database uses virtual time to ensure that all operations appear to occur in a consistent order, even though they may be processed by different components at different times.

💡Scalability

Scalability refers to the ability of a system to handle increased workload by adding more resources, such as additional servers or computing power. In the video, scalability is a key feature of the streaming database, allowing it to grow and adapt to the needs of users without disrupting existing operations. The speaker demonstrates how the streaming database can scale by adding more compute resources without affecting the performance of ongoing tasks.

💡Cloud Native

Cloud native refers to applications or systems that are designed to take full advantage of cloud computing features, such as elasticity, scalability, and on-demand self-service. The video discusses the streaming database as a cloud-native solution, implying that it is optimized for deployment and operation in cloud environments, offering benefits like automatic scaling and simplified management.

Highlights

A streaming database is a modification to a traditional database that allows it to anticipate and perform work on behalf of the user based on data changes.

In a streaming database, the work can be done either as data is ingested or when a user asks a query, giving users control over when the work occurs.

The concept of 'create view' in SQL hints to the database that the user will frequently access the data, allowing the database to prepare and maintain the data efficiently.

Materialized views in a streaming database can be updated in real-time, providing a more interactive experience for users.

The trade-off in streaming databases is between pre-emptive work done during data ingestion (ongoing cost) and work done in response to queries (potentially faster response times).

A streaming database can materialize a lot of data with 'create view' statements, allowing users to continually monitor data changes without incurring additional costs.

The same SQL can be used for both querying and materializing data in a streaming database, provided by the same underlying execution engine.

A streaming database can maintain data up-to-date in proportion to the rate of change in the data, rather than the number of times it's viewed.

The concept of 'virtual time' is used in streaming databases to coordinate events and ensure consistency across different layers of the system without forcing synchronization.

The storage layer in a streaming database ensures durability by recording data changes with timestamps and maintaining a consistent view of the data.

The compute layer processes data flows, transforming input streams into output streams, and can maintain data in memory for fast access or write back to storage for broader consumption.

The adapter layer sequences SQL commands with timestamps, providing the appearance of a sequential total order and ensuring serializable isolation.

Scalable cloud-native streaming databases allow for the addition of more resources without disrupting existing use cases, providing a purely additive expansion.

Decoupling of layers in a streaming database allows for independent scaling and maintenance, improving performance and simplifying the system's complexity.

Virtual time enables the system to behave like a simulator, computing correct answers at each layer based on the timestamped events.

The storage layer's main challenge is ensuring durability, which involves writing down updates and being able to reproduce them exactly as recorded.

The compute layer's challenge is to process data flows efficiently, maintaining low latency and minimal resource usage while providing deterministic results.

The adapter layer's challenge is to provide consistency across the system by coordinating timestamps for SQL commands, ensuring the system presents as if it's a single, consistent entity.

Transcripts

00:00

What's a streaming database? Definitions vary, so I'm going to give you the very specific one I'll use for this talk: what we at Materialize think of as a streaming database. First and foremost, it is a database; "streaming" modifies "database". Materialize is a SQL database, and you do standard SQL database things with it: you connect to it, create a table, insert some rows into the table, run a select statement to read them back out, create some views, create some indexes, all those sorts of things.

00:29

But everything I've just described is very pull-oriented: you ask the database to do a thing, and it does that thing for you. It responds to your commands, and that's the basis on which the database takes action.

00:41

A streaming database is a modification to such a database where there's some framework for the database to leap into action on its own, to do work on your behalf, anticipating what you might need: some way for you to communicate to the database, "go and do some work when the data change, rather than when I ask about the data, to prepare results for me." That's the direction we're heading, and it's what's exciting about a streaming database: this new dimension of when work gets done.

01:11

A bit more specifically, with some concrete examples we'll build up: the thing I want you to think about with a streaming database is that users get to trade off when work happens. There are roughly two times that work might happen: either as data are ingested, or when a user asks a query. A user gets to control this; they get to decide, or guide the database, as to when this work should occur.

01:36

The way to think about this is through two commands that are pretty common in SQL; people use them a lot, and they guide the database as to when you would like the work to happen: CREATE VIEW and SELECT. Probably a lot of you know SELECT: here's a query, I want the answers right now, please give them to me, database. And the database leaps into action and does that work for you.

01:58

CREATE VIEW, which probably a lot of you also know, is telling the database: hello, I have a query that is so interesting I'm going to give it a name and introduce you, the database, to this query; we're going to have a nice long-lived relationship with it. That's a great hint to the database: I'm going to come back and ask about this view over and over again, so please remember it and anticipate that I'm interested in it.

02:22

The trade-off here, the intended value proposition if you will, is that the work done ahead of time (the work done in response to CREATE VIEW, at data-ingestion time) is an ongoing cost as the data change, but it's in some ways a prepayment for work that might need to happen when you issue a SELECT query. It's a bit more predictable, done in anticipation, so that when a SELECT query comes around you have a head start on producing the results. It can take a lot less time to get answers back that might otherwise have taken whole integer seconds (we'll see examples) or worse; you've got the answer ready to go, and as a consequence you can give a much more interactive experience.
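
A minimal sketch of the two commands being contrasted, using a hypothetical `orders` table (the talk doesn't show this particular example). The point is only that the same query text can either be run on demand or registered as a long-lived, named view:

```sql
-- On demand: all of the work happens when you press enter.
SELECT region, count(*) FROM orders GROUP BY region;

-- Registered ahead of time: the database now knows this query matters
-- and can prepare for it, doing work as the data change instead.
CREATE VIEW orders_by_region AS
    SELECT region, count(*) FROM orders GROUP BY region;
```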

03:04

There's also a really interesting dynamic here in terms of, I guess, economics. Rather than redoing the work over and over again in SELECT statements, you're keeping the work up to date, so you're doing work in proportion to the rate of change in the data rather than the number of times you look at it. In the limit, this is really exciting: you can materialize a lot of data with CREATE VIEW statements and then select from it every second, continually watching the data as it spills out. Rather than trying to dial your use of, say, Snowflake down from hourly to daily, you go the other direction: you just look at it every second, and it's not any more expensive to do that. Hopefully that prompts a bunch of people to think: I'd do something fascinatingly different if I were actually free to ask my questions more and more often.

03:48

It's important in this framework that we're talking about the same SQL in these two fragments. It's really not great if CREATE VIEW can only do some counts or fairly limited types of things; that's a bit of an uncanny valley. Either it's the same SQL, so you can both select from a view and materialize whatever you were going to select, or, if you can't do that consistently, it's just not as useful. It turns out that in Materialize, at least, these are exactly the same SQL: the same underlying execution engine handles both types of queries. The thing that produces the answer for you in the first place, and then keeps it up to date, is the same streaming, scale-out dataflow infrastructure.

04:27

All right, I'm going to show a demo, just to warm you up to what a streaming database is. I'm going to work through a hypothetical use case, totally made up, and show what it might look like to do all the work at query time, to maintain it all ahead of time, and why it might be good to do a little bit of both.

04:46

In the demo I log into Materialize, which presents itself through psql; it looks like Postgres. There's a source that I'm not going to create here; I've had it running for a little while, and I want all the data it has produced. It's an auction data generator with two relations of interest to us, auctions and bids, and it has produced a lot of them; we'll do some counts to see how much data is in there. Fundamentally, bids speak about auctions, and auctions have ending times.

05:17

Let me just pause this for a second so that we can see the numbers. We've got almost half a million auctions, but at any particular moment, if we restrict our attention to the auctions that have not yet expired, it's about a thousand or so; they turn over pretty quickly in this example, just to keep things moving. And we're going to end up with two million plus bids or so.

05:41

The use case we're going to put together (if you read ahead on the right side, you can see where we're going) starts with a view that collects active bids: bids associated with auctions that are still in flight, that haven't expired yet. Let me define a view for that, called active_bids: standard SQL, just a join between auctions and bids.
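
A sketch of what such a view might look like; the talk doesn't show the exact definition, so the column names here (`auctions(id, end_time)`, `bids(id, buyer, auction_id, amount)`) are assumptions:

```sql
CREATE VIEW active_bids AS
    SELECT bids.id, bids.buyer, bids.auction_id, bids.amount
    FROM bids
    JOIN auctions ON bids.auction_id = auctions.id
    WHERE auctions.end_time > mz_now();  -- keep only auctions still in flight
```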

06:02

Hopefully that makes sense: something is still at auction, things might change, and you get potentially very interesting up-to-date information, moment by moment.

06:14

What are we going to do with that? We're going to put together a new view on top of it that's meant to be useful to people. I've called it the outbids relation: anyone who has bid in a currently active auction might be interested to see the other bids that have outbid them. Who am I competing against? Is there just one other person with a relatively similar bid, or are there thousands of people I'm competing with, in which case maybe I should just go find something else to bid on, because it's not going to happen?

06:40

It's just another join: a self-join between active_bids and itself on the auction ID, where we restrict our attention to pairs in which the second bid has a greater value and a different bidder. That's the sort of interesting stuff we're going to show back to people: look, you've been outbid by all of these people.
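
Again a sketch under the same assumed schema, not the talk's exact SQL:

```sql
CREATE VIEW outbids AS
    SELECT mine.buyer, mine.auction_id,
           theirs.buyer AS outbid_by, theirs.amount AS outbid_amount
    FROM active_bids AS mine
    JOIN active_bids AS theirs ON mine.auction_id = theirs.auction_id
    WHERE theirs.amount > mine.amount    -- a higher bid...
      AND theirs.buyer <> mine.buyer;    -- ...from a different bidder
```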

07:01

So the plan is to offer this up to folks: a bidder shows up (it's going to be bidder 500) and repeatedly says, show me what's going on with the auctions I'm participating in; who am I losing to? (Sorry, my mastery of Google Slides is limited.)

07:21

We're going to do this three different ways. Materialize has this concept of clusters; think of them maybe as workspaces, environments where you can set up views or just pull data in through SELECT commands. We're going to start with one called the "pull" cluster, which just runs a SELECT: the entire join pipeline, a bunch of filtering, pulling in all those millions of rows. You can see that in this case it takes seven seconds or so. Not a great number, or at least you should hope for more; with a streaming database it's up-to-date information, which is really good, but I'm going to try to convince you that you should want more.

08:03

Why does it take so long? If you look at the query plan for this, it's big and gross: it does a whole bunch of joins in response to issuing the query. It's sort of understandable that it takes some time, since a bunch of work just happened on your behalf.

08:17

So let's do it a different way. We'll go to a different cluster, call this one "push", and essentially put together a dataflow materialization for the entirety of outbids. We're going to create an index here, which I'm calling outbids_by_buyer: an index on the entire outbids relation, indexed by buyer. This will make it very easy to show up and ask: for buyer 500, what are the results?
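
In Materialize this is an ordinary CREATE INDEX (the name and key column here are assumed):

```sql
-- Build and maintain the entire outbids relation in memory, keyed by buyer.
CREATE INDEX outbids_by_buyer ON outbids (buyer);

-- Point lookups are then served directly from the maintained index.
SELECT * FROM outbids WHERE buyer = 500;
```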

08:46

We'll run some queries here, and you can see that instead of seven seconds it's taking a few hundred milliseconds. This is actually an interesting consequence of Materialize providing strict serializability by default as its isolation level, which, if you're familiar with databases, is pretty strong; it doesn't get much better than that. We can dial it down, though, in the interest of performance: if we set the transaction isolation to just serializable, these numbers drop to sub-20-millisecond response times, handing back up-to-date information about who's outbidding buyer 500 in various auctions.

09:24

Why is it so fast? Here's the query plan now: it just reads the data out of an index. It's not a very complicated query plan; not much work happens when you ask the question, we just go and read the data out.
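
A sketch of the isolation change described here; Materialize exposes it as a session variable (exact spelling per their docs may differ):

```sql
-- The default is strict serializability; relax it for lower read latency.
SET transaction_isolation = 'serializable';
SELECT * FROM outbids WHERE buyer = 500;  -- now ~20ms against the index
```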

09:35

Well, that sounds great; why don't we just do that and materialize everything all the way? The problem, if you have your thinking hat on, is that the size of outbids for each auction is quadratic in the number of active bids: in every pair of active bids, one of them outbids the other. Which means that if you've got an auction with a thousand bidders in it, there are a million outbid elements being maintained continually. That feels sort of bad if no one's actually looking at them: a lot of compute needs to go on, and a bunch of memory is needed to keep all this stuff resident; a bit of a waste of resources if it's not the case that all bidders are looking at all moments.

10:14

So we'll do this a third way, a push-pull way, where instead of an index on outbids we build two indexes on active_bids (sorry, the video goes faster than I can talk): one by buyer and one by auction ID. When we go and run queries from there (give it a moment; it's actually still setting up the dataflows, which is why the first ones take a few hundred milliseconds), it's now on the order of 20 milliseconds to get the results back.

10:48

Now why is that? A very different thing happens in this query plan; thinking hat goes on again. If someone shows up and says "I'm buyer 500": amazing, we have an index on active_bids by buyer, so we can leap directly to their active bids. Each of those active bids names an auction, and we can use the other index, active_bids by auction ID, to leap directly to the relevant auctions. We're just leaping around in indexes; nothing is scanning data. All the data are in memory, indexed exactly the way we need, and we're just looking up information. However, we're only maintaining data in active_bids, which is linear in the number of input bids, as opposed to potentially quadratic, so we can do this with a much, much lower resource envelope and do the expansion of the data only when someone actually asks.
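
A sketch of the two indexes, under the same assumed schema:

```sql
-- Maintained state is linear in the number of bids, not quadratic.
CREATE INDEX active_bids_by_buyer   ON active_bids (buyer);
CREATE INDEX active_bids_by_auction ON active_bids (auction_id);

-- The quadratic expansion into outbids happens only for the buyer who asks.
SELECT * FROM outbids WHERE buyer = 500;
```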

11:39

And as a final trick for streaming databases: these are all SELECT queries, which show you answers in response to typing things and pressing enter. In Materialize, and in good streaming databases generally, you should be able to subscribe to these results. You get an answer that comes out, as well as a continually arriving change log: in this case, every second, how have the data changed, and in particular, if nothing has changed, that is clearly communicated back out. So this is data streaming in, changing query results, and streaming all the way back out to, in this case, buyer 500, who (or whose UI) is curious to see which auctions they need to pay attention to and which new buyers have entered the auction.
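
Materialize exposes this through SUBSCRIBE; a sketch of the pattern being described:

```sql
-- Emits the current results, then a timestamped change log of updates;
-- quiet periods still advance the timestamp, signalling "nothing changed".
SUBSCRIBE (SELECT * FROM outbids WHERE buyer = 500);
```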

12:19

So that's the streaming database. That took some amount of time to explain, but hopefully the reaction is: oh, that's fascinating, I'd love to hear more about streaming databases. In particular, you might be interested in a few more adjectives in front of "streaming database".

12:37

So we'll talk a bit now about scalable, cloud-native streaming databases, and there are two words on here I want to call out. One is "scalable", and I want to take a moment to say what I mean by that. For a lot of folks, scalable means more computers, more stuff. That's true; more computers will be involved in the story here. But what we're really interested in is more streaming database. People have had success with their streaming database: this was great, I've got more things I'd like to do with it, I would like to have more streaming database, but in a way that doesn't screw up the streaming database I already have. You want to give people more of the streaming database in a way that's purely additive. If I give you some more computers, and then you turn on a second use case and it tanks the first one, say your query latencies went from 20 milliseconds to 100 milliseconds, that's not good: thanks for the additional computers, but my first use case is now broken because of this sort of cross-contamination of work. So when I say scalable, I really mean figuring out how to give people who want to use streaming databases more and more of the streaming database, rather than just computers and bytes and stuff like that.

13:43

The other word on here that's important is "a": you want one streaming database with scalable coordination, not 27 of them. If I just told you, here's a different one for each of your use cases, enjoy, that defeats the purpose of having a database. The database exists so that lots of different use cases and users can use the same information, produce results that can be integrated together, and derive more value from that. A bunch of different independent silos of streaming databases is not going to solve the problems that organizations have.

14:10

That's where the tension comes in, of course: if these things have to work together but shouldn't interfere with one another, the easy solutions go out the window. We have to start to be a bit smarter; we have to have multiple steps.

14:22

So here are three steps. Looking at this, it's sort of painfully obvious: this is actually a fairly standard picture of what a cloud data warehouse looks like. There are going to be some differences, though, and we'll pay attention to the differences as we go.

14:39

There are three layers here, and they correspond to the three steps I want you to be thinking about. At the bottom there's a storage layer: data arrive into the storage layer, and the streaming database records the data there, in particular updates to the data, a constant arrival of changes going on second by second.

14:57

Recorded data are then provided up to a compute layer, and this is where we both compute and maintain views as the data change. This is a little different from a conventional data warehouse, in that this layer is much less ephemeral. In a lot of traditional data warehouses, queries come, get answered, and are retired, and you could go to a totally different instance next time. The value of this layer is actually in maintaining these views, and maintaining the data indexed so that it's readily accessible at very low latencies. Sometimes it's the state held at these compute nodes that's important; it's soft state, not hard state, so there isn't a complicated consistency question going on here, but the value is that they're holding on to something, in an indexed representation, that's ready to go this very moment.

15:42

And up top there's what we call the adapter layer, which is what interfaces with SQL. Users show up and say, I would like the experience of interacting with a single streaming database, and it provides the facade of that. It coordinates all the work underneath, and coordinates the interaction with each of these SQL connections, to make sure the system presents as if it's just one streaming database that magically everyone is able to use in a consistent and serializable, or strictly serializable, way.

16:11

We're going to break this down with a bit more detail now. This is the exciting slide coming up; it's not very different, but: ta-da. This is really the punch line: virtual time. It's a concept introduced back in the 1980s, in a 1985 paper by Jefferson. Many of you may not know what it is, and that's totally fine; not all of us were doing computer science back in the 80s. You may know of it as event time; it's very analogous to event time. Let me explain.

16:44

Back in '85 it was actually proposed as a database concurrency-control mechanism. The idea was that all interactions with events outside the system, in this case data updates and SQL commands, should have timestamps attached to them: virtual timestamps. The system's job, in essence, is then to behave like a simulator: let's imagine those events did happen at exactly those moments; what should the right answer be? If we had some updates at various times, and then someone showed up and said, I'd like to see the answer to my query at a very particular time, there's a right answer. Can the system go and compute it, as quickly as possible, and just produce that right answer for us?

17:22

What this does, which is really nice, is that taking this approach (great, let's timestamp everything, and let's compute correct answers at each of these layers) provides a really nice boundary between the layers. Storage assigns virtual times to all these updates, and its job is then to surface them upwards, consistently, repeatedly, and durably: here are timestamped updates for you, the compute layer, and potentially the adapter layer.

17:47

The compute layer consumes timestamped updates, and its job is to pretend as if all of its views updated instantaneously: to produce output updates whose timestamps correspond exactly to the timestamps of the inputs to the views, maintained in perfect correspondence, as if in virtual time they update instantaneously.

18:08

And finally the adapter layer sequences, essentially puts timestamps onto, all of the SQL commands that come through, to provide the apparent experience of a sequential total order. The timestamps are the things that provide the total order, and with a little bit of finesse the adapter layer is actually able to provide even more advanced properties than just a total order on all of these events.

18:31

All right, why do this? There are lots of different ways you could try to build concurrency control throughout a complicated system. The decoupling is really helpful: the fact that we're able to design and implement each of these three layers differently is really valuable. The techniques you use in each of these layers are very different, and the basis on which each of them is correct or performant is very different. Allowing each of them to be implemented separately is just very powerful: you can take experts in each of these domains and throw them at the problem, and the storage folks will use very different techniques to ensure durability than the compute folks will use to get performance, than the adapter folks will use to get the appearance of consistency.

19:12

What's especially nice about virtual times, though, is that while they coordinate the results, while they make sure that things happen at the same virtual time, they don't actually synchronize anything: they don't force execution to occur at exactly the same moments. The execution of all these components is decoupled; things happen as fast as they can happen, which is great. Three different sources can all operate at whatever rate they're receiving data; no one has to wait for anyone else. In the compute layer, if several views are being updated concurrently, they update as fast as they can update, and it's the availability of updates at these virtual times that moves the system forward. If you're up at the adapter layer and you ask about something that's good to go, you get your answer right away; and if you ask about a view that is on fire because you did some horrible cross join, you don't get an answer, but you haven't interfered with the rest of the system either.

20:00

So it's a really powerful decoupler, both from an architecture point of view and from a performance point of view. At the same time, at the end of the day, you get the experience of a single timeline, where all of the events in the system happen on that timeline, as if there were just one thread running the entire system for you. It's of course all a big lie, but that's the experience you get.

20:23

We'll have this slide light up a few more times, because it's such an important slide. I'm going to talk a little bit about each of the layers now, to give you a taste of what goes on in each of them. They're complicated; I said there would be three challenging steps, and they are indeed challenging, and smart people are hard at work at each of these layers as we speak. But I hope to convince you that they're tractable: you can work on them and continue to make progress on them, fortunately, without inheriting any of the complexity of the other layers.

20:54

For the storage layer, the main challenge is durability. Folks show up with updates from outside the system, and the storage layer needs to make sure to write them down and be able to reproduce them exactly as recorded, arbitrarily far into the future. Data arrive as change-data-capture streams; you can think of a Postgres replication log, or Kafka representations of the same. Basically, folks show up and explain how their data changed, in a way that is meant to be unambiguous, and we write it down as we see it. When we do that, though, we have some very opinionated things that we do: we put a timestamp on each of these updates, and the timestamps have to be carefully chosen. For example, if data were updated in a transaction, all those updates must appear to have exactly the same virtual timestamp; atomicity basically says I should not be able to see half of a transaction, so they happen at exactly the same virtual timestamp. There are some other ordering constraints as well.

21:47

Moreover, we've drawn a picture down here of one of these timelines; a "time-varying collection" is the name we use for them. You can see a bunch of update events that go on on this timeline, with time going from left to right.

22:00

But there are some other important moments here: the boundaries of this green box. You're also able to get two frontiers out of this information. There's a write frontier, which is the leading edge of where we're currently writing data down. Updates at the write frontier are not yet known: the data may still change at that time and beyond, into the future. So if you're asking what the right answer is, it's important to know that we can't quite tell you yet at that time or above.

22:28

Similarly, there's a read frontier trailing behind, collecting up all these updates, essentially rolling them up into a maintained snapshot, for reasons of bounded memory footprint, bounded storage, and efficiency. This prevents access (unless you allow it) into the far history of a collection. It's saying: at any time inside the green region, you can roll up all of the updates and get the current contents of your particular collection.

22:56

With both of those constraints in mind, durability is what the folks here work on, using exciting tools like CockroachDB and S3 and various other considerations, to make sure the data are durable.

23:09

Then there are the other layers. The compute layer, which is the one I'm personally most familiar with, is essentially the dataflow processing layer. It takes time-varying collections as inputs, applies various bits of transformation to them, and produces the corresponding time-varying collections as output: CDC streams in, CDC streams out, corresponding exactly to the input, as if all of the updates happened instantaneously.

23:33

There are some pictures here: a rich variety of dataflow operators that allow us to turn any SQL query into a nice streaming dataflow. Having done that, there are a few different things you can do with the results. I've put one here in purple: some of these things can be an index, state that stays in memory in the compute layer and is randomly accessible from the adapter layer up above. It's what gives you the millisecond-timescale access to data. But you could also take the results and write them back out to a materialized view; this CREATE MATERIALIZED VIEW is a command that we'll see in just a moment.

24:05

That puts the data back in the storage plane, which allows people in other compute instances to pull the data in. For example, if you're the first step in a data pipeline, cleaning things up, doing some denormalization, who knows what, you might have a whole bunch of downstream consumers that you don't want in your address space: they're doing crazy stuff, and you don't want them to crash or slow down your work, which is important to lots of people. So it makes sense to write the results of your views out to the storage layer, and have other people bring that data back into their own compute environments, isolated away from yours.
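
A sketch of the write-back path, with the view name assumed:

```sql
-- Results are computed once, written durably back to the storage layer,
-- and can then be read from any other cluster, isolated from this one.
CREATE MATERIALIZED VIEW active_bids_shared AS
    SELECT * FROM active_bids;
```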

24:36

That's where a materialized view would be relevant; but other than that, the compute layer should go as fast as possible: be as efficient as possible, use as little memory as possible, go fast. Durability is not a problem here, because of the deterministic nature of all these operators; determinism is what provides the correctness guarantees we need.

24:57

All right, the last layer, and this is a bit of a thinky one: the adapter layer. Fundamentally, what happens here is that those SQL commands that came in need timestamps. We've actually got a hundred people connected, all saying select this, insert that, create whatever. We need to put timestamps on these commands, and the timestamps we choose for those commands are what determine the apparent behavior of the entire system; the people observing the system are the folks connected through these sessions.

25:24

If you just put timestamps on commands, a good thing happens already: you get serializable isolation. The timestamps themselves, and our correct behavior in response to them, are an order on all of the events.

25:35

It's not so bad, though it turns out, if you're familiar with it, that serializable isolation has a bunch of really weird properties that make you say, no, that can't be true: the order doesn't need to correspond to real time. You would really like it to in a lot of cases; you'd really like that if you go and insert some data into a table and then read from the table, you should definitely see what you just inserted. To get this property, something slightly stronger than serializability, defined as strict serializability, the timestamps need to increase as real time moves forward.

26:07

So there are a few constraints this layer has in terms of installing timestamps; strict serializability is one of them, but here are some others. Say your query involves a table from the storage layer and an index from the compute layer. Each has these read and write frontiers, and for example, for a timestamp to be valid, it needs to be at least as big as all of the read frontiers; if that's not the case, we're not sure we'll actually give you correct data out. So you have to pick a sufficiently large timestamp to get correct results. But you might also want to be prompt: you want the result to come back right away, so you want to pick a timestamp that is not greater than or equal to any of the write frontiers. If you do that, you're in a position to get your result back pretty much right away; you certainly don't have to wait for things to happen.

You might want the timestamp to be at the right end of this timeline, to get the freshest possible results, but that's a little in tension with strict serializability, which says timestamps only ever move to the right: if you go all the way to the right, you've ruled out a bunch of options for the next person who asks a question. It's complicated and challenging, and there are some fun trade-offs here in providing a great experience to everyone, one that totally conceals the fact that there are actually 57 different computers doing different things all at once.
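Materialize surfaces this timestamp selection, if I recall the syntax correctly, through `EXPLAIN TIMESTAMP`, which reports the chosen query timestamp together with each input's read frontier ("since") and write frontier ("upper"); treat the exact form, and the `winning_bids` name from the earlier sketch, as assumptions on my part:

```sql
-- The chosen timestamp must be >= every input's read frontier (else the
-- answer could be wrong) and ideally < every write frontier (else the
-- query has to wait for data that hasn't arrived yet).
EXPLAIN TIMESTAMP FOR
SELECT * FROM winning_bids;
```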

So the challenge here is consistency: providing the appearance of consistency across a whole bunch of fundamentally independent things, coordinated only through virtual time.

All right, this slide: like I said, here it is again, because it's super important. These are the conventional three layers of a cloud data warehouse, I think, but coupled through virtual time we're able to have things continually change and continually update without losing our heads, without utterly losing track of what's going on in the system.

What I'd like to do now is show off a little of the scaling aspect of the scalable, cloud-native streaming database. There are a few things I said you're going to want out of such a thing, and I'm going to show off three of them that I think are pretty cool, things you wouldn't get if you just put Materialize onto one computer and said "good luck, enjoy".

We're going to look at the same setup as before. The left pane is going to be that same stream of updates, on the same cluster as before, showing us all of the changes to outbids for buyer 500. It's really not going to change throughout the course of the demo, so the only thing to notice on the left side of the screen is that stuff is happening; as long as it continues to happen, you should be pretty happy. I'm going to do most of the action over here on the right side.

The first thing we're going to do is just try to make a mess of everything. We're going to head over to the pull cluster; you might remember that's the place where queries go slow, because we have no built indexes and we're just re-running everything from scratch. We're going to start asking a bunch of questions over there, queries that take many whole seconds to answer, and observe that this does not contaminate the interactive experience going on over on the left-hand side. The left-hand side still continues to tick at the same rate, even though the right-hand side has decided it has ten seconds' worth of work to go and do.
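A sketch of what the right-hand side is doing; the cluster name `pull` and the `outbids` relation come from the talk, while the exact query is an illustrative stand-in for the expensive ad hoc questions being asked:

```sql
-- Pin this session's work to the un-indexed pull cluster.
SET CLUSTER = pull;

-- With no indexes to read from, this recomputes everything from scratch
-- and takes whole seconds, but only on the pull cluster's resources; the
-- cluster feeding the left-hand pane never sees this work.
SELECT buyer, count(*) AS times_outbid
FROM outbids
GROUP BY buyer;
```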

You can isolate the performance, maintaining the low-latency properties on one cluster, and allow analytic work to occur concurrently without jumping the queue or getting in the way of the interactive experiences. That's demonstration number one; ta-da. All right, great. We'll do a pause in the question section later.

The other thing we might do, and this requires just a bit of explanation: these clusters, these workspaces, are backed by compute resources, of course, but they can be backed by multiple replicas of the same work, essentially different compute resources performing identical computation.

So what we're going to do here is take the cluster that's powering the left-hand side and give it another replica; we're going to scale it up. It has a small replica now, and we're going to give it a medium replica. We've gone and created that, and it just stitches itself in (a notice comes back, but otherwise it just stitches itself in) and works alongside the first replica, which (sorry, I apologize, I should have warned you to watch for this) we've just dropped. So there's only one replica running now, a medium replica, and there was no interruption over on the left-hand side. We just rescaled from a small replica to a medium one while maintaining the interactive experience, with no interruption.
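A sketch of that rescale, using Materialize's replica syntax as best I remember it; the cluster name `push` and the replica names `r1`/`r2` are assumptions:

```sql
-- Add a second, larger replica; it hydrates while r1 keeps serving.
-- Both replicas compute the same deterministic dataflows at the same
-- virtual times, so their outputs deduplicate.
CREATE CLUSTER REPLICA push.r2 SIZE = 'medium';

-- Once r2 has caught up, retire the small one. Output never pauses,
-- because at every virtual timestamp at least one replica had the answer.
DROP CLUSTER REPLICA push.r1;
```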

This is one of the really powerful parts of compute being decoupled but still coordinated through virtual time: these two compute instances were doing exactly the same thing, so we could deduplicate their results, and as long as one of them is live and running we get fresh results spilling out, even as we fiddle with which one is doing the backing, what scale it is, all those sorts of things.

Now what we're about to do is drop the medium replica as well, so there are no replicas behind this cluster anymore; it's not running, it's stopped. You might say "oh, that sucks", but actually a good thing is happening over there, or at least as good a thing as can happen when you don't have any computers to do your work.

The stream that's coming out has paused. It knows that it does not know how the data changes in the next second, so it's holding its fire. It's not going to tell you "eh, I don't know, probably nothing happened"; it just waits until it actually gets the information it needs. Now we've reprovisioned a small replica behind it, cutting back over, and as soon as that comes online and gets hydrated, it picks up exactly where it left off and backfills all of the changes from the ten seconds or so for which we had no compute resources. So even in the case of failure or de-provisioning, you get the right information out. It's a behavioral hiccup, in that some time passes, but you don't get wrong data coming back. Without having to change your application, you can rely on getting correct information, slower or faster depending on the machines you have. All right, that was trick number two.
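And the failure half of the story, sketched with the same illustrative names as above:

```sql
-- Drop the last replica: no compute remains behind the cluster, so the
-- subscription pauses rather than guess; it will not claim "nothing changed".
DROP CLUSTER REPLICA push.r2;

-- Provision a replacement. After hydrating, it replays the inputs
-- deterministically and backfills every change from the gap with no compute.
CREATE CLUSTER REPLICA push.r3 SIZE = 'small';
```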

Trick number three: we're going to hop onto both the push cluster and the pull cluster and create materialized views that select out of the outbids relation. So we've got output one and output two: output one is from the pull cluster, output two is from the push cluster. These are the same view, computed two different ways on two different computers. The pull one has no indexes, so it's pulling in all the inputs and redoing the work over and over; the push one is reading out of an index and writing the results back into storage.
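A sketch of that setup; the `output1`/`output2` names and the `IN CLUSTER` placement follow my reading of the demo, and `outbids` is the relation from the talk:

```sql
-- The same logical view, maintained independently by two clusters:
-- "pull" recomputes from scratch, "push" reads from its index.
CREATE MATERIALIZED VIEW output1 IN CLUSTER pull AS
SELECT * FROM outbids;

CREATE MATERIALIZED VIEW output2 IN CLUSTER push AS
SELECT * FROM outbids;
```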

What I've said, and hopefully you can believe me, is that these are the same. So what we're going to do now is subscribe to the query that says "show me what's in the first one but not in the second one". That's what's happening right now: you're seeing all of those changes, which isn't necessarily convincing until I put the progress statements back in.
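And the check itself, sketched: `SUBSCRIBE ... WITH (PROGRESS)` is Materialize's way of streaming both the changes and statements that a timestamp is complete.

```sql
-- Rows in output1 that are missing from output2. With PROGRESS on,
-- silence is affirmative: each progress tick certifies "as of this
-- timestamp, the difference is still empty".
SUBSCRIBE (
    SELECT * FROM output1
    EXCEPT ALL
    SELECT * FROM output2
) WITH (PROGRESS);
```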

So Materialize not only doesn't show you anything; it confirms, second by second, that there is nothing in there. There is no millisecond at which there's a single record in one of these relations and not in the other, and that will be true forever: barring bugs on our part, these things are exactly equal, and virtual time is keeping them perfectly coupled. If these were two separate use cases, say one team figuring out who the auction winners are and how much money we should collect from them, and another team producing whose auctions have closed and how much money we should pay out, those numbers will always match perfectly. The imbalance between those dollar amounts will be zero for all time.

This is really powerful for teams building low-latency applications: two different people doing two different things, then bringing their data together, with the guarantee that it is as if their work was executed instantaneously and simultaneously against the same data. There's no more trying to figure out whose data is slightly out of date and patching that up with various bandages.

Great. It's the slide again, for the last time, just to recap, because like I said, it's super important. The three steps we have here are storage, compute, and adapter, and they're coupled through the use of virtual time. This is a really powerful, yet not fundamentally complicated, coordination primitive, one that lets us reduce the complexity of the overall system into steps that are challenging, for sure, but tractable by folks who are expert in each of these domains. And it lets you deliver an experience that, increasingly, I think is really cool: all of the really impressive stuff you can do, spinning up whole bunches of compute, teams of people who don't even know of each other's existence using the results of each other's data products through the storage layer, without having to reason about all the complicated things you'd normally reason about in a system with so many moving parts.

Yeah, that's what I have for you, so we're going to end here.



Related Tags
Streaming Databases, Materialize, Real-Time Processing, Scalability, Consistency, Virtual Time, Data Processing, Cloud Native, SQL Databases, Distributed Systems