Validating System Executions with the TLA+ Tools Markus A Kuppe, Microsoft

Markus Kuppe
5 May 202445:44

Summary

TLDRThe speaker from Inria and Microsoft discusses the use of TLA+ and its tool, TLC, for validating system executions, specifically in distributed systems. The talk revolves around the concept of Trace Validation, a technique that involves logging local events from each node of a distributed system, then combining these logs into a global log to maintain causal order. By comparing the behaviors derived from these logs to a high-level specification, developers can identify discrepancies between the system's intended and actual behaviors. The speaker shares experiences of applying Trace Validation to various systems, including etcd and CCF, where it has uncovered several bugs and improved system specifications. The process not only helps in validating existing systems but also guides the development of new features based on the refined specifications. The talk emphasizes the importance of TLA+ expertise and the effectiveness of Trace Validation in narrowing the gap between specifications and code, ultimately enhancing system reliability.

Takeaways

  • πŸ“š The speaker discusses the use of TLA+ and Trace Validation for validating system executions, emphasizing its effectiveness in narrowing the spec-to-code gap.
  • πŸ” Trace Validation is a technique that involves collecting log files from a distributed system and comparing them against a high-level specification to ensure correctness.
  • 🌐 The process begins with logging local events, which are then combined into a global log to maintain causal order, essential for validating system behaviors.
  • πŸ“ˆ The speaker shares experiences with seven different systems, highlighting how Trace Validation has uncovered discrepancies and bugs in real-world applications.
  • βš™οΈ The implementation of logging was straightforward in the discussed systems, requiring minimal changes to the existing codebase.
  • πŸ•’ The validation process is efficient, with the model checker exploring all possible combinations of actions due to the concurrent nature of TLA+.
  • πŸ”§ The speaker suggests that TLA+ expertise is necessary for effective use, but theε›žζŠ₯ (return on investment) is high, given the bugs and issues identified.
  • πŸ”— The use of logical distributed clocks, like vector clocks or Lamport clocks, is recommended for ordering log files, which can also aid in visualization and understanding of system interactions.
  • πŸ“‰ The speaker addresses challenges in applying Trace Validation, such as the need to refine high-level specifications to align with implementation details.
  • πŸ› οΈ Trace Validation serves as a complementary tool to other verification methods like fuzzing and chaos engineering, potentially guiding these processes by providing a notion of coverage.
  • βͺ The technique is not limited to new systems; it's also highly effective for verifying existing or 'brownfield' systems with pre-existing implementations.
  • πŸ”¬ The future work includes generalizing Trace Validation for broader application, integrating it with model-based testing, and automating the translation of counterexamples into system tests.

Q & A

  • What is the main focus of the talk?

    -The talk focuses on validating system executions with TLA+ (Temporal Logic of Actions) and how it can be used to close the spec-to-code gap in distributed systems.

  • What is the significance of maintaining a log file in each node of a distributed system?

    -Log files in each node help in recording local events. They are crucial for debugging incidents in production and for creating a global log to maintain causal order during testing.

  • How does Trace validation work in TLA+?

    -Trace validation in TLA+ involves creating a trace specification that reads the log file and generates a set of behaviors conforming to what is observed in the log file. The validation part then compares this set of behaviors to the set of behaviors defined by a high-level specification to check for intersections, indicating acceptance of the execution by the specification.

  • What is the role of the high-level specification in the validation process?

    -The high-level specification serves as a reference model that is compared against the behaviors derived from the log file. It helps in identifying whether the actual system execution conforms to the intended behavior as defined by the specification.

  • How does the speaker describe the experience with applying Trace validation to seven different systems?

    -The speaker describes the experience as effective, with Trace validation finding spec-to-code divergences in all seven systems and uncovering non-trivial bugs in real-world systems.

  • What is the importance of logging in the context of Trace validation?

    -Logging is essential for Trace validation as it provides the necessary data to create the global log file. The speaker suggests logging when messages are sent and received, and any observable node local state changes to ensure accurate validation.

  • What are the benefits of using a distributed clock for ordering log files?

    -A distributed clock, such as a vector clock or a Lamport clock, helps in causally ordering the log files from different nodes, which is necessary for maintaining the correct sequence of events during Trace validation.

  • What is the role of the TLA+ tools in narrowing the spec-to-code gap?

    -TLA+ tools, particularly Trace validation, are shown to be mature enough to help bridge the gap between high-level specifications and actual code implementations by providing a method to validate system executions against their specifications.

  • How does the speaker suggest improving the Trace validation process?

    -The speaker suggests refining the high-level specifications to bring them closer to the implementation, using model-based testing to generate diverse sets of behaviors, and possibly automating the translation of counterexamples into system tests.

  • What are the challenges faced when applying Trace validation to existing systems?

    -The challenges include the need for TLA+ expertise, the effort to map the specific system to the model, and the time spent on updating the TLA+ tools for better integration with existing systems.

  • How does the speaker view the future of Trace validation?

    -The speaker is optimistic about the future of Trace validation, believing it to be a valuable tool for narrowing the spec-to-code gap and improving the reliability of distributed systems.

Outlines

00:00

πŸ˜€ Introduction to TLA+ and Trace Validation

The speaker introduces the topic of validating system executions using TLA+ (Temporal Logic of Actions) and discusses the importance of closing the spec-to-code gap. They mention their initial skepticism about the question but later realized the need for a good answer, especially for those wanting to use TLA+. The talk focuses on distributed systems and the process of collecting local log files to create a global log, maintaining causal order, and using a trace specification to conform system behaviors to the log file. The validation process involves comparing the system's set of behaviors with a high-level specification.

05:02

πŸ” Trace Validation in Practice

The speaker shares experiences with trace validation, including its application to various systems like a multi-producer multi-consumer problem, a distributed termination detection algorithm, and others. They discuss how trace validation helped identify discrepancies and bugs in implementations, even when the developers were familiar with the specifications. The talk also covers the application of trace validation in spec-driven development and how it has been integrated into larger systems like etcd, a popular distributed key-value store.

10:04

πŸš€ Applying Trace Validation to Larger Systems

The speaker describes the process of applying trace validation to larger and more complex systems, such as the Confidential Consortium Framework (CCF) developed by Azure Research. They discuss the challenges of creating a specification for an existing implementation and how trace validation helped identify issues and refine the high-level specification. The talk also covers the discovery of data loss issues during trace validation and the iterative process of updating the specification to match the implementation more closely.

15:05

πŸ”— Aligning Traces with High-Level Specifications

The speaker explains the process of aligning execution traces with high-level specifications using TLA+. They discuss the need for introducing stuttering steps and action composition to handle the mismatch between asynchronous message delivery in the implementation and atomic token passing in the specification. The talk also covers the use of logical clocks like vector clocks or Lamport clocks to order log files and the importance of locking observable node-local state changes.

20:07

πŸ“ˆ Trace Validation as a Verification Tool

The speaker emphasizes the effectiveness of trace validation as a verification tool, highlighting its success in finding spec-to-code divergences in seven systems. They discuss the integration of trace validation into continuous integration/continuous deployment (CI/CD) pipelines to ensure that changes to systems like CCF adhere to the specification. The talk also explores the potential for trace validation to generalize beyond the systems it has been applied to and the possibility of combining it with other testing methodologies.

25:10

πŸ€” Trade-offs and Future Directions

The speaker discusses the trade-offs between using trace validation and other general-purpose programming language tools. They argue that trace validation offers a high return on investment, especially for brownfield projects with existing implementations. The talk also touches on the potential of using deterministic simulators to generate tests and subject implementations to automated validation. The speaker concludes by emphasizing the value of TLA+ in ensuring that an implementation is correct before development begins.

Mindmap

Keywords

πŸ’‘TLA+

TLA+, short for Temporal Logic of Actions, is a formal specification language developed by Leslie Lamport for the design and verification of concurrent and distributed systems. In the video, it is used to validate system executions, ensuring that they conform to a high-level specification. The speaker discusses how TLA+ has been instrumental in narrowing the gap between specifications and code, thereby enhancing the reliability of system implementations.

πŸ’‘Trace Validation

Trace validation is a method used to verify the correctness of a system's execution by comparing it against a formal specification. The process involves collecting logs from a system's operation and using them to generate behaviors that are checked against a TLA+ specification. The video emphasizes the effectiveness of trace validation in finding discrepancies between system executions and their specifications, as demonstrated through various case studies.

πŸ’‘Distributed Systems

Distributed systems are a type of system where components, known as nodes, communicate and coordinate with each other over a network to achieve a common goal. The video primarily focuses on distributed systems, discussing how TLA+ and trace validation can be used to ensure the correct behavior of these systems, which can be complex due to the asynchronous nature of communication and potential for concurrent operations.

πŸ’‘Causal Ordering

Causal ordering is a concept that ensures the sequence of events in a distributed system reflects the cause-and-effect relationships between them. The video mentions the importance of maintaining causal order when combining log files into a global log, which is essential for accurate trace validation and understanding the system's behavior over time.

πŸ’‘Raft

Raft is a consensus algorithm used for managing replicated logs in a distributed system. It provides a way to ensure that all nodes in the system agree on the same sequence of operations. The video discusses the application of trace validation to Raft-based systems, highlighting how it has helped identify issues and improve the reliability of implementations.

πŸ’‘Vector Clock

A vector clock is a data structure used to track the partial ordering of events in a distributed system. It is particularly useful for determining the causal relationships between events. In the context of the video, vector clocks can be used to timestamp log entries, which aids in the process of trace validation by providing a means to order events across different nodes.

πŸ’‘Model Checking

Model checking is a formal verification technique used to check whether a finite-state model of a system meets certain specifications, often expressed in temporal logic. In the video, model checking is used in conjunction with TLA+ to find potential issues in system specifications and to ensure that the implementation adheres to the desired properties.

πŸ’‘Spect to Code Gap

The 'spec to code gap' refers to the discrepancy between the formal specification of a system and its actual implementation in code. The video discusses efforts to close this gap through the use of TLA+ and trace validation, which help ensure that the system's implementation aligns with its formal specification, thereby reducing bugs and inconsistencies.

πŸ’‘Concurrency

Concurrency in system design refers to the ability of components to operate independently and, potentially, simultaneously. The video touches on the challenges of designing and verifying concurrent systems, where TLA+ provides a framework to model and validate the complex interactions that can occur without relying on a specific execution order.

πŸ’‘Implementation Bugs

Implementation bugs are errors or faults in the code of a system that cause it to behave differently from what is specified or expected. The video highlights how trace validation has been used to uncover such bugs in real-world distributed systems, emphasizing the importance of rigorous testing and validation in ensuring system correctness.

πŸ’‘Action Composition

Action composition in TLA+ is a technique used to combine two or more actions into a single step, which can be useful for modeling complex behaviors in a more concise way. The video discusses how action composition was used to handle cases where multiple events occur in a single step, such as when a node both deactivates and passes a token in a distributed system.

Highlights

The talk discusses the use of TLA+ and its tool, TLC, for validating system executions, specifically focusing on distributed systems.

The concept of Trace Validation is introduced as a method to narrow the spec-to-code gap in system development.

The importance of logging local events in distributed systems for incident analysis and testing is emphasized.

The process of collecting log files to create a global log and maintain causal order is explained.

The use of TLA+'s Trace Specification to conform system behaviors to observed log file events is described.

The discovery of multiple behaviors in a system due to incomplete information in log files is highlighted.

The validation process involves comparing the system's set of behaviors to a high-level specification.

Previous work by a group around Lamport using TLA+ and TLC to verify a cache coherence protocol is mentioned.

The application of Trace Validation in spec-driven development for systems like etcd and CCF is discussed.

The identification of bugs in real-world systems through Trace Validation is covered, including issues found in etcd and CCF.

The integration of Trace Validation into CI/CD pipelines for ongoing validation of system commits is proposed.

The tutorial on how to perform Trace Validation using EWD 998 as an example is provided.

The importance of logging send and receive events for Trace Validation is emphasized without needing a deterministic hypervisor.

The use of logical distributed clocks, such as vector clocks or Lamport clocks, for ordering log files is suggested.

The potential for TLA+ to function as a generator of traces for deterministic testing environments is explored.

The effectiveness of TLA+ in thinking abstractly about the system and ensuring correctness before implementation is highlighted.

The trade-offs between using TLA+ for Trace Validation and other general-purpose programming languages or tools are discussed.

The future work section suggests generalizing Trace Validation, refining specifications, and integrating with model-based testing or fuzzing.

Transcripts

play00:00

so yeah welcome I'm going to talk about

play00:02

validating system executions with TLA

play00:05

plus and when I say I it's a group of

play00:08

people at inria in various entities of

play00:13

Microsoft and the real title of my talk

play00:15

is obviously a homework assignment an

play00:18

answer to an homework assignment by our

play00:20

keynote speaker this morning how do we

play00:23

close the spect to code Gap uh in the

play00:26

past I guess I've been one of the people

play00:27

who has been a bit dismissive of this

play00:29

question question and but then I

play00:31

realized yeah uh we leave a lot of

play00:33

people sort of out there who want to do

play00:37

t plus but for which we don't have a

play00:38

good answer so the answer is well pretty

play00:43

simple pretty straightforward we have a

play00:45

distributed system and I will primarily

play00:47

talk about distributed systems

play00:49

here all of your notes are logging the

play00:52

local events anyway right each note

play00:54

maintains a loog file it's get gets

play00:56

written somewhere to local storage fine

play00:58

and if there's an incident in production

play01:00

or then you look at those lock files

play01:02

anyway so in testing what we do is let's

play01:05

just collect these lock files and

play01:08

combine them into one single Global lock

play01:11

now by some time stamp by some clock so

play01:14

that the causal order is maintained um

play01:17

in the log

play01:18

file and then in t+ the answer to

play01:22

everything is write another

play01:23

specification we have a trace

play01:26

specification uh called Trace that reads

play01:30

the loog file and creates the set of

play01:33

behaviors T that conform to what you see

play01:36

in the loog file so naively you might

play01:38

think there's just a single behavior in

play01:41

t this was just a single execution of

play01:43

your system but perhaps you weren't able

play01:46

to collect all the information of your

play01:49

distributed system so you had sort of

play01:51

holes in your loog file and you had to

play01:53

guess and then you end up with more than

play01:55

one behavior in T and the validation

play01:59

part then is well you have your

play02:01

beautiful handwritten specification

play02:04

perhaps reverse engineered from the code

play02:06

but you compare that its set of

play02:09

behaviors to the set of behaviors of uh

play02:12

trace and if the intersection is

play02:14

nonempty well then luckily this Behavior

play02:17

this execution was accepted by our high

play02:20

level specification yeah so you want to

play02:23

want to be in the intersection

play02:25

part um this idea isn't new there was

play02:28

this prior work actually goes back more

play02:31

than 20 years uh when a group around

play02:34

Lamport used t+ and then new TLC to

play02:40

verify um a cach coherence protocol and

play02:43

they did two things a they took

play02:45

behaviors coming out of TLC and replayed

play02:48

those behaviors in a hardware simulator

play02:51

yeah they were doing RTL design and I

play02:53

have no idea what what that actually

play02:55

means but they were able to replay those

play02:57

behaviors in the RTL simulator

play03:00

and simultaneously they were also

play03:02

extracting other behaviors or other

play03:04

executions coming out of this Hardware

play03:06

simulator and valid validated those

play03:09

against the t plus specification in

play03:11

other words Trace

play03:13

validation except perhaps that they did

play03:15

this in a concurrence system not a

play03:16

distributed one and the tooling the t

play03:19

plus tooling was pretty custom tailored

play03:20

to this particular effort in 2018 Ron

play03:25

presler wrote this beautiful little note

play03:28

about this technique and General yeah

play03:30

how can you relate um implementation

play03:33

level traces to t plus specifications

play03:36

which I believe is fair to say spark

play03:38

these two papers here and extreme

play03:40

modeling practice out of mongodb by

play03:43

Jesse Davis at all and Jordan Halterman

play03:46

who independently applied it in um in a

play03:49

different system which is now product of

play03:52

Intel I believe and Jesse and all they

play03:55

were a little bit less enthusiastic

play03:57

about Trace validation Jordan Hal I

play03:59

think was a bit more

play04:01

positive they had experience with two

play04:04

systems we now in this work here have

play04:06

experience with seven additional

play04:09

systems the first five actually systems

play04:12

where we did essentially spec driven

play04:14

development using Trace validation

play04:17

several years ago I applied Trace

play04:19

validation to the model uh multi

play04:21

producer multi consumer problem you have

play04:23

a set of producers a set of consumers in

play04:26

the middle there's a fixed size que they

play04:28

communicate through the Que if the Q is

play04:30

full producers have to wait if the uh Q

play04:33

is empty consumers have to wait and if

play04:35

you don't get your synchronization right

play04:38

specifically if you only use one single

play04:40

mutex or then your system and a some

play04:43

some configuration can can deadlock and

play04:46

uh a specification shows that you have

play04:48

to use two mutexes and Trace validation

play04:53

instantaneously um found the

play04:56

discrepancy well that's a concurrence

play04:58

system more recently I applied it to a

play05:01

distribution distributed termination

play05:03

detection algorithm ewd 998 wrot the

play05:06

spec first and then the implementation

play05:09

and even though I was intimately

play05:11

familiar with the specification as a

play05:13

matter of fact this is my running

play05:15

example in t plus workshops I still

play05:17

screwed up in the implementation yeah

play05:20

this token and you will later see what

play05:22

this token actually does it kept going

play05:24

around the ring long after the initiated

play05:27

attack the

play05:28

termination and that was found with

play05:30

Trace validation and then I thought Ah

play05:32

that's an easy fix so I came up with an

play05:34

easy quick fix which Trace validation

play05:36

showed was wrong so I had to do it

play05:39

again and at inria they

play05:42

implemented um a specification by

play05:44

Lamport of 2pc two-phase commit and the

play05:47

implementation had a bug that it used a

play05:50

list data structure instead of a set

play05:52

data structure so now if a resource

play05:55

manager would time out while the

play05:57

transaction manager is waiting for all

play05:59

the resource managers to answer and then

play06:01

the resource manager would subsequently

play06:03

re signal that it's ready it would be

play06:06

counted twice yeah and then the

play06:07

transaction manager could just declare

play06:10

Global

play06:11

commit in a key Value Store

play06:14

specification due to Murad here also

play06:17

implemented by in edena they had a bug

play06:22

in how they initially initialized the

play06:24

local snapshot at the beginning of the

play06:26

transaction again found by Trace

play06:28

validation and last but not least an

play06:31

implementation of vanilla raft based on

play06:34

Diego angaro

play06:36

specification um they made a deliberate

play06:39

mistake to do to wrongly do the the

play06:43

calculation of the quarum and in this

play06:45

particular case due to this broken

play06:47

arithmetic a candidate could receive a

play06:50

quarum even though it didn't have one

play06:52

right and then you would end up with

play06:53

multiple leaders in the same term and

play06:56

all of this was caught by Trace

play06:58

validation but again spec driven

play07:01

development and smaller systems I would

play07:05

say but we've also been applying it to a

play07:08

bigger system at CD I believe it has

play07:11

44,000 stars on GitHub so it's

play07:13

reasonable reasonably popular in the

play07:16

internet might even show up another

play07:18

track here of the

play07:19

conference has almost 900 contributors

play07:23

several releases so it's super mature

play07:25

and the Microsoft product group wants to

play07:28

bring a new feature to ET CD which is

play07:30

called a witness support essentially you

play07:32

bring down the operational cost of

play07:34

running the cluster but instead of using

play07:36

three nodes you only use two nodes plus

play07:39

a

play07:40

witness and they did Spectrum

play07:43

development so they specified the

play07:44

feature they even wrote a pen and paper

play07:46

proof yet to be mechanized with gaps and

play07:50

that's convincing to the ETD maintainers

play07:53

that the algorithm works but how they

play07:56

wondered how do we make sure that also

play07:58

the implementation of this new witness

play07:59

feature is correct and so we decided hey

play08:03

first we will add Trace validation to at

play08:05

CD yeah to vanilla at CD as it is right

play08:08

now that's currently being tracked in

play08:10

111 and it's nearing completion and then

play08:13

in

play08:14

133 uh the ra witness support will be

play08:16

added to uh to at CD based then also on

play08:20

the trace validation and one of the uh

play08:23

maintainers here the Rio thinks that

play08:27

Trace validation and TLC model checking

play08:29

sort of

play08:30

vital also along the way we didn't find

play08:33

a safety Buck inet CD with Trace

play08:35

validation unfortunately dang but we

play08:38

found at least an inefficiency yeah so

play08:41

if the database of some of the nodes has

play08:45

to be um reined so you have to lose or

play08:50

drop a suffix of the database at one Noe

play08:53

and it has to resync with the other

play08:55

nodes the node would go back too far

play08:58

yeah and you would wait waste Cycles to

play09:00

syn it back up again and that was found

play09:02

by the by by trans

play09:04

validation interesting little anecdote

play09:07

here um Fang

play09:09

90 who's I believe one of the founding

play09:12

members of etcd apparently spoke with

play09:15

fport and Jew in 2015 about ways to

play09:19

generate an implementation from a

play09:22

specification and Lamport and Drew were

play09:24

like yeah there have been a few of

play09:26

efforts but they weren't super

play09:28

successful and I'm pretty sure that Finn

play09:30

over there has a different opinion he

play09:31

will talk about it after me so we leave

play09:34

that for him but what um Lamport and

play09:36

Juice addressed it was hey look how

play09:39

about you write a mapping function that

play09:42

takes the implementation State and

play09:44

Compares that to the specification State

play09:47

yeah AKA Trace validation so here some

play09:49

eight years later we finally brought

play09:52

Trace validation to

play09:54

atcd and uh sort of the last experience

play09:58

report in my talk today here is the

play10:01

confidential Consortium framework CCF

play10:03

done by Azure research it's graph

play10:06

inspired it's not vanilla raft um it's

play10:09

raft inspired crash fall tolerant

play10:11

consensus it has Dynamic reconfiguration

play10:14

we learned all about Dynamic

play10:15

reconfiguration this morning and how

play10:17

hard it is it also has cryptographic

play10:21

guarantees built into the consensus uh

play10:24

layer and it is the off Foundation of

play10:27

some a Azure offerings

play10:30

it's perhaps not as mature as uh at CD

play10:34

but still reasonably mature has several

play10:36

releases yeah foundational offering um

play10:39

in aser and it's written in a highly

play10:41

efficient programming language here that

play10:43

everybody

play10:47

loves since

play10:49

2019

play10:50

um or after 2019 actually around 2022

play10:55

the team decided okay we have an

play10:57

implementation now we all want a

play11:00

specification I can only speculate about

play11:03

the reasons and I won't do that but

play11:05

somebody was tasked to write a

play11:07

specification and they essentially

play11:08

followed kelvin's outline here reverse

play11:11

engineer the source code talked to

play11:13

Engineers spent several months yeah

play11:16

figuring out what the system actually

play11:18

does and they model check that

play11:20

specification they found issues

play11:22

subsequently

play11:24

corrected and then in 2023 we thought hm

play11:28

perhaps we can do more let's try Trace

play11:31

validation yeah they happen to have a

play11:35

testing environment yeah some shim some

play11:38

test driver to run multiple nodes on a

play11:41

single computer and have them do things

play11:43

so we took those 15 or so tests they had

play11:47

and obtained the traces the executions

play11:50

coming out of the tests to bring the

play11:52

spec up to speed to update it and that

play11:55

was a great process initially we found

play11:57

lwh hanging things like okay okay in a

play12:00

specification you can only have a send

play12:01

an a pent entries message with a single

play12:04

entry but the implementation may send

play12:06

empty or Zero Entry entries or batches

play12:08

of

play12:09

entries it's okay to abstract it this

play12:12

way on the other hand if you think of

play12:13

batches those entries might be from

play12:15

different terms and suddenly your

play12:17

correctness might be a little bit

play12:18

different yeah we added more and more

play12:22

changes to the high level specification

play12:24

because we knew knew it wasn't like up

play12:26

to date and you can go on GitHub and

play12:29

look at the 20 or so pull requests we

play12:32

got a bit more ambitious later on um in

play12:35

the sense that we added the

play12:36

bootstrapping of notes to the

play12:38

specification of how the bootstrapping

play12:40

is done in the real system how the note

play12:42

membership Works related to the

play12:44

cryptographic guarantees and also how

play12:46

note retirement works and how long nodes

play12:48

have to stay around in order for the

play12:50

system to remain

play12:52

live and then we felt pretty pretty good

play12:55

about our specifications so that we

play12:57

decided now it's time to also do new

play12:59

features based on the specification so

play13:02

we sort of reversed from going backwards

play13:05

to forwards and started to add request

play13:09

propos request vote messages to the high

play13:11

level specification and once we found

play13:13

this worked works we implemented it um

play13:17

in the

play13:19

code later on we also experimented with

play13:22

different network guarantees at the spec

play13:24

level to see what kind of message

play13:26

channels the implementation actually

play13:29

needs what kind of guarantees it needs

play13:31

does it need order delivery or can we

play13:33

get away without order delivery we look

play13:35

at those kinds of questions but most

play13:39

surprisingly we found two data loss

play13:42

issues um during Trace

play13:46

validation so one was because we reuse

play13:50

or CCF reuses the term field in a pent

play13:53

entries messages and then due to some

play13:56

unlucky reordering of of append entries

play14:00

of stale append entries messages and

play14:02

negative append entries message yeah in

play14:04

other words some non-happy path the

play14:07

system could lose data the commit index

play14:09

could be Advanced past a quorum um and

play14:12

then if a note uh crashes data is lost

play14:17

the other one the second one was similar

play14:19

node locally reused its match index

play14:22

variable and that could also lead to

play14:25

unsafely advancing the commit index and

play14:27

again some nonha path

play14:31

Behavior but now the big question

play14:33

obviously is how is it possible that

play14:35

Trace

play14:36

validation yeah that we did starting

play14:39

from passing tests we started with

play14:42

passing tests those were green those

play14:44

tests didn't lose data and they

play14:45

obviously assured that the network

play14:47

doesn't lose data or that the system

play14:48

doesn't lose data well obviously if you

play14:51

do Trace validation you don't do it in

play14:54

isolation instead it's one more tool in

play14:57

your T Plus Engineering ing verification

play15:00

tool Bel you start with a trace that

play15:02

fails and then iteratively you update

play15:05

this high level specification or you

play15:08

find hey I will leave the specification

play15:11

a bit more abstract but in my Trace

play15:13

specification I will formally document

play15:16

this Divergence and I will show you what

play15:18

that actually means later and then at

play15:20

some point later you're at the point

play15:22

that your traces all validate yeah and

play15:25

that's great but you don't stop you keep

play15:29

going and now do full scale

play15:31

verification and that's when we found

play15:33

that oh TLC suddenly finds violations of

play15:35

the core correctness properties of our

play15:38

specification H so this could have been

play15:40

a speack buck yeah just because we find

play15:44

an issue at the level of the

play15:45

specification doesn't necessarily mean

play15:47

that the implementation is also

play15:49

vulnerable so we manually

play15:52

translated TLC's counter example into a

play15:56

new test or into new tests multiple test

play16:00

with which we confirmed that the Bucks

play16:02

are present also in the

play16:03

implementation bucks are great to show

play16:06

that H sorry tests are great to show

play16:08

that bucks are present not so much that

play16:10

there are no bucks right luckily t plus

play16:14

counter examples are typically

play16:16

relatively short so this exercise wasn't

play16:19

super hard but it would be nice to do

play16:21

this in an automated fashion and that's

play16:23

going to resurface in our uh in the

play16:25

future work section of this presentation

play16:27

here okay knowing that the

play16:30

implementation was vulnerable we then

play16:33

obviously came up with a fix at the

play16:35

level of the

play16:37

specification again did a hell of a

play16:39

model checking simulation and everything

play16:41

ideally you could also do ethereum

play16:42

proving at this point again before we

play16:45

rolled the fix in the

play16:50

implementation okay questions so

play16:53

far yeah

play17:02

like you're doing like like 10,000

play17:04

requests a second or staying up for like

play17:06

a couple weeks or something like that

play17:08

how do you test something like that with

play17:09

t where you need like a huge load or no

play17:12

no we don't need a huge load in t so a

play17:16

your specification is highly abstract

play17:19

and removes a lot of the detail that is

play17:21

irrelevant to finding these issues and

play17:23

like Kelvin said by default everything

play17:26

in TLA plus is sort of concurrent the

play17:29

model Checker will check all the

play17:30

possible combinations of these actions

play17:33

it sort of finds these things and you

play17:36

use the small scope hypothesis you don't

play17:39

test this with 40 notes but with three

play17:42

or five notes and then those bucks

play17:45

become more

play17:49

likely okay so I keep going because now

play17:52

it's time for the trace validation mini

play17:54

tutorial based on ewd

play17:57

998 this is really just an excert if you

play17:59

want to see the the long version go to

play18:02

this pull request where I try to

play18:04

summarize uh everything as best as

play18:07

possible in various

play18:09

commits so in order to do Trace

play18:11

validation you have to understand what

play18:13

the actual system is and it's this

play18:15

termination detection um algorithm Zack

play18:19

and other folks here in the audience are

play18:22

experts in this system now it's pretty

play18:25

easy you have a distributed system and

play18:28

they carry out some computation that

play18:31

involves sending messages back and forth

play18:34

so you send a message from node a to

play18:35

node B and note B receives the message

play18:39

and then sometime later if nodes run out

play18:42

of work they decide to deactivate yeah

play18:45

go idle until they receive another

play18:49

message yeah and now detecting

play18:52

termination here seems easy you just

play18:54

look if all the notes are deactivated

play18:57

but if you don't have a control play

play18:59

like in a cloud environment you somehow

play19:02

have to do this inside of a distributed

play19:04

system and that's where our token

play19:07

passing kicks in yeah we have some note

play19:10

who passes a token around a ring a

play19:12

topologic a logical topology of our

play19:16

notes and after multiple rounds of this

play19:18

token passing the initiator can infer

play19:22

based on the state of the token that the

play19:24

system has

play19:25

terminated yeah I'm glossing over lots

play19:28

and lots of detail detail here obviously

play19:29

and I don't expect you to fully grock

play19:31

the algorithm the main takeaway is we

play19:34

have three actions send receive and

play19:36

deactivate and they model asynchronous

play19:39

message delivery and we have two other

play19:42

action initiate token and pass token

play19:45

that model the synchronous or Atomic

play19:48

token passing in our high level

play19:51

specification

play19:53

okay in pictures the ex an execution of

play19:57

the system may look like this

play19:59

now we have six notes Here some of which

play20:01

are active some are inactive they send

play20:04

messages back and forth and this green

play20:07

thing that starts traveling around the

play20:08

ring now is our

play20:11

token and the RDR SDR is the local

play20:15

system blck yeah this sort of the event

play20:18

of actions that happen at each one of

play20:21

the

play20:21

notes okay now you know how this how the

play20:24

system approximately looks

play20:27

like when when and where do I lock in

play20:30

the

play20:30

implementation when and where do I keep

play20:32

systems

play20:35

stay I do this in only two places when a

play20:39

no send node sends a message and when a

play20:43

node receives a message and for no real

play20:46

reason I choose to use Json as my wire

play20:49

and lock uh format could have been any

play20:51

format right but when a node sends a

play20:53

message it writes it to the socket and

play20:55

since it's already serialized I might as

play20:58

well write it to to S3 out into my fancy

play21:01

logging

play21:02

library and receive messages

play21:06

similar so then what does this ominous

play21:10

Trace specification look like that reads

play21:13

the log file into t+ behaviors this is

play21:17

it yeah it only contains the real one is

play21:20

a bit longer I will show it

play21:22

afterwards we extend the original

play21:25

specification because there are so many

play21:27

useful definitions that we just when we

play21:28

reuse in our lowlevel specification we

play21:31

also extend Json and Vector clock

play21:34

because they have useful definitions to

play21:36

order and read our loog files and then

play21:40

the only other variable that this

play21:42

specification declares is the length

play21:46

variable which is our offset into the

play21:49

log yeah how many lines of the log have

play21:51

we already

play21:55

processed and then we have five actions

play21:59

here's only one is sent message that

play22:01

essentially checks if there are still

play22:03

more lock lines to be consumed yeah if

play22:06

length is in the domain of the lock we

play22:08

bump length and check if the event in

play22:12

the lock is a scent event if it's a

play22:14

scent event and the message that was

play22:17

sent its type is a payload PL message

play22:20

well then we know this has to be a sent

play22:22

message action of the high level

play22:24

specification ewd 998 Chen

play22:28

and the argument to that is the identity

play22:31

of our sender

play22:33

node and the angel send message subar

play22:37

makes sure that the variables of the

play22:38

high level specification actually

play22:44

change and then the last line it's

play22:46

technically not strictly necessary but I

play22:50

happen to Define sent message on the

play22:52

high level specification in the typical

play22:55

t plus fashion in other words I Ed

play22:58

existential quantification to select the

play23:00

receiver node yeah send messages defined

play23:03

there exists a node in the set of O

play23:05

nodes such it such that its inbox in the

play23:08

next state contains one more message and

play23:11

with a system of say six nodes we would

play23:14

have six successor States for the

play23:17

current state current state uh for a

play23:19

send message action but clearly we can

play23:22

do better we can prune away five of

play23:24

those six states to limit States space

play23:27

explosion

play23:28

and have this constraint down here do

play23:32

that yeah that's

play23:34

it so let's see ah this is what I wrote

play23:40

about

play23:42

okay this just way too small but I can

play23:46

make it bigger now I only have to find

play23:49

my

play23:50

mouse so here's our S no this is pass

play23:54

token this a send message yeah so this

play23:57

is send message the one we we had on the

play23:59

slide

play24:01

oh okay well you see the five actions

play24:04

here and they all look pretty much the

play24:06

same maybe I have to move around

play24:11

here send messages up here but it's the

play24:14

same pattern or repeat it all over right

play24:17

so let's see if we can now validate a

play24:21

trace that I previously

play24:24

recorded n TLC says oh no sorry no can

play24:29

do this Trace doesn't validate dang so

play24:32

what's the problem

play24:36

here where is

play24:38

this so the problem is that our lock

play24:41

file because we only lock send and

play24:43

receive events only contains the line

play24:47

that says note number two passes the

play24:49

token to note number

play24:51

one when the token is in transit on The

play24:54

Wire we have no control over the wire

play24:56

and sometime lat later there's a lock

play24:58

line that says note number one receives

play25:01

this token from note number

play25:03

two asynchronous message delivery yeah

play25:07

but in our specification token passing

play25:10

is

play25:11

atomic it arrives at the destination in

play25:14

a single step yeah so there is a

play25:17

mismatch here in terms of the alignment

play25:19

of the behaviors but TAA plus has us

play25:22

covered we just add a stuttering step

play25:25

into our um Trace Behavior here

play25:28

by adding a definition called is receive

play25:32

token which you can see doesn't change

play25:35

any of the

play25:36

variables if we go back to our

play25:40

specification if I find my

play25:43

pointer

play25:45

it's this guy here and it says unchanged

play25:49

variables yeah it does it leaves the

play25:51

high level um variables unchanged it

play25:54

just allows us to introduce a stuttering

play25:56

step you could have also pre process the

play25:59

lock and eliminate those lock lines but

play26:02

I think that's ugly I don't like

play26:06

that okay so let's

play26:10

add that

play26:12

conjunct n

play26:16

sorry to our specification and

play26:19

check

play26:21

again and still we we don't manage to uh

play26:25

validate the trace but I think we made

play26:28

progress we now are at step number five

play26:31

now we moved from two to five

play26:35

yeah and now I have to do this again and

play26:38

the problem here is okay previously we

play26:41

had to do had to introduce a stuttering

play26:44

step but now a note is only allowed to

play26:49

send a token if it's

play26:51

inactive but nowhere do we lock that

play26:53

noes deactivate that's also not in our

play26:56

lock yeah this is the one of the h that

play26:58

I mentioned

play27:00

earlier and in order to account for this

play27:02

we have to allow nodes to

play27:05

non-deterministically go to inactive

play27:08

while they pass the token and this is

play27:11

done by composing the high level action

play27:14

deactivate with the highle action pass

play27:17

token with the action composition

play27:20

operator that has recently been added to

play27:23

TLC or support of which has recently

play27:26

been implemented in TLC see so we see

play27:29

there exists an ion node such that i

play27:32

deactivates c. pass token it does this

play27:36

in one

play27:37

step

play27:39

and now if I check this new

play27:46

specification it

play27:48

passes

play27:50

um

play27:53

great so let's go back to here

play28:00

but what is it we actually

play28:02

verify when we do Trace validation we do

play28:05

the intersection right but TLC only has

play28:09

invariance and

play28:10

properties at least historically yeah

play28:13

safety and liveness

play28:14

properties this isn't a safety property

play28:17

what should be the safety property this

play28:18

is not a state level formula that we can

play28:20

check

play28:21

here you might be tempted to check

play28:25

deadlock but I don't know I guess I went

play28:27

to too fast the diameter in model

play28:30

checking was something like 150 but the

play28:33

number of distinct States was 156 or so

play28:36

because there were a few states in the

play28:38

state

play28:39

graph

play28:41

that uh due to

play28:43

nondeterminism which would violate our

play28:45

deadlock check so we would see spirous

play28:47

counter examples because these states

play28:49

don't have successor States and for the

play28:52

same reason we can't use a liveness

play28:53

property like eventually always the

play28:55

length of the loog is equal to the Leng

play28:58

or the length variable is equal to the

play29:00

length of the lock would also produce a

play29:02

spirous counter example what we really

play29:04

want to check is a CTL property there

play29:06

exists a path in the set of all

play29:08

behaviors such that the length variable

play29:11

is equal to the length of the lock but

play29:13

we can't express that in TLA

play29:16

plus the next best workaround if you

play29:20

want something like a counter example is

play29:23

this livess here that's a bit convoluted

play29:26

that I had to abbreviate to fit on the

play29:27

SL

play29:28

which essentially says for as long as we

play29:31

are not at the point where length is

play29:34

equal to the length of the lock TLC

play29:37

still has unexplored sucess unexplored

play29:39

States in its state

play29:41

CU it's very technical very

play29:44

messy but you get some candidate

play29:47

Behavior as a counter example but the

play29:49

real property that you want to check

play29:51

because as I will argue in a minute

play29:53

there are no counter examples here the

play29:56

real property to check is that that the

play29:58

longest behavior that TLC

play30:01

generated the diameter is equal to the

play30:04

length of the loog that says there

play30:06

exists a p a behavior and the set of

play30:08

other behaviors whose length is the

play30:10

length of the

play30:13

lock and

play30:15

then a violation

play30:19

to um Trace validation would be a graph

play30:23

that looks like can I find this is that

play30:26

Trace TR

play30:36

that looks

play30:38

like this

play30:40

here and if you scroll in there are many

play30:45

points obviously you want to do this on

play30:47

a bigger computer um screen rather um

play30:53

you see TLC finds all these Alternatives

play30:55

and some of which have been excluded

play30:57

those are the yellow States they have

play30:59

been excluded due to our last conjunct

play31:02

and sent message but perhaps those will

play31:05

be the real States the real path to

play31:09

fully match the lock and that's why I

play31:11

argue you always have to look at the

play31:13

graph rather than individual counter

play31:16

examples because those B counter example

play31:18

might be a red

play31:20

herring and by the way the t plus

play31:23

debugger can help you figure out where

play31:27

the validation fails because you can set

play31:30

a dedicated space uh break point and

play31:32

then it will hold the moment debugger

play31:34

will hold the moment it starts rejecting

play31:37

uh

play31:47

States okay so in the two ra based

play31:53

systems how much logging did we have to

play31:56

add

play31:58

in the case of at CD we added logging in

play32:00

11 places and in CCF we did so in 15

play32:04

places and it's really essentially STD

play32:06

out yeah there is no convoluted logging

play32:09

library that we make you buy into we

play32:11

also don't require some deterministic

play32:13

hypervisor it's really your system

play32:16

running in your environment just logging

play32:19

yeah we log when messages are sent and

play32:21

received because CCF is written in C++

play32:24

there are multiple places where messages

play32:26

are sent anded that's why there's this

play32:29

difference in number of lock places plus

play32:32

we also decided that we lock in our

play32:36

testing infrastructure when it

play32:39

deliberately loses a

play32:41

message we could have gone without but

play32:44

we found that the nondeterminism that we

play32:47

had to introduce in the trace

play32:49

specification brought the verification

play32:52

validation time up from seconds to a few

play32:54

minutes so we decided okay we will lock

play32:56

this into that to yeah really make uh

play33:00

model checking or Trace validation super

play33:02

fast in general lock when messages are

play33:05

sent and received yeah you won't have

play33:08

issues with inadvertent deadlocking your

play33:11

system because when you after things

play33:13

have been serialized before they get

play33:16

deserialized back into your system there

play33:18

are no locks anymore that can go could

play33:20

go into your way and then ideally

play33:23

additionally also lck any observable

play33:27

node local State

play33:28

changes the nodes going inactive in my

play33:32

ewd 998 example I could have also uh

play33:36

locked to STD out right they just

play33:39

decided for the purpose of this mini

play33:41

tutorial here to not do

play33:44

that also include primed and the

play33:47

unprimed variables to strengthen your

play33:50

matching and try

play33:53

to or only lock constant size variables

play33:58

yeah don't lock the content of your

play34:00

database you don't want it every state

play34:02

right the database to STD out that

play34:04

wouldn't be feasible right goes without

play34:08

saying in order to synchronize your

play34:10

clocks uh sorry synchronize or order

play34:13

causally order your lock files you can

play34:15

use a centralized clock if you have to I

play34:18

find it way more elegant to use logical

play34:22

distributed clock like a vector clock a

play34:24

Lamport

play34:25

clock um perhaps space is prohibitive

play34:29

because you have so many nodes then use

play34:31

a Lamport clock I would even argue you

play34:33

want this distributed clock even if you

play34:35

don't do Trace validation to just make

play34:37

sense of your lock files right if you

play34:41

want to do Vector clocks we have a

play34:43

vector clock module in the community

play34:45

modules it has a Java module override to

play34:47

make things fast we stole the code from

play34:50

the shivas project to sort the vector

play34:53

clocks and as a bonus if you have Vector

play34:57

clocks you can use Ivan's beautiful shiv

play35:00

Vis tool to visualize the nodes of your

play35:03

system and how they communicate Yeah you

play35:05

sort of get that for free don't have

play35:07

time for a demo but it's great and I'm

play35:10

pretty sure Ivan is willing to give you

play35:12

a demo if you want

play35:14

to so to wrap up I hope to have

play35:18

convinced you that the t plus tools are

play35:20

mature enough to narrow the Spectra code

play35:23

Gap yeah I hope we get an an a from Mark

play35:27

for our homework here because yeah as

play35:30

you've seen Trace validation found spect

play35:32

to code divergences in all seven systems

play35:36

we found

play35:37

non-trivial uh bugs in Real World

play35:40

Systems it was also beautiful to reverse

play35:43

engineer CCF yeah updating the

play35:46

specification based on a trace perhaps

play35:49

beautiful as to to optimistic I was

play35:52

staring at this one debugger and this

play35:54

other debugger and trying to compare

play35:55

values perhaps not beauti

play35:58

but it was at least better than just uh

play36:00

throwing arrows in the

play36:02

dark and more importantly from now on

play36:04

forward cicd runs this Trace validation

play36:08

on every commit so if some non-la plus

play36:10

engineer changes CCF in a way that

play36:13

breaks the specification we will likely

play36:15

find it clearly this stuff requires t

play36:18

plus

play36:19

expertise but if you don't have a high

play36:22

LEL specification or if you don't have a

play36:23

t plus expert you don't have a t plus

play36:26

high level specification right so PR

play36:28

validation kind of becomes a mood

play36:31

anyway and I'm kind of proud that this

play36:34

most recent pull request

play36:37

6119 that was opened last week was

play36:40

opened by an engineer he's pretty clever

play36:43

but he only had two days of TL plus

play36:45

training and this pull request brings T

play36:47

TR validation to another specification

play36:49

at

play36:51

cclf okay what's

play36:53

next well does trace validation

play36:56

generalize I think we now have have nine

play36:58

or so systems where it had provided

play37:02

value one thing that you could argue is

play37:05

well most of them were raft and more

play37:08

importantly the raft specification is

play37:10

super low level from the perspective of

play37:12

t plus yeah if you look at the pexa

play37:15

specification it's way more high level

play37:17

and one might wonder if you would still

play37:20

find so many issues with a more high

play37:22

level

play37:23

specification but at least in the case

play37:25

of raft it works and in TLA plus we can

play37:28

always refine specifications even if you

play37:30

start with a high level spec could

play37:32

gradually refine it to bring it closer

play37:34

to the implementation then do Trace

play37:37

validation we haven't really done this

play37:39

at scale yet AKA

play37:42

production that's something to be done

play37:45

and in order to handle this case with

play37:47

the uh the drop messages we implemented

play37:50

a true depth for search mode in

play37:52

TLC that uh made things significantly

play37:56

faster because if you think about it if

play37:58

you only check the intersection it

play37:59

suffices to find one behavior in the

play38:02

intersection then you can stop you don't

play38:03

have to exhaustively change search the

play38:05

state

play38:07

space what's missing here is sort of the

play38:10

companion

play38:11

of um Trace validation which I would say

play38:15

call like something like model based

play38:16

testing that helps you generate diverse

play38:19

sets of behaviors yeah and

play38:22

ideally it would also be able to take

play38:25

counter examples and Translate those

play38:28

into system tests so that you don't have

play38:31

to do this manual translation if the

play38:33

high level verification finds there's

play38:35

something off alternatively you can

play38:37

obviously try fuzzing or chaos

play38:40

engineering to penetrate the dark

play38:42

corners of your implementation State

play38:44

space and with Trace validation you at

play38:47

least have a notion of coverage now

play38:50

because you can look at the spec

play38:52

coverage okay thank you

play38:57

[Applause]

play39:00

thank you do we have

play39:08

questions H it was interesting to see

play39:10

seven out of seven there um so uh kind

play39:14

of to Mark's Point earlier does this

play39:16

starting to feel like something where

play39:19

it's as important to do in terms of

play39:21

rather we're saying hey it's really

play39:23

important to do this spec but handwave

play39:24

handwave not so important to that's

play39:26

going to translate easily so where's

play39:27

your where's your feeling on that of

play39:30

import level of importance after seeing

play39:32

again seven out of

play39:34

seven yeah I think I'm surprised how

play39:37

effective it is um the kind of return on

play39:40

invest here seems to be pretty high um

play39:43

CCF I believe I worked a couple of

play39:46

months on it but majority of the time

play39:50

was spent on updating the t plus tools

play39:53

um if an if a system expert would have

play39:56

written the tra specific it would have

play39:58

been faster I sort of had to understand

play40:00

the system

play40:03

also thanks for a great talk Marcus I

play40:06

was wondering

play40:08

about you know like you've described J

play40:10

validation it's it's I essentially I see

play40:12

it as a onetime cost effort for trying

play40:15

to map like a specific system to a

play40:17

specific model but then there's also

play40:19

these like more General funny language

play40:21

tools like pigo and things that people

play40:23

have been working on that are trying to

play40:25

basically raise the level or you know

play40:27

decrease low of abstraction if you will

play40:29

on the model side and have a kind of a

play40:31

more mechanized mechanized path to

play40:34

implementation I just wondering about

play40:35

the the trade-offs between these and

play40:37

maybe you could speak to you know like

play40:40

when is

play40:41

trash yeah just maybe maybe more

play40:44

abstractly about about Trace validation

play40:46

and maybe pros and cons as compared to

play40:48

that yeah I think I mean it's a great

play40:50

idea to generate the implementation from

play40:52

the specification of that works but I

play40:54

think a current constraint of that say

play40:57

that it has to be a green field project

play40:59

and this was also of Brownfield yeah you

play41:02

have an existing implementation and I

play41:04

think that's where Trace validation sort

play41:06

of has the biggest return on invest um

play41:10

yeah Green Field I would probably try to

play41:11

do something like P go first generate as

play41:14

much as possible be principled in the

play41:16

way how I implement the system and have

play41:18

to manually do it and then would

play41:20

probably get

play41:22

yeah would feel better about the

play41:24

implementation if I would apply that

play41:26

technique

play41:30

additional

play41:35

questions okay y have to decide that du

play41:38

it

play41:41

out Hey so uh recently the whole thing

play41:46

about like

play41:49

subjecting um distributed systems to I

play41:52

guess sort of chaos engineering and

play41:54

traces and stuff uh got shaking up with

play41:58

the publication of the antithesis system

play42:01

I don't know if you heard of antithesis

play42:03

yeah it they basically made a fully

play42:06

deterministic virtual machine um mongod

play42:09

DB worked with them and they said it

play42:10

worked really well um and then there was

play42:13

also something kicking around for the

play42:15

past decade with like um Jepson and

play42:18

stuff which also did something similar

play42:21

have you thought about how like hey

play42:24

maybe t+ could function as like the

play42:27

genitor of traces or something for

play42:29

either like Jepson or I mean I don't

play42:31

think anyone really has access to

play42:33

anthesis yet but something like that

play42:35

that would be my my uh my argument about

play42:39

antithesis um but I think generally all

play42:41

of these systems are in these three dots

play42:43

here fuzzing chaos engineering

play42:45

deterministic hypervisors all the kinds

play42:48

of testing environments you have it's

play42:50

great that there's not like a product

play42:51

out there I think this has been the same

play42:53

ideas has been done multiple times for

play42:55

multiple specific project projects um

play42:58

implemented over and over and over again

play43:00

um I still so from my perspective TLC

play43:04

TLA plus still has the beauty of making

play43:06

you think abstractly about the system

play43:09

yeah like I said I has I have been

play43:11

dismissive of this question how do you

play43:12

implement the system because to some

play43:15

degree I keep saying it's easier to

play43:16

implement something if you know that it

play43:18

works and

play43:19

t+ gives you this guarantee that it

play43:22

works before you set out to implement it

play43:24

and my implementation of ewd 998

play43:27

I mean initially I tried to implement ew

play43:30

ewd 998 without the specification oh I

play43:33

had D plus paper I had very little time

play43:36

I would just implement this and then for

play43:37

a day it's like scratching my head how

play43:39

to do this here in an imperative

play43:41

language and I don't really know how the

play43:43

algorithm works and I essentially gave

play43:45

up and then I wrote this spec and then I

play43:47

still screwed up once but the

play43:49

implementation was done in three hours

play43:50

and it was to 99% correct and this is

play43:54

the value of t plus that you know what

play43:56

you implement

play43:57

and everything else is the icing on the

play44:00

cake we have one more question um maybe

play44:03

we can have Finn set up

play44:05

well so um I was surprised to see

play44:08

several of the specifications were

play44:10

shared memory specifications so your

play44:13

experience is it doesn't matter that

play44:16

much that

play44:18

um specification is message passing well

play44:22

obviously all the implementations are

play44:24

message passing but it doesn't matter

play44:26

that much that the specification is uh

play44:30

um message passing as long as you can

play44:33

find a good uh refinement mapping right

play44:37

so if the specification in its way how

play44:41

it what it assumes about message

play44:42

communication deviates from the

play44:44

implementation you have more work on the

play44:46

trace specification but there are ways

play44:49

stuttering and action composition to

play44:51

deal with that perhaps the biggest

play44:53

example where we used uh the action

play44:55

composition was where and where raft

play44:58

piggybacks term updates on top of

play45:01

ordinary messages and That Was Then

play45:03

always composed into one single message

play45:06

update term C dot or the receive

play45:10

actions great talk and one more thing

play45:12

about the deterministic simulators I

play45:15

think the Marcus already mentioned from

play45:19

the spec you can generate test and using

play45:21

the deterministic simulators uh subject

play45:25

the implementation to the same T in a

play45:29

basically automated way and I think

play45:31

that's a great

play45:35

opportunity let's get give another hand

play45:38

for Marcus thank you for your talk

Rate This
β˜…
β˜…
β˜…
β˜…
β˜…

5.0 / 5 (0 votes)

Related Tags
TLA+Trace ValidationDistributed SystemsSystem ExecutionsSpecification GapInriaMicrosoftDistributed ConsensusRaft ProtocolConcurrency IssuesModel CheckingSoftware EngineeringBug DetectionHigh-Level SpecificationsImplementation VerificationDeterministic SimulationCCDFChaos EngineeringFuzzing