Validating System Executions with the TLA+ Tools Markus A Kuppe, Microsoft
Summary
TL;DR: The speaker, presenting joint work by groups at Inria and Microsoft, discusses the use of TLA+ and its model checker, TLC, for validating system executions in distributed systems. The talk revolves around Trace Validation, a technique that involves logging local events on each node of a distributed system and combining these logs into a single global log that preserves causal order. By comparing the behaviors derived from these logs against a high-level specification, developers can identify discrepancies between the system's intended and actual behaviors. The speaker shares experiences of applying Trace Validation to several systems, including etcd and CCF, where it has uncovered bugs and improved system specifications. The process not only validates existing systems but also guides the development of new features based on the refined specifications. The talk emphasizes the importance of TLA+ expertise and the effectiveness of Trace Validation in narrowing the gap between specifications and code, ultimately enhancing system reliability.
Takeaways
- 📚 The speaker discusses the use of TLA+ and Trace Validation for validating system executions, emphasizing its effectiveness in narrowing the spec-to-code gap.
- 🔍 Trace Validation is a technique that involves collecting log files from a distributed system and comparing them against a high-level specification to ensure correctness.
- 🌐 The process begins with logging local events, which are then combined into a global log to maintain causal order, essential for validating system behaviors.
- 📈 The speaker shares experiences with seven different systems, highlighting how Trace Validation has uncovered discrepancies and bugs in real-world applications.
- ⚙️ The implementation of logging was straightforward in the discussed systems, requiring minimal changes to the existing codebase.
- 🕒 The validation process is efficient, with the model checker exploring all possible combinations of actions due to the concurrent nature of TLA+.
- 🔧 The speaker notes that TLA+ expertise is necessary for effective use, but the return on investment is high, given the bugs and issues identified.
- 🔗 The use of logical distributed clocks, like vector clocks or Lamport clocks, is recommended for ordering log files, which can also aid in visualization and understanding of system interactions.
- 📉 The speaker addresses challenges in applying Trace Validation, such as the need to refine high-level specifications to align with implementation details.
- 🛠️ Trace Validation serves as a complementary tool to other verification methods like fuzzing and chaos engineering, potentially guiding these processes by providing a notion of coverage.
- ⏪ The technique is not limited to new systems; it's also highly effective for verifying existing or 'brownfield' systems with pre-existing implementations.
- 🔬 The future work includes generalizing Trace Validation for broader application, integrating it with model-based testing, and automating the translation of counterexamples into system tests.
Q & A
What is the main focus of the talk?
-The talk focuses on validating system executions with TLA+ (Temporal Logic of Actions) and how it can be used to close the spec-to-code gap in distributed systems.
What is the significance of maintaining a log file in each node of a distributed system?
-Log files in each node help in recording local events. They are crucial for debugging incidents in production and for creating a global log to maintain causal order during testing.
How does Trace validation work in TLA+?
-Trace validation in TLA+ involves creating a trace specification that reads the log file and generates a set of behaviors conforming to what is observed in the log file. The validation part then compares this set of behaviors to the set of behaviors defined by a high-level specification to check for intersections, indicating acceptance of the execution by the specification.
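As an illustration only (this is a toy Python sketch, not the actual TLA+ tooling), the idea can be reduced to three pieces: a "spec" with a few actions, a log that may contain holes, and a check that the set of behaviors consistent with the log intersects the set of behaviors the spec accepts. The counter spec and its action names are invented for this example.

```python
# Toy trace validation: check that at least one behavior consistent with
# an (incomplete) event log is accepted by a high-level specification.
# The "spec" here is a hypothetical counter that may grow by 1 or 2.

SPEC_ACTIONS = {
    "inc1": lambda s: s + 1,
    "inc2": lambda s: s + 2,
}

def behaviors(log, state=0):
    """Yield the final states of all behaviors conforming to the log.
    A log entry of None is a hole: any spec action may have occurred."""
    if not log:
        yield state
        return
    head, *rest = log
    candidates = SPEC_ACTIONS if head is None else {head: SPEC_ACTIONS[head]}
    for name, action in candidates.items():
        yield from behaviors(rest, action(state))

def validate(log, accepted_final_states):
    """Trace validation: non-empty intersection between the behaviors
    derived from the log and the behaviors the spec accepts."""
    return any(s in accepted_final_states for s in behaviors(log))

# A complete log yields exactly one behavior; a hole yields several.
assert validate(["inc1", "inc2"], {3})
assert validate(["inc1", None], {2, 3})      # hole: inc1 or inc2 occurred
assert not validate(["inc2", "inc2"], {3})
```

A hole in the log multiplies the candidate behaviors, which is exactly why an incomplete trace can still validate as long as one completion is accepted by the spec.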
What is the role of the high-level specification in the validation process?
-The high-level specification serves as a reference model that is compared against the behaviors derived from the log file. It helps in identifying whether the actual system execution conforms to the intended behavior as defined by the specification.
How does the speaker describe the experience with applying Trace validation to seven different systems?
-The speaker describes the experience as effective, with Trace validation finding spec-to-code divergences in all seven systems and uncovering non-trivial bugs in real-world systems.
What is the importance of logging in the context of Trace validation?
-Logging is essential for Trace validation as it provides the necessary data to create the global log file. The speaker suggests logging when messages are sent and received, and any observable node local state changes to ensure accurate validation.
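A minimal sketch of such event logging in Python; the JSON-lines format mirrors what the talk describes for the EWD998 example, but the field names here are my own invention, not the talk's exact schema.

```python
import io
import json
import threading
import time

_log_lock = threading.Lock()

def log_event(logfile, node, event, msg):
    """Append one JSON line per observable event (send/receive).
    Since the message is already serialized for the wire, writing it
    to the log costs almost nothing extra."""
    line = json.dumps({"ts": time.time(), "node": node,
                       "event": event, "msg": msg})
    with _log_lock:
        logfile.write(line + "\n")

# Demo: log a send on node 2 and the matching receive on node 1.
buf = io.StringIO()
log_event(buf, node=2, event="snd", msg={"type": "pl", "dst": 1})
log_event(buf, node=1, event="rcv", msg={"type": "pl", "src": 2})
records = [json.loads(l) for l in buf.getvalue().splitlines()]
assert [r["event"] for r in records] == ["snd", "rcv"]
```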
What are the benefits of using a distributed clock for ordering log files?
-A distributed clock, such as a vector clock or a Lamport clock, helps in causally ordering the log files from different nodes, which is necessary for maintaining the correct sequence of events during Trace validation.
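For illustration, here is a minimal Lamport clock in Python, with per-node log entries merged into a causally consistent global log by sorting on the pair (clock, node). This is a sketch, not the talk's implementation; concurrent events are ordered arbitrarily but deterministically.

```python
# Each node stamps its log lines with a Lamport clock; merging the
# per-node logs sorted by (clock, node) yields a global log in which
# a receive is never ordered before its matching send.

class LamportClock:
    def __init__(self):
        self.t = 0
    def tick(self):                 # local event or send
        self.t += 1
        return self.t
    def recv(self, sender_t):       # receive: take the max, then advance
        self.t = max(self.t, sender_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
log = []
t = a.tick(); log.append((t, "A", "snd"))   # A sends at clock 1
t = b.recv(1); log.append((t, "B", "rcv"))  # B receives, clock jumps to 2
t = b.tick(); log.append((t, "B", "snd"))   # B sends at clock 3
global_log = sorted(log)                    # (clock, node) total order
assert [e[2] for e in global_log] == ["snd", "rcv", "snd"]
```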
What is the role of the TLA+ tools in narrowing the spec-to-code gap?
-TLA+ tools, particularly Trace validation, are shown to be mature enough to help bridge the gap between high-level specifications and actual code implementations by providing a method to validate system executions against their specifications.
How does the speaker suggest improving the Trace validation process?
-The speaker suggests refining the high-level specifications to bring them closer to the implementation, using model-based testing to generate diverse sets of behaviors, and possibly automating the translation of counterexamples into system tests.
What are the challenges faced when applying Trace validation to existing systems?
-The challenges include the need for TLA+ expertise, the effort to map the specific system to the model, and the time spent on updating the TLA+ tools for better integration with existing systems.
How does the speaker view the future of Trace validation?
-The speaker is optimistic about the future of Trace validation, believing it to be a valuable tool for narrowing the spec-to-code gap and improving the reliability of distributed systems.
Outlines
😀 Introduction to TLA+ and Trace Validation
The speaker introduces the topic of validating system executions using TLA+ (Temporal Logic of Actions) and discusses the importance of closing the spec-to-code gap. They mention their initial skepticism about the question but later realized the need for a good answer, especially for those wanting to use TLA+. The talk focuses on distributed systems and the process of collecting local log files to create a global log, maintaining causal order, and using a trace specification to conform system behaviors to the log file. The validation process involves comparing the system's set of behaviors with a high-level specification.
🔍 Trace Validation in Practice
The speaker shares experiences with trace validation, including its application to various systems like a multi-producer multi-consumer problem, a distributed termination detection algorithm, and others. They discuss how trace validation helped identify discrepancies and bugs in implementations, even when the developers were familiar with the specifications. The talk also covers the application of trace validation in spec-driven development and how it has been integrated into larger systems like etcd, a popular distributed key-value store.
🚀 Applying Trace Validation to Larger Systems
The speaker describes the process of applying trace validation to larger and more complex systems, such as the Confidential Consortium Framework (CCF) developed by Azure Research. They discuss the challenges of creating a specification for an existing implementation and how trace validation helped identify issues and refine the high-level specification. The talk also covers the discovery of data loss issues during trace validation and the iterative process of updating the specification to match the implementation more closely.
🔗 Aligning Traces with High-Level Specifications
The speaker explains the process of aligning execution traces with high-level specifications using TLA+. They discuss the need for introducing stuttering steps and action composition to handle the mismatch between asynchronous message delivery in the implementation and atomic token passing in the specification. The talk also covers the use of logical clocks, such as vector clocks or Lamport clocks, to order log files, and the importance of logging observable node-local state changes.
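A small Python sketch of the alignment idea described above: the implementation logs the token send and receive as two events, while the spec passes the token atomically, so the send maps to the spec's token-passing action and the receive maps to a stuttering step that leaves the spec state unchanged. The state shape and names are illustrative, not taken from the talk's specifications.

```python
# Map an asynchronous two-event trace (snd, rcv) onto a spec in which
# token passing is atomic: "snd" performs the spec's PassToken action,
# "rcv" is a stuttering step that leaves the spec state unchanged.

def pass_token(state):
    # Token travels downward around the ring, as in EWD998-style rings.
    return {**state, "token_at": (state["token_at"] - 1) % state["n"]}

def step(state, event):
    if event == "snd":
        return pass_token(state)     # atomic token passing in the spec
    if event == "rcv":
        return state                 # stuttering: variables unchanged
    raise ValueError(event)

state = {"n": 3, "token_at": 2}
for ev in ["snd", "rcv", "snd", "rcv"]:   # trace of two async token hops
    state = step(state, ev)
assert state["token_at"] == 0
```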
📈 Trace Validation as a Verification Tool
The speaker emphasizes the effectiveness of trace validation as a verification tool, highlighting its success in finding spec-to-code divergences in seven systems. They discuss the integration of trace validation into continuous integration/continuous deployment (CI/CD) pipelines to ensure that changes to systems like CCF adhere to the specification. The talk also explores the potential for trace validation to generalize beyond the systems it has been applied to and the possibility of combining it with other testing methodologies.
🤔 Trade-offs and Future Directions
The speaker discusses the trade-offs between using trace validation and other general-purpose programming language tools. They argue that trace validation offers a high return on investment, especially for brownfield projects with existing implementations. The talk also touches on the potential of using deterministic simulators to generate tests and subject implementations to automated validation. The speaker concludes by emphasizing the value of TLA+ in ensuring that an implementation is correct before development begins.
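The counterexample-to-test translation mentioned in the talk (done manually there, with automation as future work) can be pictured with a small Python sketch; the Counter class, its deliberate bug, and the replay helper are all invented for illustration.

```python
# Sketch: mechanically replay a model-checker counterexample (here just
# a list of action names) against an implementation, checking the
# invariant that the spec said would break.

class Counter:
    """Stand-in 'implementation' with a deliberate bug: dec underflows."""
    def __init__(self):
        self.v = 0
    def inc(self):
        self.v += 1
    def dec(self):
        self.v -= 1               # bug: no floor at zero

def replay(impl_cls, trace, invariant):
    """Run the counterexample; return False once the invariant breaks."""
    impl = impl_cls()
    for action in trace:
        getattr(impl, action)()
        if not invariant(impl):
            return False          # bug reproduced in the implementation
    return True

counterexample = ["inc", "dec", "dec"]            # e.g. emitted by TLC
assert replay(Counter, counterexample, lambda c: c.v >= 0) is False
```

Because TLA+ counterexamples are typically short, even this naive replay loop is often enough to confirm that a spec-level violation is reproducible in the implementation.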
Keywords
💡TLA+
💡Trace Validation
💡Distributed Systems
💡Causal Ordering
💡Raft
💡Vector Clock
💡Model Checking
💡Spec-to-Code Gap
💡Concurrency
💡Implementation Bugs
💡Action Composition
Highlights
The talk discusses the use of TLA+ and its tool, TLC, for validating system executions, specifically focusing on distributed systems.
The concept of Trace Validation is introduced as a method to narrow the spec-to-code gap in system development.
The importance of logging local events in distributed systems for incident analysis and testing is emphasized.
The process of collecting log files to create a global log and maintain causal order is explained.
The use of TLA+'s Trace Specification to conform system behaviors to observed log file events is described.
The discovery of multiple behaviors in a system due to incomplete information in log files is highlighted.
The validation process involves comparing the system's set of behaviors to a high-level specification.
Previous work by a group around Lamport using TLA+ and TLC to verify a cache coherence protocol is mentioned.
The application of Trace Validation in spec-driven development for systems like etcd and CCF is discussed.
The identification of bugs in real-world systems through Trace Validation is covered, including issues found in etcd and CCF.
The integration of Trace Validation into CI/CD pipelines for ongoing validation of system commits is proposed.
The tutorial on how to perform Trace Validation using EWD 998 as an example is provided.
The importance of logging send and receive events for Trace Validation is emphasized without needing a deterministic hypervisor.
The use of logical distributed clocks, such as vector clocks or Lamport clocks, for ordering log files is suggested.
The potential for TLA+ to function as a generator of traces for deterministic testing environments is explored.
The effectiveness of TLA+ in thinking abstractly about the system and ensuring correctness before implementation is highlighted.
The trade-offs between using TLA+ for Trace Validation and other general-purpose programming languages or tools are discussed.
The future work section suggests generalizing Trace Validation, refining specifications, and integrating with model-based testing or fuzzing.
Transcripts
So, yeah, welcome. I'm going to talk about validating system executions with TLA+, and when I say "I", it's really a group of people at Inria and at various entities of Microsoft. The real title of my talk is obviously an answer to a homework assignment posed by our keynote speaker this morning: how do we close the spec-to-code gap? In the past, I guess I've been one of the people who was a bit dismissive of this question, but then I realized that we leave a lot of people out there who want to do TLA+ and for whom we don't have a good answer. So the answer is pretty simple, pretty straightforward. We have a distributed system, and I will primarily talk about distributed systems here. All of your nodes are logging their local events anyway, right? Each node maintains a log file, it gets written somewhere to local storage, fine. And if there's an incident in production, you look at those log files anyway. So in testing, what we do is collect these log files and combine them into one single global log, ordered by some timestamp, by some clock, so that the causal order is maintained in the log
file. And then, in TLA+, the answer to everything is: write another specification. We have a trace specification, called Trace, that reads the log file and creates the set of behaviors T that conform to what you see in the log file. Naively, you might think there's just a single behavior in T, since this was just a single execution of your system, but perhaps you weren't able to collect all the information from your distributed system, so you had holes in your log file and you had to guess, and then you end up with more than one behavior in T. The validation part then is: you have your beautiful handwritten specification, perhaps reverse-engineered from the code, and you compare its set of behaviors to the set of behaviors of Trace. If the intersection is non-empty, then luckily this behavior, this execution, was accepted by our high-level specification. So you want to be in the intersection
part. This idea isn't new; there is prior work that actually goes back more than 20 years, when a group around Lamport used TLA+ and the then-new TLC to verify a cache coherence protocol. They did two things. First, they took behaviors coming out of TLC and replayed those behaviors in a hardware simulator; they were doing RTL design, and I have no idea what that actually means, but they were able to replay those behaviors in the RTL simulator. Simultaneously, they were also extracting other behaviors, other executions, coming out of this hardware simulator and validated those against the TLA+ specification. In other words: trace validation, except that they did this on a concurrent system, not a distributed one, and the TLA+ tooling was pretty custom-tailored
to this particular effort. In 2018, Ron Pressler wrote a beautiful little note about this technique in general: how can you relate implementation-level traces to TLA+ specifications? That note, I believe it's fair to say, sparked these two papers here: "eXtreme Modelling in Practice" out of MongoDB, by Jesse Davis et al., and the work of Jordan Halterman, who independently applied the technique in a different system, which is now a product of Intel, I believe. Jesse and colleagues were a little less enthusiastic about trace validation; Jordan Halterman, I think, was a bit more positive. They had experience with two systems. We, in this work here, have experience with seven additional
systems. The first five are actually systems where we did essentially spec-driven development using trace validation. Several years ago, I applied trace validation to the multi-producer multi-consumer problem: you have a set of producers and a set of consumers, and in the middle there's a fixed-size queue through which they communicate. If the queue is full, producers have to wait; if the queue is empty, consumers have to wait. And if you don't get your synchronization right, specifically if you only use one single mutex, then your system, in some configuration, can deadlock. The specification shows that you have to use two mutexes, and trace validation instantaneously found the
discrepancy. Well, that's a concurrent system. More recently, I applied it to a distributed termination detection algorithm, EWD998. I wrote the spec first and then the implementation, and even though I was intimately familiar with the specification (as a matter of fact, it is my running example in TLA+ workshops), I still screwed up the implementation: the token (you will later see what this token actually does) kept going around the ring long after the initiator had detected termination. That was found with trace validation. Then I thought, ah, that's an easy fix, so I came up with a quick fix, which trace validation showed was wrong, so I had to do it
again. At Inria, they implemented a specification by Lamport of 2PC, two-phase commit, and the implementation had a bug: it used a list data structure instead of a set data structure. So if a resource manager timed out while the transaction manager was waiting for all the resource managers to answer, and the resource manager subsequently re-signaled that it was ready, it would be counted twice, and then the transaction manager could just declare global commit. In a key-value store specification, due to Murat here, also implemented at Inria, they had a bug in how they initialized the local snapshot at the beginning of the transaction, again found by trace
validation. And last but not least, an implementation of vanilla Raft based on Diego Ongaro's specification. They made a deliberate mistake to get the calculation of the quorum wrong, and in this particular case, due to this broken arithmetic, a candidate could receive a quorum even though it didn't have one, and then you would end up with multiple leaders in the same term. All of this was caught by trace validation. But again, that was spec-driven development and smaller systems, I would
say. But we've also been applying it to a bigger system: etcd. I believe it has 44,000 stars on GitHub, so it's reasonably popular on the internet and might even show up in another track of this conference. It has almost 900 contributors and several releases, so it's super mature. The Microsoft product group wants to bring a new feature to etcd, called witness support: you bring down the operational cost of running the cluster, because instead of using three nodes you only use two nodes plus a witness. And they did spec-driven development: they specified the feature, they even wrote a pen-and-paper proof (yet to be mechanized, and with gaps), and that convinced the etcd maintainers that the algorithm works. But, they wondered, how do we make sure that the implementation of this new witness feature is also correct? So we decided: first, we will add trace validation to vanilla etcd as it is right now, which is currently being tracked in 111 and is nearing completion, and then, in 133, the Raft witness support will be added to etcd, based also on trace validation. One of the maintainers thinks that trace validation and TLC model checking are vital. Also, along the way, we didn't find
a safety bug in etcd with trace validation, unfortunately (dang!), but we found at least an inefficiency: if the database of one of the nodes has to be rewound, so you have to drop a suffix of the database at one node and it has to resync with the other nodes, the node would go back too far, and you would waste cycles syncing it back up again. That was found by trace validation. An interesting little anecdote
here: xiang90, who I believe is one of the founding members of etcd, apparently spoke with Lamport and Yuan Yu in 2015 about ways to generate an implementation from a specification, and Lamport and Yu said, yeah, there have been a few efforts, but they weren't super successful. I'm pretty sure that Finn over there has a different opinion; he will talk about it after me, so we'll leave that for him. But what Lamport and Yu suggested instead was: how about you write a mapping function that takes the implementation state and compares it to the specification state? In other words, trace validation. So here, some eight years later, we finally brought trace validation to etcd. And the last experience
report in my talk today is the Confidential Consortium Framework, CCF, done by Azure Research. It's Raft-inspired, not vanilla Raft: crash fault tolerant consensus with dynamic reconfiguration (we learned all about dynamic reconfiguration this morning and how hard it is), and it also has cryptographic guarantees built into the consensus layer. It is the foundation of some Azure offerings. It's perhaps not as mature as etcd, but still reasonably mature: it has several releases, it's a foundational offering in Azure, and it's written in a highly efficient programming language that everybody
loves. After 2019, actually around 2022, the team decided: okay, we have an implementation, now we want a specification. I can only speculate about the reasons, and I won't do that, but somebody was tasked with writing a specification, and they essentially followed Kelvin's outline here: reverse-engineer the source code, talk to engineers, and spend several months figuring out what the system actually does. They model checked that specification, found issues, and subsequently corrected them. Then, in 2023, we thought: hmm,
perhaps we can do more; let's try trace validation. They happened to have a testing environment, some shim, some test driver, to run multiple nodes on a single computer and have them do things. So we took the 15 or so tests they had and obtained the traces, the executions coming out of those tests, to bring the spec up to speed, to update it. And that was a great process. Initially we found low-hanging things, like: in the specification you can only send an AppendEntries message with a single entry, but the implementation may send empty, zero-entry AppendEntries messages, or batches of entries. It's okay to abstract it this way; on the other hand, if you think of batches, those entries might be from different terms, and suddenly your correctness argument might be a little bit different. We added more and more changes to the high-level specification, because we knew it wasn't up to date, and you can go on GitHub and look at the 20 or so pull requests. We got a bit more ambitious later on,
the sense that we added the
bootstrapping of notes to the
specification of how the bootstrapping
is done in the real system how the note
membership Works related to the
cryptographic guarantees and also how
note retirement works and how long nodes
have to stay around in order for the
system to remain
live and then we felt pretty pretty good
about our specifications so that we
decided now it's time to also do new
features based on the specification so
we sort of reversed from going backwards
to forwards and started to add request
propos request vote messages to the high
level specification and once we found
this worked works we implemented it um
in the
code. Later on, we also experimented with different network guarantees at the spec level, to see what kind of message channels the implementation actually needs, what kind of guarantees it needs: does it need ordered delivery, or can we get away without ordered delivery? We looked at those kinds of questions. But most surprisingly, we found two data-loss issues during trace validation. One was because CCF reuses the term field in AppendEntries messages, and due to some unlucky reordering of stale AppendEntries messages and negative AppendEntries responses (in other words, some non-happy path), the system could lose data: the commit index could be advanced past a quorum, and then, if a node crashes, data is lost. The second one was similar: a node locally reused its match-index variable, and that could also lead to unsafely advancing the commit index, again on some non-happy-path
behavior. But now the big question obviously is: how is this possible? The trace validation we did started from passing tests; those tests were green, they didn't lose data, and they supposedly assured that the system doesn't lose data. Well, obviously, if you do trace validation, you don't do it in isolation; it's one more tool in your TLA+ engineering and verification toolbelt. You start with a trace that fails, and then you iteratively update the high-level specification, or you decide: I will leave the specification a bit more abstract, but in my trace specification I will formally document this divergence (I will show you what that actually means later). Then at some point all your traces validate, and that's great, but you don't stop; you keep going and now do full-scale
verification. And that's when we found that TLC suddenly finds violations of the core correctness properties of our specification. Now, this could have been a spec bug: just because we find an issue at the level of the specification doesn't necessarily mean that the implementation is also vulnerable. So we manually translated TLC's counterexample into new tests, multiple tests, with which we confirmed that the bugs are also present in the implementation. Tests are great at showing that bugs are present, not so much at showing that there are no bugs, right? Luckily, TLA+ counterexamples are typically relatively short, so this exercise wasn't super hard, but it would be nice to do it in an automated fashion, and that's going to resurface in the future-work section of this presentation
here. Okay, knowing that the implementation was vulnerable, we then obviously came up with a fix at the level of the specification, again did a whole lot of model checking and simulation (and ideally you could also do theorem proving at this point), before we rolled the fix into the implementation. Okay, questions so far? Yeah?
[Audience member] Say you're doing 10,000 requests a second, or staying up for a couple of weeks, something like that. How do you test something like that with TLA+, where you need a huge load? [Speaker] No, we don't need a huge load in TLA+. First, your specification is highly abstract and removes a lot of the detail that is irrelevant to finding these issues. And, like Kelvin said, by default everything in TLA+ is sort of concurrent; the model checker will check all the possible combinations of these actions, so it finds these things. And you use the small-scope hypothesis: you don't test this with 40 nodes but with three or five nodes, and then those bugs become more
likely. Okay, I'll keep going, because now it's time for the trace-validation mini-tutorial based on EWD998. This is really just an excerpt; if you want to see the long version, go to this pull request, where I try to summarize everything as best as possible in various commits. So, in order to do trace
validation, you have to understand what the actual system is, and here it's this termination detection algorithm; Zack and other folks in the audience are experts in this system by now. It's pretty easy: you have a distributed system, and the nodes carry out some computation that involves sending messages back and forth. So you send a message from node A to node B, and node B receives the message. Sometime later, if nodes run out of work, they decide to deactivate, to go idle, until they receive another message. Now, detecting termination seems easy: you just look at whether all the nodes are deactivated. But if you don't have a control plane, like in a cloud environment, you somehow have to do this inside the distributed system, and that's where the token passing kicks in. We have one node that passes a token around a ring, a logical topology over our nodes, and after multiple rounds of this token passing, the initiator can infer, based on the state of the token, that the system has terminated. I'm glossing over lots and lots of detail here, obviously, and I don't expect you to fully grok the algorithm. The main takeaway is that we have three actions, send, receive, and deactivate, which model asynchronous message delivery, and two further actions, initiate token and pass token, which model the synchronous, atomic token passing in our high-level specification.
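(For readers following along, a heavily simplified Python skeleton of these five actions might look like the following. The real EWD998 additionally tracks per-node message counters and token/node colors, all omitted here, so this sketch cannot actually detect termination soundly; it only shows the shape of the actions.)

```python
# Skeleton of the five actions: send/recv/deactivate model asynchronous
# message delivery; initiate_token/pass_token model atomic token passing.

class Ring:
    def __init__(self, n):
        self.n = n
        self.active = [True] * n
        self.inflight = [0] * n          # undelivered messages per node
        self.token_at = None             # node currently holding the token

    # -- asynchronous message delivery ---------------------------------
    def send(self, src, dst):
        assert self.active[src]          # only active nodes send work
        self.inflight[dst] += 1
    def recv(self, node):
        assert self.inflight[node] > 0
        self.inflight[node] -= 1
        self.active[node] = True         # receiving a message reactivates
    def deactivate(self, node):
        self.active[node] = False

    # -- synchronous (atomic) token passing ----------------------------
    def initiate_token(self):            # node 0 is the initiator
        self.token_at = self.n - 1
    def pass_token(self, node):
        assert self.token_at == node and not self.active[node]
        self.token_at = (node - 1) % self.n

ring = Ring(3)
ring.send(0, 1); ring.recv(1)
for i in range(3):
    ring.deactivate(i)
ring.initiate_token()                    # token starts at node n-1
ring.pass_token(2); ring.pass_token(1)
assert ring.token_at == 0                # token is back at the initiator
```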
Okay, in pictures, an execution of the system may look like this. We have six nodes here, some of which are active and some inactive; they send messages back and forth, and this green thing that starts traveling around the ring is our token. The send and receive markers are the local system log, the sequence of actions that happen at each one of the nodes. Okay, now you know approximately what the system looks like. When and where do I log in the implementation? When and where do I capture system state? I do this in only two places: when a node sends a message and when a node receives a message. For no real reason, I chose JSON as my wire and log format; it could have been any format. When a node sends a message, it writes it to the socket, and since it's already serialized, I might as well also write it out to my fancy logging library, and receiving messages is
similar. So then, what does this ominous trace specification look like that reads the log file into TLA+ behaviors? This is it; well, the real one is a bit longer, I will show it afterwards. We extend the original specification, because there are so many useful definitions that we just reuse in our low-level specification. We also extend the Json and VectorClocks modules, because they have useful definitions for reading and ordering our log files. And then the only other variable that this specification declares is the length variable, which is our offset into the log: how many lines of the log have we already
processed. And then we have five actions; here is just one of them, IsSendMsg. It essentially checks whether there are still more log lines to be consumed: if length is in the domain of the log, we bump length and check whether the event in the log is a send event. If it is a send event, and the type of the message that was sent is a payload ("pl") message, then we know this has to be a SendMsg action of the high-level specification EWD998Chan, and the argument to it is the identity of the sender node. The angle brackets around SendMsg, subscripted with the variables, make sure that the variables of the high-level specification actually change. The last line is technically not strictly necessary, but I happened to define SendMsg in the high-level specification in the typical TLA+ fashion; in other words, I used existential quantification to select the receiver node. SendMsg is defined as: there exists a node in the set of all nodes such that its inbox in the next state contains one more message. With a system of, say, six nodes, we would have six successor states of the current state for a SendMsg action. But clearly we can do better: we can prune away five of those six states to limit state-space explosion, and the constraint down here does exactly that. So let's see; ah, this is what I wrote about.
okay this just way too small but I can
make it bigger now I only have to find
my
mouse so here's our S no this is pass
token this a send message yeah so this
is send message the one we we had on the
slide
Oh — okay. Well, you see the five actions here, and they all look pretty much the same — maybe I have to move around; SendMsg is up here — but it's the same pattern, repeated all over.
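That shared pattern can be sketched in Python; the JSON event shape ("snd", "pl", the field names) is a hypothetical stand-in for the log format the talk describes, and `is_send_msg` is only an illustration of the trace action, not the actual TLA+ definition:

```python
# A minimal Python sketch of the send-message trace action just shown:
# there must be an unconsumed log line, it must be a send event, and the
# message type must be a payload ("pl") message; if so, a SendMsg step by
# the node named in the log line is the only step allowed next.
import json

def is_send_msg(log, length):
    if length >= len(log):          # no more log lines to consume
        return None
    event = log[length]
    if event["op"] == "snd" and event["msg"]["type"] == "pl":
        return ("SendMsg", event["node"], event["msg"]["rcv"])
    return None

log = [json.loads(line) for line in [
    '{"op": "snd", "node": 2, "msg": {"type": "pl", "rcv": 1}}',
    '{"op": "rcv", "node": 1, "msg": {"type": "pl", "snd": 2}}',
]]
print(is_send_msg(log, 0))  # the first line forces a SendMsg step by node 2
print(is_send_msg(log, 1))  # a receive line does not match this action
```

Returning the receiver recorded in the log is what the pruning constraint amounts to: instead of six candidate receivers, only the one the log names survives.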
So let's see if we can now validate a trace that I previously recorded. TLC says: no — sorry, no can do; this trace doesn't validate. Dang. So what's the problem here? Where is it?
So the problem is that our log file — since we only log send and receive events — only contains a line that says node 2 passes the token to node 1; while the token is in transit on the wire, we have no control over it, and some time later there is a log line that says node 1 receives this token from node 2. Asynchronous message delivery. But in our specification, token passing is atomic: the token arrives at the destination in a single step. So there is a mismatch here in terms of the alignment of the behaviors. TLA+ has us covered, though — we just add a stuttering step to our trace behavior, by adding a definition called IsRecvToken which, as you can see, doesn't change any of the variables. If we go back to the specification — if I find my pointer — it's this conjunct here, and it says UNCHANGED vars: it leaves the high-level variables unchanged and just allows us to introduce a stuttering step. You could also have preprocessed the log and eliminated those log lines, but I think that's ugly; I don't like that. Okay, so let's add that conjunct to our specification and check
again. And still we don't manage to validate the trace — but I think we made progress: we are now at step number five; we moved from two to five.
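The stuttering-step idea can be sketched like this; the state shape and the `consume` helper are hypothetical illustrations, not the actual trace specification:

```python
# Hypothetical sketch of a stuttering step: a "rcv" log line has no matching
# step in the high-level spec (token passing is atomic there), so the checker
# consumes the line while leaving the high-level state unchanged.

def consume(state, log, length):
    event = log[length]
    if event["op"] == "rcv":
        # mirrors IsRecvToken's UNCHANGED vars: only the log offset advances
        return state, length + 1
    raise NotImplementedError("other events are handled by the real actions")

hl_state = {"token": "somewhere"}          # stand-in high-level state
new_state, new_length = consume(hl_state, [{"op": "rcv", "node": 1}], 0)
print(new_state is hl_state, new_length)   # True 1
```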
And now I have to do this again. The problem here — previously we had to introduce a stuttering step — is that a node is only allowed to send the token if it is inactive, but nowhere do we log that nodes deactivate; that's also not in our log. This is one of the things I mentioned earlier. In order to account for this, we have to allow nodes to nondeterministically go inactive while they pass the token, and this is done by composing the high-level action Deactivate with the high-level action PassToken, using the action-composition operator, support for which has recently been implemented in TLC. So we say: there exists a node i such that Deactivate(i) \cdot PassToken(i) — it does both in one step.
And now, if I check this new specification, it passes.
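The effect of composing the two actions can be sketched as ordinary function composition; the state shape is a made-up stand-in for the EWD998 variables:

```python
# Sketch of the action-composition idea (Deactivate \cdot PassToken): both
# high-level actions fire within one atomic step of the trace spec, so the
# intermediate state never becomes visible to the checker.

def deactivate(state, i):
    return {**state, "active": {**state["active"], i: False}}

def pass_token(state, i):
    # in EWD998 the token travels down the ring, from node i to node i-1
    return {**state, "token_at": (i - 1) % state["n"]}

def deactivate_then_pass(state, i):
    return pass_token(deactivate(state, i), i)   # composed: one step

s0 = {"n": 3, "token_at": 2, "active": {0: True, 1: True, 2: True}}
s1 = deactivate_then_pass(s0, 2)
print(s1["active"][2], s1["token_at"])   # False 1
```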
Great — so let's go back here.
But what is it we actually verify? When we do trace validation we check the intersection, right — but TLC, at least historically, only has invariants and properties: safety and liveness properties. This isn't a safety property — what would the safety property even be? It's not a state-level formula that we could check here. You might be tempted to check for deadlock, but — I guess I went too fast earlier — the diameter in model checking was something like 150, while the number of distinct states was 156 or so, because due to nondeterminism there were a few states in the state graph that would violate a deadlock check. We would see spurious counterexamples, because those states have no successor states. For the same reason we can't use a liveness property: "eventually always, the length variable equals the length of the log" would also produce a spurious counterexample. What we really want to check is a CTL property — there exists a path in the set of all behaviors such that the length variable equals the length of the log — but we can't express that in TLA+. The next best workaround, if you want something like a counterexample, is this liveness property here, which is a bit convoluted and which I had to abbreviate to fit on the slide. It essentially says: for as long as we are not at the point where the length variable equals the length of the log, TLC still has unexplored states in its state queue. It's very technical, very
messy, but you get some candidate behavior as a counterexample. The real property you want to check, though — because, as I will argue in a minute, there are no counterexamples here — is that the longest behavior TLC generated, the diameter, equals the length of the log: there exists a behavior in the set of all behaviors whose length is the length of the log.
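That property — the diameter equals Len(log) — can be sketched as a longest-path check over the state graph; the graph below is a made-up example, and trace-spec state graphs are acyclic because every step advances the log offset:

```python
# Sketch of "some behavior consumes the entire log": the longest path TLC
# explored (the diameter) must equal the length of the log.
from functools import lru_cache

def diameter(graph, init):
    @lru_cache(maxsize=None)
    def longest(s):
        return max((1 + longest(t) for t in graph.get(s, ())), default=0)
    return longest(init)

graph = {                       # "s2" is a dead end after one step
    "s0": ("s1", "s2"),
    "s1": ("s3",),
}
log_len = 2
print(diameter(graph, "s0") == log_len)  # True: one behavior matches the log
```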
A violation in trace validation would then be a graph that looks — can I find it? is that the trace? — that looks like this here. If you zoom in, there are many points; obviously you want to do this on a bigger screen. You see that TLC finds all these alternatives, some of which have been excluded — those are the yellow states, excluded by the last conjunct of the send-message action — but perhaps those would have been the real states, the real path, to fully match the log. That's why I argue you always have to look at the graph rather than at individual counterexamples: a single counterexample might be a red herring. By the way, the TLA+ debugger can help you figure out where the validation fails: you can set a dedicated breakpoint, and the debugger will halt the moment TLC starts rejecting
states. Okay — so, in the two Raft-based systems, how much logging did we have to add? In the case of etcd we added logging in 11 places, and in CCF we did so in 15 places — and it really is essentially stdout: there is no convoluted logging library that we make you buy into, and we don't require some deterministic hypervisor. It's your system, running in your environment, just logging. We log when messages are sent and received; because CCF is written in C++, there are multiple places where messages are sent and received, which is why the number of logging sites differs. We also decided to log in our testing infrastructure when it deliberately drops a message. We could have gone without that, but we found that the nondeterminism we would then have to introduce in the trace specification brought validation time up from seconds to a few minutes, so we decided to log it — to keep model checking, or rather trace validation, really fast. In general: log when messages are sent and received. You won't have issues with inadvertently deadlocking your system, because after things have been serialized, and before they are deserialized back into your system, there are no locks left that could get in your way. Then, ideally, also log any observable node-local state changes: the nodes going inactive in my EWD998 example could also have been logged to stdout — I just decided not to, for the purposes of this mini tutorial.
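As a sketch of what such logging can look like — the field names and clock encoding here are hypothetical, not CCF's or etcd's actual format:

```python
# Hypothetical logging sketch: one JSON line per send/receive event, written
# to stdout, carrying only constant-size fields (event kind, node id, message
# metadata, clock) -- no logging framework required.
import json

def log_event(op, node, msg, clock):
    line = json.dumps({"op": op, "node": node, "msg": msg, "clock": clock})
    print(line)          # plain stdout is all the trace spec needs
    return line

sent = log_event("snd", 2, {"type": "pl", "rcv": 1}, {"2": 5})
recv = log_event("rcv", 1, {"type": "pl", "snd": 2}, {"1": 3, "2": 5})
```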
Also include both the primed and the unprimed variables, to strengthen your matching, and only log constant-size variables: don't log the content of your database — you don't want to write the database to stdout in every state; that wouldn't be feasible, which goes without saying. In order to synchronize — or rather, causally order — your log files, you can
use a centralized clock if you have to, but I find it far more elegant to use a logical, distributed clock: a vector clock — or, if space is prohibitive because you have so many nodes, a Lamport clock. I would even argue that you want such a distributed clock even if you don't do trace validation, just to make sense of your log files. If you want vector clocks, we have a VectorClock module in the CommunityModules; it has a Java module override to make things fast, and we borrowed the code that sorts the vector clocks from the ShiViz project. As a bonus, once you have vector clocks you can use Ivan's beautiful ShiViz tool to visualize the nodes of your system and how they communicate — you sort of get that for free. I don't have time for a demo, but it's great, and I'm pretty sure Ivan would be willing to give you one if you ask.
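The happened-before order that vector clocks give you can be sketched as a pointwise comparison; clocks here are plain dicts from node id to counter:

```python
# Sketch of the happened-before check used to causally order log lines by
# their vector clocks.

def happened_before(a, b):
    """a -> b iff a is pointwise <= b and strictly smaller somewhere."""
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

send = {"n1": 1}                 # sender's clock at the send event
recv = {"n1": 1, "n2": 1}        # receiver merged the sender's clock
print(happened_before(send, recv))            # True: send precedes receive
print(happened_before(recv, send))            # False
print(happened_before({"n1": 1}, {"n2": 1}))  # False: concurrent events
```

Concurrent events (False in both directions) are exactly the pairs a global log may order either way.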
So, to wrap up: I hope to have convinced you that the TLA+ tools are mature enough to narrow the spec-to-code gap. I hope we get an A from Mark for our homework here, because, as you've seen, trace validation found spec-to-code divergences in all seven systems, and we found non-trivial bugs in real-world systems. It was also beautiful to reverse-engineer CCF — updating the specification based on a trace. Well, perhaps "beautiful" is too optimistic: I was staring at this one debugger and that other debugger, trying to compare values — perhaps not beautiful, but at least better than throwing darts in the dark. More importantly, from now on CI/CD runs this trace validation on every commit, so if some non-TLA+ engineer changes CCF in a way that breaks the specification, we will likely find it. Clearly this stuff requires TLA+ expertise — but if you don't have a TLA+ expert, you won't have a high-level TLA+ specification either, so trace validation becomes moot anyway. And I'm kind of proud that the most recent pull request, 6119, opened last week, was opened by an engineer who — he's pretty clever, but he only had two days of TLA+ training — and this pull request brings trace validation to another specification in CCF. Okay — what's
next? Well — does trace validation generalize? I think we now have nine or so systems where it has provided value. One thing you could argue is that most of them were Raft, and — more importantly — the Raft specification is super low-level from a TLA+ perspective; if you look at the Paxos specification, it's way more high-level, and one might wonder whether you would still find as many issues with a more high-level specification. But at least in the case of Raft it works, and in TLA+ we can always refine specifications: even if you start with a high-level spec, you can gradually refine it to bring it closer to the implementation and then do trace validation. We also haven't really done this at scale yet — a.k.a. in production; that remains to be done.
And in order to handle the case with the dropped messages, we implemented a true depth-first-search mode in TLC, which made things significantly faster: if you think about it, when you only check the intersection, it suffices to find one behavior in the intersection — then you can stop; you don't have to exhaustively search the state space. What's missing here is sort of the
companion of trace validation — something I would call model-based testing — which helps you generate diverse sets of behaviors. Ideally, it would also be able to take counterexamples and translate them into system tests, so that you don't have to do that translation manually when the high-level verification finds that something is off. Alternatively, you can of course try fuzzing or chaos engineering to penetrate the dark corners of your implementation's state space — and with trace validation you at least have a notion of coverage, because you can look at the spec coverage. Okay — thank you.
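The depth-first, early-exit search mentioned a moment ago can be sketched as follows; the toy `next_states` system is made up for illustration:

```python
# Sketch of the depth-first, early-exit idea: for trace validation it is
# enough to find one behavior that consumes the whole log, so the search
# stops at the first witness instead of exploring the entire state space.

def find_witness(init, next_states, log_len):
    stack = [(init, (init,))]
    while stack:
        state, path = stack.pop()
        if len(path) - 1 == log_len:
            return path                      # witness found: stop here
        stack.extend((t, path + (t,)) for t in next_states(state))
    return None                              # no behavior matches the log

def next_states(s):
    # toy system: each state may consume the next log line until it gets stuck
    return [s + 1] if s < 3 else []

print(find_witness(0, next_states, 3))   # (0, 1, 2, 3)
```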
[Applause]
Thank you. Do we have questions?

It was interesting to see seven out of seven there. So — kind of to Mark's point earlier — does this start to feel like something that's as important to do as the spec itself? Rather than saying "hey, it's really important to write this spec" and then hand-waving that it will translate easily into code: where's your feeling on the level of importance, after seeing seven out of seven?

Yeah — I think I'm surprised by how effective it is; the return on investment here seems to be pretty high. On CCF I believe I worked a couple of months, but the majority of the time was spent updating the TLA+ tools. If a system expert had written the trace specification, it would have been faster — I first had to understand the system.
Thanks for a great talk, Markus. I was wondering — you've described trace validation, and I essentially see it as a one-time-cost effort for mapping a specific system to a specific model. But there are also these more general modeling-language tools, like PGo, that people have been working on, which basically try to raise the level of abstraction — or lower the gap, if you will — on the model side and offer a more mechanized path to implementation. I'm wondering about the trade-offs between these; maybe you could speak, perhaps somewhat abstractly, about trace validation and its pros and cons compared to that.

Yeah — I mean, it's a great idea to generate the implementation from the specification, if that works. But I think a current constraint is that it has to be a greenfield project, and all of this was brownfield: you have an existing implementation, and I think that's where trace validation has the biggest return on investment. For a greenfield project I would probably try something like PGo first — generate as much as possible, be principled in how I implement whatever still has to be written manually — and I would probably feel better about the implementation if I applied that technique.
Additional questions? Okay — you two will have to duke it out.

Hey — so, recently, the whole business of subjecting distributed systems to chaos engineering, traces, and the like got shaken up by the publication of the Antithesis system — I don't know if you've heard of Antithesis. They basically built a fully deterministic virtual machine; MongoDB worked with them and said it worked really well. And then there's also something that has been kicking around for the past decade with Jepsen and similar tools, which did something comparable. Have you thought about how TLA+ could function as, say, the generator of traces for something like Jepsen — or, I mean, I don't think anyone really has access to Antithesis yet, but something like that?
That would be my argument about Antithesis. But I think, generally, all of these systems live in those three dots up here — fuzzing, chaos engineering, deterministic hypervisors, all the kinds of testing environments you have. It's great that there's now a product out there; I think the same ideas have been implemented over and over again for specific projects. From my perspective, TLA+ still has the beauty of making you think abstractly about the system. Like I said, I have been dismissive of the question of how you implement the system, because to some degree I keep saying it's easier to implement something once you know that it works — and TLA+ gives you this guarantee that it works before you set out to implement it. Take my implementation of EWD998: initially I tried to implement EWD998 without the specification — I had the EWD paper, I had very little time, I figured I would just implement it. Then I spent a day scratching my head over how to do one part in an imperative language, realized I didn't really know how the algorithm works, and essentially gave up. Then I wrote the spec. I still screwed up once, but the implementation was done in three hours and it was 99% correct. That is the value of TLA+: you know what you're implementing — everything else is icing on the cake.

We have one more question — maybe we can have Finn set up.
Well — I was surprised to see that several of the specifications were shared-memory specifications. So in your experience it doesn't matter that much that the specification — obviously all the implementations are message passing — but it doesn't matter that much whether the specification is message passing?

As long as you can find a good refinement mapping, right. If what the specification assumes about message communication deviates from the implementation, you have more work to do in the trace specification, but there are ways — stuttering and action composition — to deal with that. Perhaps the biggest example where we used action composition was where Raft piggybacks term updates on top of ordinary messages: that was then always composed into one single step, the update-term action composed with the ordinary receive actions.

Great talk — and one more thing
about the deterministic simulators: I think, as Markus already mentioned, you can generate tests from the spec, and using a deterministic simulator you can subject the implementation to those same tests in a basically automated way — I think that's a great opportunity.

Let's give another hand to Markus — thank you for your talk.