SREcon24 Americas - 20 Years of SRE: Highs and Lows

USENIX
18 Apr 2024, 27:27

Summary

TL;DR: The speaker reflects on the evolution of Site Reliability Engineering (SRE) over 20 years, emphasizing its roots in startups and engineering solutions to operational challenges. They discuss the growth of SRE, its integration into various sectors, and the importance of its principles in preventing burnout and toil. The talk also addresses the challenges faced by SRE, including career pipeline issues, adoption failures, and the persistent perception of operations as a low-status function, advocating for a continued application of software techniques to operations.

Takeaways

  • 📚 The script reflects on the evolution of the Site Reliability Engineering (SRE) role over the past 20 years, emphasizing the importance of reliability in AI and the tech industry.
  • 🎨 The speaker appreciates the artistic representation of '20 years of SRE' in the style of an ancient Irish medieval illuminated manuscript, highlighting the value of creativity in technical fields.
  • 🗣️ The speaker identifies themselves as a key figure in popularizing SRE, with their book on the subject being widely recognized and available in commercial spaces.
  • 🌟 The talk is dedicated to the speaker's stepmother, Helen Gray, who recently passed away, adding a personal touch to the professional discussion.
  • 🔍 The speaker clarifies that their perspective on SRE is personal and may not cover all aspects or align with every professional's view, urging the audience to consider this limited viewpoint.
  • 🐘 The 'Elephant in the Server Room' is identified as Google, indicating the company's significant influence on the SRE field and its practices.
  • 🚀 The script challenges the narrative that SRE is incompatible with the fast-paced, resource-strapped environment of startups, suggesting that SRE principles can thrive in such conditions.
  • 🛠️ SRE is portrayed as an engineering-driven field, where solutions are iteratively improved upon, rather than being seen as a static, large-scale operation.
  • 🌐 The speaker discusses the widespread adoption of SRE principles across various sectors and company sizes, indicating the model's versatility and relevance.
  • 📈 The growth of the SRE market is noted, with the speaker observing an increase in demand for SRE knowledge and practices, despite the availability of free resources.
  • 🏆 The script highlights the impact of SRE on broader societal issues, with examples of SRE professionals making significant contributions to social causes and ethical considerations in technology.

Q & A

  • What is the significance of the '20 years of SRE' phrase mentioned in the script?

    -The phrase '20 years of SRE' is used as a reflective opportunity rather than a precise timeline. It's meant to spark a discussion on the evolution and impact of Site Reliability Engineering (SRE) over the past two decades.

  • What is the author's connection to the popularization of SRE?

    -The author is responsible for popularizing SRE, largely through the publication of a book on the subject, which has been found on commercial shelves and has contributed to the widespread understanding of SRE principles.

  • Why is the talk dedicated to the author's stepmother, Helen Gray?

    -The talk is dedicated to Helen Gray as a personal tribute, acknowledging her passing and the personal significance this event holds for the author.

  • What does the author mean by 'the Elephant in the server room'?

    -The phrase 'the Elephant in the server room' is a metaphor for the obvious yet often ignored or unaddressed issue in the industry, which in this context refers to Google's influence and its relationship with SRE.

  • How does the author describe the evolution of Google's approach to system management?

    -The author describes Google's evolution as moving from a simple list of machines to more complex systems like babysitter, Borg, and others, emphasizing the importance of incremental improvements and engineering solutions to manage system reliability.

  • What is the author's view on the relationship between SRE and startups?

    -The author believes that SRE is often misunderstood as being incompatible with the fast-paced, resource-constrained environment of startups. However, he argues that SRE principles can be effectively applied in startups, where the focus is on incremental improvements and engineering solutions to problems.

  • What is the author's perspective on the adoption of SRE across different sectors?

    -The author notes that SRE has permeated various sectors, including entertainment, food delivery, education, and government, indicating a broad acceptance and application of SRE principles beyond just large multinational corporations.

  • Why does the author mention the book 'Site Reliability Engineering' continues to sell well?

    -The author points out that despite the content of the book being freely available, its continued sales indicate a demand for high-quality, curated information on SRE, suggesting that the model and practices it discusses are still relevant and sought after.

  • What does the author suggest about the impact of SRE on general engineering and business consciousness?

    -The author suggests that SRE ideas have not only permeated general engineering practices but have also influenced business consciousness, as evidenced by references to SRE in business strategy reports from prominent organizations like Gartner and Forrester.

  • How does the author address the issue of SRE's role in social and ethical contexts?

    -The author highlights instances where SRE professionals have used their skills to address broader social issues, such as the US healthcare system and the #MeToo movement, emphasizing that SRE is not just about technical solutions but also about ethical responsibility and societal impact.

  • What challenges does the author identify in the SRE career pipeline?

    -The author identifies challenges in the SRE career pipeline, particularly for junior level professionals, suggesting that the field can be intimidating and that more needs to be done to encourage and facilitate entry-level participation in SRE roles.

  • What is the author's view on the need for quantitative models in SRE?

    -The author believes there is a need for more quantitative models in SRE to provide a numeric framework for understanding the value of SRE work and to demonstrate the impact of SRE practices on organizational success.

  • What does the author consider the most urgent problem facing SRE today?

    -The author considers the persistent idea that operations, and by extension SRE, is of low status as the most urgent problem. This perception can hinder investment in reliability and user experience, despite their proven value.

  • How does the author summarize the impact of SRE over the past 20 years?

    -The author summarizes the impact of SRE as the radical idea that it is legitimate to apply software techniques and systems thinking to operations, an idea that has been revolutionary but remains surprisingly radical even in 2024.

Outlines

00:00

🎉 Reflecting on SRE's Evolution and Personal Dedication

The speaker begins by acknowledging the 20-year milestone of Site Reliability Engineering (SRE), using the talk's title as a moment for reflection rather than focusing on the accuracy of the timeframe. They joke about the importance of reliability in AI by comparing two Midjourney renderings of the title, preferring the one in the style of a medieval illuminated manuscript to the first attempt. The talk is dedicated to the speaker's stepmother, Helen Gray, who passed away recently. The speaker clarifies that their perspective is personal and may not cover everyone's favorite aspects of SRE because of their limited view. They introduce the 'Elephant in the Server Room,' a metaphor for an obvious issue that is being ignored, which in this case is Google's influence on SRE. The speaker also discusses the evolution of the SRE role and changing perceptions of it, challenging the narrative that Google's size and influence make SRE inflexible and machine-like. Instead, they propose an alternative view that considers Google's startup origins and SRE's adaptability and engineering mindset.

05:01

🔧 The Engineering Mindset Behind SRE's Growth

The speaker delves into the engineering mindset that has driven SRE's growth, emphasizing the iterative process of replacing 'slightly less terrible' solutions with better ones. They recount the early days at Google, where the need for managing clusters led to the creation of 'babysitter,' a system that, despite its flaws, was an improvement over the previous ad-hoc methods. The speaker argues that SRE's success is rooted in this engineering approach, which focuses on incremental improvements and adaptability, rather than being a static, large-scale operation. They also touch on the importance of storytelling in shaping the understanding of SRE's role and history, highlighting the need for accurate narratives that reflect the reality of SRE's dynamic and problem-solving nature.
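The transcript describes the setup that preceded babysitter as a Python file, cluster.py, holding arrays that pin each service to specific machines; engineers forked the file whenever machines went down or changed roles, and babysitter rested on the realization that a mechanism to change the arrays at runtime removes the need to fork. A minimal sketch of that contrast follows. Every name in it (the machine names, `STATIC_ASSIGNMENTS`, `Assignments`, `reassign`) is hypothetical and chosen for illustration; it is not Google's actual code.

```python
from dataclasses import dataclass, field

# Static style: a hard-coded mapping of service -> machines.
# Coping with a dead machine means editing (or forking) this file.
# All names here are hypothetical, for illustration only.
STATIC_ASSIGNMENTS = {
    "indexing": ["machine1", "machine3"],
    "serving": ["machine2"],
}


@dataclass
class Assignments:
    """Service -> machines mapping that can be changed while running."""

    by_service: dict[str, list[str]] = field(default_factory=dict)

    def machines_for(self, service: str) -> list[str]:
        # Return a copy so callers cannot mutate internal state by accident.
        return list(self.by_service.get(service, []))

    def reassign(self, service: str, old: str, new: str) -> None:
        # Swap a failed machine for a replacement, or add the new machine
        # if the old one was not assigned to this service.
        machines = self.by_service.setdefault(service, [])
        if old in machines:
            machines[machines.index(old)] = new
        else:
            machines.append(new)


# Seed the runtime view from the static file, copying each list so the
# original mapping is left untouched.
live = Assignments({svc: list(ms) for svc, ms in STATIC_ASSIGNMENTS.items()})

# machine3 goes down: update the live mapping instead of forking the file.
live.reassign("indexing", old="machine3", new="machine7")
print(live.machines_for("indexing"))  # ['machine1', 'machine7']
```

The sketch keeps the static mapping only as a seed, so a single shared runtime view replaces the per-engineer forks the speaker describes.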

10:01

🌐 SRE's Broad Impact and Integration into Various Sectors

The speaker highlights the widespread adoption of SRE principles across various sectors, from entertainment to government, indicating the model's versatility and relevance. They note the continued interest in SRE material, as evidenced by the ongoing sales of related books, despite the content being freely available. This suggests a demand for high-quality information on the subject. The speaker also observes that SRE concepts are not only permeating general engineering practices but also influencing business consciousness, as seen in the attention from prominent IT strategy organizations like Gartner and Forrester. They emphasize the importance of SRE's contribution to engineering and societal contexts, citing examples of SRE professionals making significant impacts beyond their roles, such as influencing healthcare systems and initiating social movements.

15:03

🛠️ SRE's Role in Engineering and Social Change

The speaker discusses the significant engineering contributions of SRE teams over the past two decades, noting their vital involvement in the development of successful Google systems that have had a broad impact, both internally and externally through citations in academic papers. They also reflect on SRE's social influence, with professionals from the field appearing on Time magazine covers for their roles in fixing broken systems and initiating change. The speaker ponders the reasons behind SRE's inclination towards positive social impact, suggesting it may be due to the profession's focus on user experience, a strong mission orientation, or the historical influence of early SRE professionals. They conclude that SRE's ability to prevent burnout and toil through its practices is a significant achievement in itself.

20:04

🚧 Challenges and Failures in SRE's Adoption and Identity

The speaker addresses the challenges faced by SRE, particularly concerning the career pipeline for junior level professionals and the intimidation factor for outsiders. They express concern over the failures in SRE program adoption within companies that should be prime candidates for such practices. The speaker regrets the lack of organizational post-mortems to understand these failures better. They also touch on the dilution of the SRE term and the potential overemphasis on identity within the profession. The speaker advocates for a focus on generating more models, especially quantitative ones, to understand and communicate the value of SRE work better. They highlight the need to counteract the idea that reliability no longer matters due to economic shifts and to continue investing in user experience.

25:06

🌟 The Radical Notion of Applying Software Techniques to Operations

In the concluding remarks, the speaker reflects on the past 20 years of SRE and the industry's current state. They reiterate the radical idea that it is entirely legitimate to apply software techniques and systems thinking to the operations domain. Despite the progress made, the speaker asserts that this concept remains radical in 2024. They encourage challenging the notion that operations are of low status and emphasize the societal bias against maintaining systems and executing patterns. The speaker ends on a reflective note, acknowledging the significant journey SRE has undertaken and the need to continue pushing for its recognition and integration in various domains.

Keywords

💡SRE (Site Reliability Engineering)

SRE, or Site Reliability Engineering, is a discipline that incorporates aspects of software engineering to improve the reliability, scalability, and performance of production systems. It is central to the video's theme as it discusses the evolution and impact of SRE over the past 20 years. The script mentions how SRE has grown from a concept to a widely adopted practice across various industries, with the speaker reflecting on its beginnings and its role in handling system management and automation.

💡Reliability

Reliability in the context of the video refers to the dependability of a system or service to perform its intended function without failure. It is a core aspect of SRE, where the goal is to ensure that systems remain available and consistent. The script uses the phrase '20 years of SRE' to reflect on the history of SRE's focus on reliability, and how it has been a driving factor in the development of systems and practices within Google and beyond.

💡Startups

Startups are new businesses that are typically characterized by limited resources, rapid growth, and a high level of uncertainty. The script discusses the relationship between startups and SRE, highlighting that Google, like many startups, had to navigate through relentless pressure and change. The speaker argues against the perception that SRE is incompatible with the fast-paced, chaotic environment of startups, suggesting that SRE principles can be effectively applied in such settings.

💡Engineering

Engineering in this video is presented as the process of creating solutions to problems, often through the application of scientific, economic, social, and practical knowledge. The script emphasizes the engineering mindset of SREs, who are not just maintaining systems but actively working to improve them. The speaker shares anecdotes about how SREs at Google engineered their way out of problems, leading to the development of systems like babysitter, Borg, and others.

💡Toil

Toil in the video refers to the manual, repetitive, and often tedious work that can lead to burnout and inefficiency. The speaker discusses the cursed knowledge that SREs have gained from understanding that much of the toil associated with system management is preventable. The script suggests that one of SRE's achievements is helping the industry recognize and address the issue of toil.

💡Burnout

Burnout is a state of chronic stress that can result from overwork and lack of balance between work and personal life. The video addresses the issue of burnout in the context of SRE work, suggesting that the profession has made strides in understanding and preventing it. The script implies that by applying SRE principles, the industry can mitigate the risk of burnout for those involved in system management.

💡Evolution

Evolution in the video refers to the development and changes that have occurred in the SRE field over the past two decades. The speaker reflects on how SRE has evolved from its early days at Google to become a recognized discipline within the tech industry. The script discusses the evolution of SRE roles, practices, and the perception of the field.

💡Postmortem

A postmortem in the context of the video is an analysis conducted after a significant event, such as a system failure, to understand what happened and how to prevent similar occurrences in the future. The script mentions the importance of postmortems in SRE for learning and improvement, and the speaker expresses a desire for more organizational post-mortems related to SRE adoption failures.

💡Career Pipeline

The career pipeline refers to the progression of career stages and opportunities within a profession. The video discusses the challenges of the SRE career pipeline, particularly for junior level individuals. The script suggests that the profession needs to do more to attract and develop talent, ensuring that those entering the field can grow and advance.

💡Quantitative Models

Quantitative models in the video are numerical frameworks used to measure and understand complex phenomena. The speaker calls for more quantitative models in SRE to better understand and articulate the value of the work done by SREs. The script implies that such models could help in demonstrating the impact of SRE practices on system reliability and user experience.

💡Operations

Operations in the video refers to the ongoing processes and systems that support the functioning of a business or service. The speaker discusses the persistent undervaluation of operations within the industry, suggesting that there is a societal bias against investing in the maintenance and reliability of systems. The script argues for a change in perception to recognize the importance of operations and the role of SRE within it.

Highlights

Reflection on '20 years of SRE' without fixating on the accuracy of the phrase, using it as an opportunity to look back on the past.

The importance of reliability in AI illustrated by comparing two different renderings of '20 years of SRE'.

The speaker's role in popularizing SRE through his book, visible on commercial shelves.

Personal dedication of the talk to the speaker's stepmother Helen Gray, who passed away last month.

Clarification that the talk represents a personal view, not influenced by any multinational or department.

The 'Elephant in the server room' analogy for the often overlooked entity of Google in SRE discussions.

Contrasting the perception of Google as a relentless, infallible giant with the reality of its startup origins.

The misconception that SRE is incompatible with the fast-paced, chaotic nature of startups.

The narrative of SRE evolving from the need to engineer solutions to immediate problems in startups.

The story of 'babysitter', an early system at Google that, despite its flaws, was a step up from previous methods.

The role of SRE thinking in the development of Google's production systems like Borg and Workqueue.

The idea that SRE principles have permeated general engineering and business consciousness.

The observation that SRE practices are being adopted across various sectors and company sizes.

The ongoing success and demand for SRE material, despite it being available for free.

The contribution of SREs to significant Google systems and their wider industry impact through publications.

SRE's role in fixing broken systems and its connection to ethical considerations and social impact.

Concerns about the SRE career pipeline, particularly for junior level professionals.

The issue of SRE program adoption failures and the lack of industry post-mortems to learn from them.

The need for more quantitative models to understand the value and impact of SRE work.

The challenge of maintaining the relevance of SRE in a changing economic climate and the importance of user experience.

The persistent societal bias against operations work and the low status perception of operational roles.

The radical idea that software techniques and systems thinking can be applied to the operations domain, still considered radical after 20 years.

Transcripts

00:01

[Music]

00:16

So, first things first, starting with the title: "20 years of SRE". Long time. Let's not fixate on the accuracy of that particular phrase; I think we should use it as an opportunity to reflect on the past. In the spirit of this, and also in the spirit of the fact that SREcon Americas makes me miss my national holiday, St Patrick's Day, March the 17th, every single year, here is Midjourney's rendering of "20 years of SRE" in the style of an ancient Irish medieval illuminated manuscript, which I greatly prefer to the rendering of the first attempt, which I think nicely illustrates the importance of reliability in AI.

00:57

OK, so we've got a lot to cover: 20 years in 20 minutes, plus questions. By linear extrapolation that's one year per minute, so I'll talk very quickly, obviously.

01:11

Who am I and why should you care? I suppose I'm probably more responsible than most for popularizing SRE, in terms of this book, which you can see in the picture on the left, and which I am very happy to find on the commercial shelves of a bookshop somewhere down in the Valley, actually. So, seen a few things. There's a picture of me ripping it up, which is part of a previous SREcon opening plenary, if I recall correctly. Here's a picture of me writing another one. The other photo, on a personal note, is my stepmother Helen Gray, who passed away last month and to whom this talk is dedicated.

01:48

On that personal note, I want to emphasize that this is a personal view of a talk, right? There's no giant multinational, there's no PR department; the PR department is me. In other words, if I'm missing your favorite thing, it's because of my limited view and limited perspective. Just want to make that clear.

02:09

But with all of those preliminaries out of the way: anyone take a guess what this is? Elephant in the server room. It's the elephant in the server room, an accident waiting to happen. Popular English phrase meaning the obvious thing no one is talking about, and the elephant in the server room is, of course, Google. The image here is an internal meme that got leaked to BuzzFeed, if I recall correctly. So there's going to be a certain amount of Google in this presentation; you are unfortunately going to have to be okay with that.

02:44

But I began my reflections on 20 years by thinking about the evolution of the role and the change in perceptions of this role, and I realized that I kind of have an internal story about what SRE is and how it's changed which is different to many other people's. In fact, I think there's one key story which has grown up around Google and its relationship with SRE that I think it's necessary to attack, actually.

03:13

So I appreciate this may be shocking to you, but sometimes stories aren't true, or different sets of stories are available depending on the evidence you select from. I'll take one story to start with, which is that Google is smooth and relentless and infallible and gigantic and so on and so forth, and SRE as a result inherits those properties. So we'd think of SRE as kind of having a machine-like quality, right? Giant resources, giant machine, giant problems, and so on.

03:51

But other stories are available. To do so, I'll have to talk a little bit about startups. What are startups like? Startups are like eating glass, as Sean Parker and a number of other people said; various other entrepreneurs have said it. Startups have a lot of characteristics that make them kind of not really like ordinary life. There isn't really a steady state; things are changing all the time, and if they aren't, that's bad. Relentless pressure, and so on. The shocking news, or the different story, is that Google was a startup once as well.

04:28

One thing that's kind of developed over time is that people seem to think that SRE is somehow in opposition to, or incompatible with, constant change, fast-moving startups in general, and so on. I think there's a bunch of reasons that people believe this, but I want to hone in on one for brevity's sake. Might not be your favorite reason, but it is this: people seem to think that startup chaos, relentless pressure and not enough resources being incompatible is a function of, I suppose, clarity; or rather, the pressure to produce means that you output bad solutions because you haven't got the time to do better solutions.

05:13

Now, not that I'm saying that finishing your slides five minutes before you get on stage produces your best work all of the time. Necessary short-term clarity around choices has produced some of my best work, although for the sake of engineering coherence I feel I should tell you it has also produced some of my worst work.

05:35

But there is this conflict, right? It's curious to me to hear how the story of SRE has become, for want of a better term, very "big company", because in my experience one very real loop that drives work in startups is the following: putting the slightly less terrible thing in place before tomorrow's terrible thing happens. The way you get from one of these things to another is by engineering, and in my opinion, in my story, SRE grows from this: not as a brake on progress, not as a necessary component of a planet-spanning multinational, not as an infinitely large, smoothly working machine, but as doing things while they're useful; then when they're not, you stop doing them and you start doing different things instead. The story is really driven by engineering.

06:36

So let me tell you a story about the early, early days, which I was not present for all of, to be clear, so I am relying on research and stories. There was a file at one stage called cluster.py. Google production starts off very, very simply: there's essentially a list of machines, and specific software runs on specific machines. So there's an array defined in some Python somewhere, and it's like, okay, the indexing is on machine 1, machine 3, whatever. Okay, fine. It turns out machines go down, change roles, and so on and so forth. How do people handle this? Well, they fork the file and add their own stuff, until eventually there's one fork for every FTE. Well, not quite that bad, but close to that.

07:27

So eventually something better is needed, and you can't buy anything to do this, right? It's all internal stuff. So there's a system built called, I think, babysitter, that has been talked about publicly, and it's really based around the astonishing realization that if you can provide some mechanism to change the arrays at runtime, you don't actually need to fork the script.

07:50

Anyway, very successful. So this is the key point: babysitter is terrible, but it's incrementally less terrible than the thing that was before it, and so in due course it buys us time to put in other things. Workqueue and Borg and a bunch of other kinds of systems come along and solve various components of the cluster scheduling workload.

08:13

I would say that SRE as an organization would be centrally involved in this, but actually this is too early; SRE doesn't even exist at this point. But SRE thinking does. There's a set of people who care about production, no matter what they're called, and their inclination is to engineer their way out of problems; often engineering directly into more problems, but definitely engineering their way out of problems.

play08:44

okay so I've been talking a lot about

play08:47

stories playing a role in how you

play08:49

understand yourself obviously every good

play08:50

superhero needs an origin story alas so

play08:54

do the evil ones one uh is the one of a

play08:58

better term official story so Ben

play09:01

trainer sloth does a great job of

play09:03

organizational alignment and

play09:04

storytelling and so on uh following on

play09:06

from the work uh I discovered my

play09:08

research of ma Bloomfield who's manager

play09:11

of the very earliest production teams in

play09:13

Google wanted to mention her name

play09:15

publicly but other stories are available

play09:19

other selections from the evidence are

play09:21

available item two is another story h i

play09:25

it's not as well sourced right so it

play09:27

basically says is that from very early

play09:31

on there is a sense that Google doesn't

play09:34

want to hire folks who are just going to

play09:36

follow playbooks for the act of system

play09:38

management want people who could

play09:39

actually write software automate the

play09:42

playbooks etc etc I can't verify this I

play09:45

have been told it I think it does play

play09:47

into the situational awareness for toil

play09:50

that Google SRE in particular has uh and

play09:54

that maybe not a lot of other places

play09:55

have actually uh but three is kind of

play09:59

what I've been talking about what I want

play10:01

to surface which is I think Beyond any

play10:03

confusion around naming or title or role

play10:07

or any of that stuff is the core facts

play10:10

that a when you get people who care

play10:12

about production and B when they're

play10:14

allowed to when they can engineer stuff

play10:18

with to in production you get magic

play10:23

happening and so when I reflect back on

play10:26

the gigantic progress of SRE over the 20

play10:29

years which is actually like it is

play10:32

really something this thing we do from

play10:34

the before the earliest days that it had

play10:38

this name to this gigantic thing it's

play10:40

become and how utterly bizarre that

play10:42

seems in retrospect I try to hang on to

play10:44

those two facts A and B to keep me

play10:48

grounded okay uh prizing over let's move

play10:52

on from history somewhat uh and earliest

play10:56

days to themes and ideas highlights and

play11:00

low lights we can pick out of the course

play11:02

of

play11:03

history uh so let's start out with the

play11:06

obvious one SRE teams are everywhere this

play11:09

set of logos that I stuck up there is

play11:12

absolutely not a definitive list it is

play11:14

the list of places that had jobs open

play11:17

when on Saturday I opened the tab and

play11:19

did the

play11:21

search but it gives you a flavor of the

play11:24

places that SRE has made its way into and

play11:26

so from a sectoral point of view it's

play11:29

like entertainment food delivery

play11:30

education retail government trading

play11:32

driverless

play11:34

cars I also have reason to believe that

play11:38

according to conference data at least SRE

play11:41

is not the sole province of Planet

play11:44

spanning multinationals but has made its

play11:46

way into SMEs mid-range companies and

play11:50

so on the middle range having thickened up

play11:52

uh significantly over time

play11:55

which I think is a good sign for the

play11:57

profession so

play11:59

SRE material continues to sell well I am

play12:02

not looking happy because of this uh I

play12:05

make no money from this book uh why is

play12:08

SRE material continuing to sell well

play12:11

actually the model is new and I know you

play12:15

are sitting in your chair going well

play12:17

actually it's 20 years old you just told

play12:20

us of course the industry is large and

play12:23

it takes a lot of time for ideas to

play12:26

propagate and models in particular even

play12:29

if it's only the top tier you care about

play12:31

but I would say that if your world is

play12:34

just DevOps or ITIL or platform or

play12:37

whatever there is a genuine Gap into

play12:40

which SRE can fit there is a legitimate

play12:43

oh excellent will this collapse

play12:44

underneath me no it won't I'll move over

play12:46

here that'll make it better uh so if

play12:50

your world is DevOps ITIL platform eng

play12:53

etc. the SRE model does have something to

play12:56

say to you about how systems should be

play12:59

managed how organizations should be run

play13:03

or structured so on I think net net the

play13:06

market continues to grow and people want

play13:09

to continue to learn about it I mean

play13:11

that's why you're here that's why

play13:13

there's a huge proportion of new folks

play13:15

which I was very interested to

play13:18

see uh but also and this is a bit weird

play13:22

but it is true the content of these

play13:25

books is available for free and yet

play13:29

I know they still get bought they still

play13:32

get read online in various ways that

play13:35

actually cost money as opposed to not

play13:36

costing money so that suggests to me

play13:39

that there's still significant demand

play13:41

for what's regarded as high quality

play13:45

information so uh this book's title for

play13:49

those who can't read it is okay let's do

play13:53

your stupid idea which is the implicit

play13:55

title of a number of engineering

play13:56

conversations I have been in

play14:00

the general point I am trying to make is

play14:02

that there's a ton of ideas about how to

play14:06

understand the problem of software and

play14:09

systems management creation design and

play14:11

so on outlined in our body of

play14:15

knowledge and it would be one set of

play14:18

consequences for the universe if that

play14:20

knowledge was kind of scoped to us and

play14:23

jealously guarded and so on so forth but

play14:24

actually I don't think it's that way I

play14:26

think it has

play14:27

escaped and it's obviously hard to come

play14:29

up with objective measures for this in

play14:31

some sense but hey I've got 20 minutes

play14:35

so we're on to anecdotes the

play14:38

frequency I have noticed of explaining

play14:40

certain SRE ideas to kind of shall we

play14:45

say generic software Engineers or

play14:46

product Engineers uh has dropped so I

play14:50

find that I don't have to Define for

play14:53

example SLOs as often as I did several

play14:56

years ago etc. the knowledge is permeating

play14:59

getting out there another thing I see is

play15:02

that cross referencing to SRE ideas

play15:05

whether in the form of the book or

play15:06

papers or talks or whatever is also

play15:08

increasing over time I do think there's

play15:11

room for science in uh in looking at

play15:13

this but certainly my experience is that

play15:17

these two effects are are definitely

play15:20

ongoing but as well as SRE

play15:23

ideas permeating General

play15:27

engineering I think also some of them

play15:29

have made their way into General

play15:31

business Consciousness which is also

play15:33

pretty weird so uh Gartner and Forrester for

play15:37

those who don't know are kind of very

play15:39

large and important uh IT organizations

play15:43

IT strategy organizations you might

play15:45

have heard of the Gartner Magic

play15:47

Quadrant and a bunch of other things

play15:49

anyway they tell other organizations

play15:51

what to think about

play15:53

IT and one of the consequences of the

play15:57

massive popularity of the thing that we're

play15:59

doing is that the Machinery of business

play16:03

sat up and took notice and uh I won't

play16:07

use the phrase churns out but writes a

play16:09

lot of stuff about SRE and hence you see

play16:12

articles like this from Forrester and

play16:14

claims like this from

play16:16

Gartner uh as I say both well-known and

play16:20

extremely Salient

play16:22

organizations I think in particular I

play16:26

suppose you could query the

play16:28

numbers in the Gartner statement that by

play16:31

2027 75% of enterprises will use

play16:34

SRE practices up from 10% in 2022 but I

play16:38

think whatever about the state of

play16:41

those numbers SRE has created that

play16:45

conversation has revolutionized that

play16:47

conversation like we're not having that

play16:50

that sentence isn't written in a world

play16:52

where SRE doesn't

play16:53

happen and I think it's also interesting

play16:56

to note that uh this Forrester article from

play17:00

2018 doesn't conclude 46%

play17:04

of Google's SRE principles apply directly

play17:07

to your enterprise what about the rest

play17:09

no they're terrible don't use them it

play17:12

concludes that a bunch of the rest are

play17:13

also applicable under different kind of

play17:16

circumstances so in some sense the IT

play17:19

industry has looked at this and has

play17:23

picked the things that apply and has

play17:25

found that it actually works for them

play17:30

uh yeah I think another point I would

play17:34

say just hammering on the engineering

play17:37

piece again

play17:38

is uh SRE teams over the past 20 years

play17:41

have got a lot of engineering

play17:43

done again this is my personal

play17:46

experience my very limited view my

play17:48

myopia but as an example here are two

play17:51

papers describing Google systems that I

play17:53

had sight of what the SRE contribution

play17:57

was and it was vital to both of them and

play18:00

both of those systems were very

play18:01

successful internally I think might even

play18:04

I'm pretty sure the one on the right

play18:06

is still working and if the one on the

play18:08

left isn't working bad things are

play18:09

happening right now um externally like

play18:13

these papers have gone on to be cited

play18:16

hundreds of times and like there's

play18:19

there's huge impact from these things um

play18:21

I highlight the SRE contributor names as

play18:23

I remember them but of course I might

play18:25

forget them etc etc

play18:29

so again from my personal point of

play18:32

view it's also been great to see so many

play18:35

SREs doing work that benefits others

play18:39

aware of a wider kind of social context

play18:41

um or again for want of a better term

play18:44

ethical I'm actually fascinated by this

play18:46

and I've thought about it a lot over the

play18:48

years like why why is this is it because

play18:53

that SRE as a

play18:55

profession kind of thinks about the user

play18:58

or is geared around the positive

play19:00

experience of somebody else on a

play19:02

day-to-day basis or is it because of the

play19:05

strong sense of above and beyond or

play19:07

Mission

play19:08

orientation that Google SRE certainly had

play19:11

early on or maybe the historical

play19:14

accident of a bunch of people who joined

play19:15

early on either way it is notable to me

play19:19

at least that for such a young

play19:21

profession we've already had two Time

play19:24

magazine covers with SRE people on them on

play19:27

the left the US healthcare system such

play19:30

as it was rescued by SREs I think Mikey

play19:34

Dickerson who you may hear from later on

play19:37

uh goes on to lead a substantial

play19:40

government Department to try and do more

play19:41

of the same and on the right it's Susan

play19:44

Fowler SRE at Uber whose memo helps to

play19:47

kick off the #MeToo movement uh and I think

play19:50

again for me for people who feel SRE has

play19:53

nothing to do with this uh that actually

play19:56

fixing broken systems and fixing broken

play19:58

organizational systems and

play19:59

organizational postmortems are very

play20:01

definitely part of what SRE does I

play20:04

surfaced these two examples because

play20:06

they're convenient graphically I could

play20:08

mention others uh Liz Fong-Jones and

play20:11

Kiwi Farms and a bunch of other stuff

play20:12

prominently again from my limited

play20:15

perspective uh I would also note that

play20:17

when um generic software Engineers have

play20:19

made it to the cover of Time it's not

play20:21

necessarily because of the benefit to

play20:22

other

play20:23

people

play20:25

so continuing on that topic

play20:29

uh I just want to explore a little bit

play20:32

more in the context of this quote of a

play20:36

review of the SRE book that Chris Jones

play20:38

pointed out to me

play20:40

recently which is the SRE book has

play20:42

granted me the cursed knowledge cursed

play20:46

knowledge to see that the misery of toil

play20:49

and burnout misery sorry is entirely

play20:52

preventable and I think SRE at its best

play20:56

has a long history of doing things

play20:58

differently because it is the right

play21:00

thing to do not because we did it that

play21:03

way previously or we're always going to

play21:05

do it this way or

play21:07

whatever but here if the only thing we

play21:10

did as a profession if literally the

play21:13

only thing we accomplished was to help

play21:15

the industry understand that

play21:17

actually it might be cursed knowledge

play21:19

but actually burnout and toil and so on

play21:22

are preventable it would be worth it on

play21:24

its

play21:26

own now

play21:29

okay low lights so uh in true Centrist

play21:33

dad fashion time for the low lights uh

play21:36

first thing I will start with is the

play21:40

career

play21:41

pipeline the career pipeline I think is

play21:44

particularly problematic in SRE for

play21:46

junior level folks uh in fact I'm aware

play21:49

of one org where it was suggested SRE

play21:53

should only be defined for senior level

play21:55

and up uh and I've said elsewhere that

play21:58

one close analog for SRE is staff engineer

play22:02

like those kind of crosscutting concerns

play22:04

that you often see staff Engineers

play22:07

blessed with uh can resonate in the

play22:11

SRE world as well but if we can't ingest

play22:15

folks at the bottom of the profession

play22:16

and get them to the top and if we can't

play22:18

take all of the folks who want to do

play22:20

that then we have some trouble and SRE is

play22:24

intimidating from the outside I mean

play22:26

despite the wealth of educational

play22:28

material available we should do more

play22:30

about

play22:32

that uh this was an ATM in New Orleans

play22:36

where I refrained from pushing either of

play22:37

those buttons though I was

play22:39

tempted device initialize in particular

play22:43

uh it is useful it's a useful balance I

play22:46

think to say that as well as the success

play22:48

in adoption that we've outlined earlier

play22:51

we've also seen plenty of failures in SRE

play22:54

program adoption and in companies that

play22:56

would normally be considered kind of

play22:57

fertile ground for this kind of

play22:58

thing so large well resourced

play23:00

technically capable a lot at stake and

play23:03

so on my personal regret from these is

play23:05

not that the adoption failed but that

play23:08

it's not in the nature of the industry

play23:10

to write organizational post-mortems and

play23:13

I would love to see uh love to see those

play23:16

for the adoption

play23:17

failures it is also kind of an unpopular

play23:20

puffin opinion of mine uh but as well as

play23:23

the genuine problems that are caused by

play23:25

the mishandling of the term SRE dilution

play23:29

of what it means problems with the

play23:30

implementation and so on we worry about

play23:33

that a lot in my opinion possibly worry

play23:35

about identity too much maybe we

play23:37

should do less of

play23:39

that in

play23:42

particular I've spoken about this

play23:44

in detail in other kind of talks uh

play23:46

including other SREcon keynotes so I won't

play23:48

labor this point um but I will say that

play23:52

I don't feel we're really generating

play23:54

enough models about what it is that we

play23:57

are doing or trying to do um I

play24:00

specifically highlight mathematical

play24:02

there uh because I do think we have a

play24:04

lot of qualitative models we could

play24:07

always use more and that's totally fine

play24:09

um but I think we need some more

play24:11

quantitative models about what it is

play24:12

we're doing and in particular

play24:14

quantitative models about being able to

play24:17

put generic numeric Frameworks on the

play24:20

value of what it is we do on how much

play24:23

our work matters and I think my personal

play24:26

intuition is there's a lot of Runway for

play24:28

this and we just aren't really getting

play24:31

started uh it's our most urgent problem

play24:33

in my

play24:35

opinion uh you will be familiar with the

play24:37

fact that zero interest rates have gone

play24:39

away my mortgage certainly is very

play24:41

familiar with that fact thank you uh it

play24:44

has changed a bunch of behaviors in

play24:47

companies aligning with layoffs

play24:50

attitudes to expenditure for operations

play24:52

keeping things going and so on again

play24:54

I've spoken about this before won't

play24:55

embellish it but I will say that

play24:58

attacking the ideas that reliability no

play25:01

longer matters because growth is

play25:03

unavailable which is an argument I've

play25:05

heard and because we can't put a good

play25:10

number on the worth of the user

play25:11

experience we won't invest in it those

play25:14

are kind of key ideas to attack and for

play25:17

what it's worth I say abstract economic

play25:20

bases there because I'm pretty sure the

play25:21

customers still want their stuff to work

play25:24

whatever the abstract economic rationale

play25:28

is

play25:29

is and I think also there are several

play25:31

patterns which are pretty prevalent in

play25:34

human behavior and one of those is the

play25:35

tendency to confuse Doctrine with

play25:37

identity and so it's particularly

play25:39

natural if your world has been

play25:41

significantly

play25:42

disrupted uh you may end up clinging to

play25:45

doctrine in some sense if you don't

play25:46

have an easy replacement to

play25:49

hand but my final low light and I do

play25:52

think it's an important one is

play25:56

underpinning some of the earlier low

play25:59

lights at least at decision maker level

play26:02

is the persistent idea that operations

play26:04

is low

play26:06

status so of course worth is socially

play26:10

constructed to a large

play26:12

extent and it's also true that we have a

play26:15

societal bias against keeping things

play26:18

going or kind of paying for keeping

play26:19

things going a bias a kind of

play26:21

intellectual vice in some sense against

play26:23

good pattern execution rather than

play26:25

creation of new patterns and so on and I

play26:28

can stand here and say that I think we

play26:30

need to fix this which is

play26:32

true but in a talk which is mostly about

play26:35

reflection I'm thinking about the

play26:37

relationship with SRE and the past 20

play26:40

years and all of that does Put Me In

play26:43

Mind of one potential summary of the

play26:46

past 20

play26:48

years which is to wit the radical idea

play26:52

that it is legitimate to apply software

play26:55

techniques and systems thinking to the

play26:58

operations

play26:59

domain and in 2024 I stand here and I

play27:03

look at the past 20 years and I look at

play27:05

the state of the industry and the state

play27:06

of the world and I say that that idea is

play27:10

still

play27:12

radical thanks for your time come and

play27:14

say up

play27:15

boo thank you very much
