Infrastructure as code

Google Cloud Tech
22 Aug 2024 11:39

Summary

TLDR: In this discussion, Steve McGhee, a former Google SRE, shares his expertise on building reliable systems in the cloud. He introduces Infrastructure as Code (IaC) as a remedy for a common problem, a service broken by permission tweaks, and demonstrates how an IaC tool like Terraform can quickly restore a known good state. McGhee offers practical advice on implementing IaC, integrating it with a CI/CD pipeline, and establishing a reconciliation loop to prevent infrastructure drift, all aimed at making projects more resilient and developers more efficient.

Takeaways

  • 😀 Infrastructure as Code (IAC) uses version control for infrastructure, similar to how it's done for application code.
  • 🛠️ IAC tools like Terraform allow you to define the desired state of your infrastructure and automatically reconcile any differences.
  • 🔄 IAC promotes idempotency in infrastructure changes, ensuring that running the same configuration repeatedly won't cause unintended side effects.
  • 📝 Describing infrastructure in runnable files enables precise updates and collaboration, akin to source code management.
  • 🔧 Using IAC can quickly restore service after an accidental change or outage, by reapplying the last known good configuration.
  • 👥 IAC supports a collaborative environment where infrastructure changes are reviewed and approved through pull requests.
  • 🚫 Imperative scripts, like shell scripts with `gcloud` commands, are less safe and flexible compared to declarative IAC languages.
  • 🔄 The ability to recreate environments rapidly with IAC is crucial for scenarios like developer onboarding or setting up new test environments.
  • 🗃️ Handling data with IAC is complex and should be managed with dedicated database tools for backups and restorations.
  • 🔄 A reconciliation loop is an advanced IAC practice to detect and respond to changes made outside the IAC system, preventing infrastructure drift.
  • 🔄 Integrating IAC with CI/CD pipelines can be done in various ways, with dry runs being a best practice to avoid unintended consequences.

Q & A

  • What issue did Martin Omander encounter with his Google Cloud project?

    -Martin Omander faced a problem where tuning the permissions in his Google Cloud project broke his Cloud Run service, and he couldn't revert to a working state due to the absence of an undo button.

  • What is Steve McGhee's background in the field of site reliability?

    -Steve McGhee has over a decade of experience as a site reliability engineer (SRE) at Google, where he worked on Search, Android, YouTube, and Cloud. He now helps developers build reliable systems in the Cloud.

  • What is Infrastructure as Code (IAC) and how does it relate to version control?

    -Infrastructure as Code (IAC) is the practice of managing and provisioning infrastructure through machine-readable files rather than manual changes. Steve compares it to using version control for your infrastructure: changes are tracked and reviewed, which makes managing infrastructure safer and more systematic.

  • How can IAC help in restoring a broken application?

    -IAC can help by maintaining a record of the last known good configuration in the form of runnable files. In case of a failure, these files can be used to quickly restore the service to its previous working state without the need for manual troubleshooting.

  • What is the difference between using an imperative language and a declarative language in IAC?

    -An imperative language specifies how to achieve a task through a series of steps, while a declarative language specifies what the end state should be, and the tool figures out the steps to reach that state. Declarative languages are often used in IAC for their idempotent nature, ensuring no unintended side effects when run multiple times.

  • How can IAC assist in setting up new environments for developers or testing?

    -IAC allows for the quick creation of new environments by using the infrastructure files to replicate the necessary settings. This can be done in minutes and is especially useful when a new developer joins a team or when setting up temporary test environments.

  • What is a reconciliation loop in the context of IAC?

    -A reconciliation loop is a process that regularly checks the current state of the infrastructure against the desired state defined in the IAC files. It helps detect and address any discrepancies that may have been made outside of the IAC process, thus preventing infrastructure drift.

  • How does IAC integrate with a CI/CD pipeline?

    -IAC can be integrated into a CI/CD pipeline by either running it as part of the deployment process or managing infrastructure updates separately. It's important to perform a dry run to evaluate and approve proposed changes before they are applied.

  • What is the significance of using a declarative language like Terraform for IAC?

    -Using a declarative language like Terraform for IAC allows for specifying the desired state of the infrastructure. Terraform then determines the necessary changes to achieve this state, making the process safer and more efficient than imperative scripts.

  • How can someone with an existing project that has evolved over time start using IAC?

    -One can start using IAC by using tools that can inspect the current GCP project and create IAC descriptions of the current state. They can then test these files in a controlled environment and gradually expand their IAC implementation as they gain confidence.

  • What are the key takeaways from the discussion on IAC in the video script?

    -The key takeaways are to describe infrastructure in runnable files checked into source control, work towards being able to recover from changes in minutes, and eventually set up a reconciliation loop to prevent infrastructure drift.

Outlines

00:00

🛠️ Infrastructure as Code for Reliability

In this segment, Martin Omander discusses the importance of Infrastructure as Code (IaC) with Steve McGhee, a former Google SRE. Steve explains how IaC, similar to version control for infrastructure, can prevent and resolve issues like the one Martin experienced with his broken Cloud Run service. He demonstrates using Terraform to revert to a previous configuration, emphasizing the benefits of IaC for maintaining and updating cloud infrastructure efficiently and safely. The conversation highlights the process of applying Terraform to restore service accounts and other components, showcasing a practical approach to infrastructure management.

05:01

🔄 The Power of Idempotency and IaC

This paragraph delves into the concept of idempotency in infrastructure management, where changes can be made repeatedly without causing unintended effects. Steve McGhee discusses the advantages of using declarative languages like Terraform over imperative scripts, which can lead to errors if not carefully managed. The discussion moves to the practical aspects of IaC, such as quickly setting up new environments for developers or testing, and the importance of handling data with dedicated tools. The paragraph also introduces the concept of a reconciliation loop, a proactive measure to detect and address changes made outside of IaC, ensuring infrastructure remains in the desired state.

10:01

🌐 Implementing IaC and Integrating with CI/CD

The final paragraph provides actionable advice on how to start implementing IaC. Steve McGhee suggests beginning with a service you're familiar with and gradually expanding as confidence grows. He also addresses how IaC can be integrated into existing CI/CD pipelines, offering two approaches: running IaC as part of the pipeline with careful consideration of potential consequences, or managing infrastructure updates separately. The paragraph concludes with considerations for organizations with a dedicated infrastructure team and how they can leverage IaC to standardize processes and reduce the need for reactive maintenance.

Keywords

💡Google Cloud

Google Cloud is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, file storage, and YouTube. In the video, Martin's Google Cloud project is the context for the discussion on infrastructure management and the challenges he faces with permissions and Cloud Run services.

💡Infrastructure as Code (IaC)

Infrastructure as Code is a methodology where infrastructure is described, managed, and provisioned using code and software development practices. It is central to the video's theme, as Steve McGhee explains how IaC could have prevented Martin's issue and provides insights on its benefits, such as version control for infrastructure and quick recovery from misconfigurations.

💡Cloud Run

Cloud Run is a fully managed serverless platform that automatically scales your stateless containers. In the script, Cloud Run is the service that Martin is using and which is broken due to permission issues, highlighting the importance of managing cloud services carefully.

💡Service Accounts

Service accounts in Google Cloud are special types of Google accounts that belong to an application or a virtual machine (VM), instead of an individual user. They are used to provide identity and authentication for applications. In the video, Steve discusses the impact of accidentally removing a role from a service account, which breaks the web app.

💡Terraform

Terraform is an IaC tool developed by HashiCorp that allows for the creation, modification, and versioning of infrastructure. It is used in the script as an example of how to manage infrastructure with code, enabling Steve to restore the last known good configuration of his project.

💡Version Control

Version control is a system that records changes to a file or set of files over time so that developers can recall specific versions later. In the context of the video, Steve McGhee suggests using version control for infrastructure, just as it's used for application code, to maintain a history of changes and facilitate collaboration.

💡Reconciliation Loop

A reconciliation loop is a process that ensures the actual state of the infrastructure matches the desired state defined in the IaC files. It is mentioned in the video as a best practice to prevent infrastructure drift, where changes made outside of the IaC tool could lead to discrepancies in the infrastructure state.

💡Idempotency

Idempotency in the context of IaC refers to the property of operations that can be applied multiple times without changing the result beyond the initial application. Steve McGhee explains that changes made using declarative tools like Terraform are idempotent, making them safer for infrastructure management.

💡CI/CD Pipeline

CI/CD stands for Continuous Integration/Continuous Deployment (or Delivery). It refers to a set of practices in which changes from multiple contributors are integrated, tested, and released through an automated pipeline. In the video, Steve discusses how IaC can be integrated with a CI/CD pipeline to streamline infrastructure and application deployment.

💡Infrastructure Drift

Infrastructure drift refers to the divergence of the actual infrastructure state from the intended state, often due to manual changes or external factors. The video emphasizes the importance of preventing infrastructure drift through the use of IaC and reconciliation loops to maintain consistency and reliability.

💡Declarative Language

A declarative language in the context of IaC is a way of specifying the desired state of the environment without explicitly defining the steps to achieve it. Steve McGhee contrasts this with imperative scripts, which detail the steps to be taken, and explains that declarative tools like Terraform allow for safer and more efficient infrastructure management.

Highlights

Steve McGhee, a former Google SRE, shares insights on building reliable systems in the cloud.

Infrastructure as Code (IaC) is introduced as a solution to manage cloud infrastructure effectively.

IaC allows for version control of infrastructure, similar to application code.

Terraform is used to demonstrate how to restore a broken Cloud Run service to a previous working state.

The importance of having a last known good configuration for quick recovery is emphasized.

Alternatives to Terraform, like Pulumi, are mentioned for infrastructure management.

Advice on starting with IaC by describing infrastructure in runnable files checked into source control.

The concept of setting up a system to recover from infrastructure changes within minutes is discussed.

The idea of a reconciliation loop to prevent infrastructure drift is introduced.

Steve explains the difference between imperative and declarative languages in IaC tools.

The benefits of idempotent changes in IaC for safety and reducing side effects are highlighted.

The ability to recreate deleted projects quickly using IaC is explored.

Steve suggests using IaC to create new environments for developers or testing.

The complexity of managing database data in IaC and the need for dedicated tools is mentioned.

How to handle changes made outside of IaC tools and the importance of maintaining infrastructure integrity.

The relationship between IaC and CI/CD pipelines, and the options for integrating IaC into development workflows.

The role of a separate infrastructure team in organizations and how they can leverage IaC.

Tools that can generate IaC descriptions from existing GCP projects to help get started with IaC.

Key takeaways on implementing IaC in projects to ensure reliability and quick recovery from infrastructure issues.

Transcripts

play00:00

MARTIN OMANDER: Steve, I was tuning the permissions

play00:02

in my Google Cloud project, and that broke my Cloud Run service.

play00:06

There's no undo button, so I can't get back

play00:09

to a working state.

play00:10

STEVE MCGHEE: Well, infrastructure as code

play00:12

might have prevented that.

play00:13

Let's see how.

play00:14

[MUSIC PLAYING]

play00:23

MARTIN OMANDER: Welcome back to the show, Steve.

play00:25

I'm so happy you're here to tell us about reliability.

play00:28

You have some solid experience.

play00:30

STEVE MCGHEE: Thanks, Martin.

play00:31

Yeah.

play00:32

I was a site reliability engineer, or SRE, inside

play00:35

Google for over a decade.

play00:37

I worked on Search, Android, YouTube, and Cloud.

play00:40

Now my job is helping developers understand

play00:42

how to build reliable systems in the Cloud.

play00:45

MARTIN OMANDER: And you said infrastructure as code

play00:47

could help me with project settings?

play00:49

STEVE MCGHEE: Yeah.

play00:50

Infrastructure as code or IAC is like using version control

play00:54

for your infrastructure.

play00:55

We all know it's safer to keep the application

play00:57

code in source control.

play00:59

It's a good idea to do the same for your infrastructure.

play01:02

For example, I have a web app here

play01:04

that uses Cloud Run, a Firestore database,

play01:07

and a few other components.

play01:09

Let's say I'm playing around with the service account

play01:11

settings and I accidentally remove a role from one of them.

play01:16

Now my users will get an error message, the web app is broken.

play01:20

MARTIN OMANDER: Yeah, that's exactly what

play01:22

happened to my application.

play01:23

STEVE MCGHEE: Right.

play01:24

The application is broken now, and we're losing money.

play01:26

Everyone is stressed, and the application

play01:28

needs to come back online ASAP.

play01:31

It's hard to do troubleshooting in this kind of environment.

play01:34

MARTIN OMANDER: Yep.

play01:35

Been there, done that.

play01:36

STEVE MCGHEE: So instead of troubleshooting,

play01:38

let's run the Terraform file that

play01:40

describes our last known good configuration.

play01:43

I run terraform apply in this terminal, then I confirm,

play01:47

and off it goes.

play01:48

This may take a few minutes.

play01:50

MARTIN OMANDER: And now it's restoring the service accounts

play01:52

in your project?

play01:53

STEVE MCGHEE: Yeah.

play01:54

It's restoring the service accounts, the Cloud Run

play01:56

services, my database settings, any virtual machines

play01:59

I might have, and so on.

play02:01

And now it's done.

play02:02

Let's try accessing the web app again.

play02:05

The app is working again.

play02:06

We didn't have to do any troubleshooting

play02:08

in a stressful situation.

play02:09

MARTIN OMANDER: Ah, excellent.

play02:11

And you use Terraform for this?

play02:12

STEVE MCGHEE: Yeah.

play02:13

I use Terraform, but there are alternatives

play02:15

like Pulumi and others.

play02:17

Some Cloud providers also offer their own tools.

play02:19

The one thing they have in common

play02:21

is that they update the infrastructure in a project,

play02:23

like you might do in the Cloud Console or on the command line

play02:26

with the gcloud command.

play02:28

MARTIN OMANDER: I don't have infrastructure

play02:29

as code in my project now, and I'm a little overwhelmed.

play02:33

How can I get started?

play02:34

STEVE MCGHEE: Sure.

play02:35

First, describe your infrastructure as runnable files

play02:38

that you can check into source control.

play02:40

It's fine to do this for part of your system

play02:42

and expand as you grow more comfortable.

play02:45

Then your goal is to set things up so that if someone changes

play02:49

the infrastructure in your project,

play02:50

you'd be able to recover in minutes.

play02:53

Finally, and this may be weeks or months down the road,

play02:56

set up a reconciliation loop to prevent infrastructure drift.

play03:01

MARTIN OMANDER: All right.

play03:02

Tell me more about that first item on your list.

play03:04

STEVE MCGHEE: OK.

play03:05

By having your infrastructure described in runnable files,

play03:08

you can see the desired state of your infrastructure.

play03:10

For example, if your project includes Cloud Run services,

play03:13

this desired state would describe things

play03:15

like in which region the service is deployed, if each service is

play03:19

open to anonymous users or if it's authenticated,

play03:22

and what service accounts are used.

play03:24

Now that these settings are in runnable files,

play03:27

you can update them with precision

play03:29

and collaborate around them, just like you

play03:31

would with source code.

play03:32

MARTIN OMANDER: Could you give us an example of that?

play03:35

STEVE MCGHEE: Sure.

play03:35

Let's say you have a Cloud Run service that is public

play03:38

and you described it using Terraform.

play03:40

Here's what it would look like.

play03:41

Now, let's say in the next release,

play03:43

this Cloud Run service should be an authenticated service that

play03:47

can only be called by other services, not by end users.

play03:50

A developer would update the Terraform file accordingly

play03:53

and submit it to source control.
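
The Terraform file shown on screen is not part of this transcript. As a rough stand-in, the sketch below models the same change in Python: a service description stored as data, and the diff a reviewer would see when the service goes from public to authenticated. The service name, region, and field names are hypothetical and do not follow the real Terraform or Cloud Run schema.

```python
import difflib
import json

# Hypothetical desired-state description of one Cloud Run service.
# Field names are illustrative only, not a real Terraform or Cloud Run schema.
current_release = {
    "name": "orders-api",
    "region": "us-central1",
    "allow_unauthenticated": True,
    "service_account": "orders-api-sa",
}

# Next release: the same service, but callers must now authenticate.
next_release = dict(current_release, allow_unauthenticated=False)


def review_diff(old: dict, new: dict) -> list[str]:
    """Return the unified diff a reviewer would see in a pull request."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "before", "after", lineterm="")
    )


if __name__ == "__main__":
    print("\n".join(review_diff(current_release, next_release)))
```

Because the description is plain text under source control, the change to the authentication setting shows up as a one-line diff that another developer can question or approve.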

play03:55

MARTIN OMANDER: And then that change

play03:56

would use the same approval process

play03:58

as any other code change?

play04:00

STEVE MCGHEE: Yes.

play04:01

The developer would create a pull request,

play04:03

and it would be reviewed by another developer who might ask

play04:05

questions or propose changes.

play04:07

When the change has been approved

play04:09

and it's time to release the next version of the system,

play04:11

the Terraform file would be run and change

play04:14

the authentication setting for the Cloud Run service.

play04:17

MARTIN OMANDER: Back when I worked in startups,

play04:19

we used to document the production environment

play04:21

very carefully.

play04:23

Is that the same thing?

play04:24

STEVE MCGHEE: Not really.

play04:25

A document still has to be interpreted

play04:27

by a person who might make mistakes or misunderstand.

play04:31

This is less ambiguous and less prone to errors

play04:34

if you specify the infrastructure

play04:35

in these runnable files.

play04:37

MARTIN OMANDER: I have a bunch of gcloud commands

play04:39

in a runnable shell script.

play04:40

Is that infrastructure as code?

play04:42

STEVE MCGHEE: Kind of.

play04:43

But that shell script does the same set of steps

play04:46

every time, regardless of the current state of the running

play04:49

system.

play04:50

We call that an imperative language.

play04:52

Terraform, Pulumi, and some other tools like them

play04:56

use declarative languages.

play04:57

That means that you specify what your environment should

play05:00

look like.

play05:01

The tool then figures out what needs to change, if anything.

play05:05

MARTIN OMANDER: That sounds like idempotency.

play05:07

STEVE MCGHEE: That's right.

play05:08

These changes are idempotent.

play05:10

That is, they can be run again and again

play05:12

without any unintended side effects.

play05:14

That makes them safer than a simple imperative script which

play05:17

might go off and make the same VM over

play05:20

and over every time you ran it if you weren't careful.
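
The difference Steve describes can be sketched with a toy model in Python. This is not how Terraform is implemented; the list of VM names standing in for a project, and both function names, are assumptions made for illustration.

```python
def imperative_create(vms: list[str]) -> None:
    """Imperative: performs the same step on every run, whatever already exists."""
    vms.append("web-vm")


def declarative_ensure(vms: list[str], desired: list[str]) -> list[str]:
    """Declarative: compare desired with actual and create only what is missing.

    Returns the names that were created; an empty list means nothing changed.
    """
    created = [name for name in desired if name not in vms]
    vms.extend(created)
    return created


if __name__ == "__main__":
    project_a: list[str] = []
    for _ in range(3):
        imperative_create(project_a)
    print(project_a)  # three copies of the same VM

    project_b: list[str] = []
    for _ in range(3):
        declarative_ensure(project_b, ["web-vm"])
    print(project_b)  # one VM, however many times it runs
```

Running the declarative version a second time finds nothing to change, which is the idempotent behavior discussed above.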

play05:23

MARTIN OMANDER: Ah.

play05:23

Very good.

play05:25

Item number two on your list has to do with recreating

play05:28

a deleted project.

play05:29

STEVE MCGHEE: Well, hopefully you

play05:30

have some safeguards in place so that your projects can't

play05:33

be easily deleted by mistake, but it's a good litmus test.

play05:36

Ask yourself, how long would it take

play05:38

to recreate your production application

play05:40

if its project was deleted?

play05:42

You can reduce that time significantly with IAC.

play05:45

MARTIN OMANDER: Got it.

play05:46

STEVE MCGHEE: So once you have those runnable infrastructure

play05:48

files, you can use them to create new environments.

play05:51

For example, when a new developer joins your team,

play05:54

you may want to set up a new development project for them.

play05:57

If you have runnable infrastructure files,

play05:59

you can do that in minutes.
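
One way to picture this is a single shared definition that takes the project and environment as parameters. The Python sketch below is only an illustration: the function name, the resource layout, and the project IDs are hypothetical, and a real setup would express the same idea in the IaC tool's own language.

```python
def environment_for(project_id: str, env: str) -> dict:
    """Build the desired state for one environment from a shared definition."""
    return {
        "project": project_id,
        # Every environment gets the same shape of infrastructure.
        "cloud_run_service": {"name": f"web-app-{env}", "region": "us-central1"},
        "firestore_database": {"name": "(default)"},
        "service_account": f"web-app-{env}@{project_id}.iam.gserviceaccount.com",
    }


if __name__ == "__main__":
    # A development project for a new team member (hypothetical project ID).
    print(environment_for("dev-alice-project", "dev"))
    # A short-lived test environment built from the same definition.
    print(environment_for("feature-x-test", "test"))
```

Only the parameters differ between environments, so a new project starts out with the same structure as the existing ones.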

play06:01

MARTIN OMANDER: I guess the same goes for new test environments?

play06:03

STEVE MCGHEE: You're right, it does.

play06:05

For example, you might want to set up a short-lived test

play06:07

environment for a major new feature of your application.

play06:10

IAC makes that easy.

play06:12

MARTIN OMANDER: What about data in a database?

play06:14

That would be needed too to set up a new project, right?

play06:17

STEVE MCGHEE: That's right.

play06:18

Data is more complex, so it's best

play06:19

to use dedicated tools for that.

play06:21

For example, you may use database tools

play06:23

to back up your production data at regular intervals,

play06:26

and you probably have a different set

play06:27

of data for development that doesn't include

play06:30

sensitive production data.

play06:31

You'd use IAC to create your infrastructure,

play06:34

then you'd use a database tool to restore data into it.

play06:37

MARTIN OMANDER: Makes sense.

play06:38

Let's say I have created an infrastructure file so I can

play06:41

restore my system in minutes.

play06:43

Then the last item on your list mentions a reconciliation loop.

play06:47

What's that?

play06:48

STEVE MCGHEE: So this is an extension of IAC,

play06:50

but many teams find it useful.

play06:52

Here's how it works.

play06:53

People might make changes to your production or test

play06:56

environments outside of IAC.

play06:58

For example, someone in your organization

play07:00

might notice that a service account has too many privileges.

play07:03

They might change that manually in the Cloud Console.

play07:06

MARTIN OMANDER: And that could break the application?

play07:08

STEVE MCGHEE: Yes, it could break the application.

play07:10

But even if it doesn't, we can no longer

play07:12

be sure what our infrastructure looks like or when it changed.

play07:16

We need a way to notice changes made outside of our IAC tool.

play07:20

MARTIN OMANDER: And what should we do if we notice a change?

play07:23

STEVE MCGHEE: Well, you could report the change

play07:25

or you could overwrite it.

play07:26

It's up to you which you prefer.

play07:28

You may want to start with just reporting,

play07:30

by sending a diff between the desired

play07:33

environment and the actual environment. It's like an alert.
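
A report-only check of this kind could look roughly like the Python sketch below. It assumes both the desired and the actual environment have already been read into flat dictionaries; how they are fetched is left out, and the keys and service account name are hypothetical.

```python
def drift_report(desired: dict, actual: dict) -> list[str]:
    """List every setting whose actual value differs from the desired value."""
    report = []
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, "<absent>")
        have = actual.get(key, "<absent>")
        if want != have:
            report.append(f"{key}: desired={want!r} actual={have!r}")
    return report


# Desired state as recorded in source control (hypothetical keys).
desired = {
    "web-app.region": "us-central1",
    "web-app-sa.roles": ["roles/datastore.user", "roles/logging.logWriter"],
}

# Actual state after someone removed a role by hand in the console.
actual = {
    "web-app.region": "us-central1",
    "web-app-sa.roles": ["roles/logging.logWriter"],
}

if __name__ == "__main__":
    for line in drift_report(desired, actual):
        print("DRIFT", line)
```

An empty report means the environment matches its description. A non-empty one can be sent to the team as an alert, or used as the trigger for overwriting the manual change.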

play07:35

MARTIN OMANDER: And how would I get started

play07:37

with this reconciliation loop?

play07:39

STEVE MCGHEE: Well, you'd set up a nightly check, like in a cron job.

play07:42

Then you would run the check more often

play07:44

to see if you see some benefits.

play07:46

One option is to run it against a subset of your infrastructure

play07:49

if that subset is extra important.

play07:52

MARTIN OMANDER: All right.

play07:53

I have some questions for you, Steve.

play07:54

STEVE MCGHEE: Go ahead, Martin.

play07:56

MARTIN OMANDER: So I already have a CI/CD pipeline

play07:58

based on your recommendations from my previous video.

play08:01

How does infrastructure as code relate to that CI/CD pipeline?

play08:04

STEVE MCGHEE: OK.

play08:05

There are two approaches.

play08:07

Either run IAC as part of the CI/CD pipeline.

play08:10

There would be two passes, one to update

play08:13

the infrastructure, another to deploy applications to it.

play08:16

But you can imagine there could be unintended consequences

play08:19

this way.

play08:20

So you may want to run the dry run command every time

play08:23

and provide the developer an opportunity

play08:25

to evaluate if they want the proposed infrastructure

play08:28

changes to be applied or not.
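
The dry-run-then-approve flow can be sketched in Python as follows. This models the idea only and does not call any IaC tool; the `plan` and `apply_with_approval` functions and the `approve` callback are hypothetical names.

```python
from typing import Callable


def plan(desired: dict, actual: dict) -> dict:
    """Dry run: compute the changes that would be made, without making them."""
    return {
        key: value
        for key, value in desired.items()
        if key not in actual or actual[key] != value
    }


def apply_with_approval(
    desired: dict, actual: dict, approve: Callable[[dict], bool]
) -> bool:
    """Show the plan to an approver and apply it only if they accept.

    Returns True if changes were applied, False otherwise.
    """
    changes = plan(desired, actual)
    if not changes:
        return False  # nothing to do
    if not approve(changes):
        return False  # plan rejected; infrastructure left untouched
    actual.update(changes)
    return True


if __name__ == "__main__":
    desired = {"web-app.auth": "authenticated", "web-app.region": "us-central1"}
    actual = {"web-app.auth": "public", "web-app.region": "us-central1"}
    print(plan(desired, actual))  # only the auth setting would change
    print(apply_with_approval(desired, actual, approve=lambda changes: True))
```

In a pipeline, the `approve` step would be a person reviewing the proposed changes before the second pass deploys the application.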

play08:29

MARTIN OMANDER: Got it.

play08:30

And what's the other approach?

play08:32

STEVE MCGHEE: The other approach is

play08:33

to update infrastructure separately, like just

play08:35

on initial creation and then periodically after that.

play08:39

These might be suggested by security or compliance teams,

play08:42

and they could even be managed separately.

play08:45

Again, the best practice here would

play08:46

be to do the dry run first, like terraform plan,

play08:49

and inform the app owners of the intended changes

play08:52

before they are applied.

play08:54

You have to find the right approach that's

play08:55

right for you, your application, and your team.

play08:58

MARTIN OMANDER: And in some organizations,

play09:00

there is a separate infrastructure team.

play09:03

Does that change things?

play09:04

STEVE MCGHEE: Then IAC may run outside of that CI/CD pipeline.

play09:08

The infrastructure team could choose to release updates

play09:10

on a different cycle, but they'd still use IAC when they do that.

play09:14

The infrastructure experts can deliver templates

play09:17

to the developers so the developers

play09:18

don't have to become infrastructure

play09:20

experts themselves.

play09:21

For example, they may create a Terraform template

play09:24

for an internal-only Cloud Run service.

play09:26

That saves time for the developers

play09:28

and it means that the infrastructure is standardized

play09:30

across applications.

play09:32

The infrastructure team will do less firefighting and more

play09:35

designing of fire trucks.

play09:37

MARTIN OMANDER: I love that analogy, Steve.

play09:39

So my Google Cloud project has evolved over time

play09:42

as I've made dozens and dozens of tweaks over the years.

play09:46

How do I even get started with infrastructure as code?

play09:49

STEVE MCGHEE: Well, there are tools

play09:50

that can inspect your GCP project

play09:52

and create IAC descriptions of your current state.

play09:55

You can get started by running this gcloud command.

play09:58

There are also third-party tools which are more specialized

play10:01

and may be a better fit for you.

play10:02

Use one of these tools to get your current state.

play10:05

Focus on one service that you know well, like Cloud Run,

play10:08

and make sure the IAC files work really

play10:10

well for that service in a test environment,

play10:13

then expand to other parts of your system

play10:15

as you gain confidence.

play10:16

MARTIN OMANDER: All right, Steve.

play10:18

That was a lot of information.

play10:19

What are the takeaways?

play10:21

STEVE MCGHEE: First, describe your infrastructure

play10:23

as runnable files that you can check into source control.

play10:26

It's fine to do this for part of your system

play10:28

and then expand as you grow more comfortable.

play10:30

Then keep working on this until you've set things

play10:33

up so that if someone changes the infrastructure

play10:35

in your project, you'd be able to recover in minutes.

play10:39

Finally, and this may be weeks or months down the road,

play10:42

set up a reconciliation loop to prevent infrastructure drift.

play10:46

MARTIN OMANDER: Sounds good, Steve.

play10:48

Thanks for sharing this with us.

play10:49

STEVE MCGHEE: Thanks for having me, Martin.

play10:51

MARTIN OMANDER: And thank you, everyone, for watching.

play10:53

If you have questions for Steve or me,

play10:56

please enter them in the comments below.

play10:58

Also, let me know if there are other serverless topics you'd

play11:02

like to see in future episodes.

play11:04

I read every single comment.

play11:07

Until next time.

play11:09

[MUSIC PLAYING]

Related Tags
Infrastructure as Code, Google Cloud, Reliability, SRE, Cloud Run, Service Accounts, Version Control, Terraform, CI/CD, DevOps, Automation