Infrastructure as code
Summary
TLDR: In this insightful discussion, Steve McGhee, a former Google SRE, shares his expertise on building reliable systems in the cloud. He introduces Infrastructure as Code (IaC) as a solution to the common issue of broken services due to permission tweaks, demonstrating how IaC tools like Terraform can revert to a known good state quickly. McGhee offers practical advice on implementing IaC, setting up a CI/CD pipeline, and establishing a reconciliation loop to prevent infrastructure drift, all aimed at enhancing project resilience and developer efficiency.
Takeaways
- 😀 Infrastructure as Code (IaC) puts infrastructure under version control, just as is done for application code.
- 🛠️ IaC tools like Terraform let you define the desired state of your infrastructure and automatically reconcile any differences.
- 🔄 IaC promotes idempotency in infrastructure changes, ensuring that running the same configuration repeatedly won't cause unintended side effects.
- 📝 Describing infrastructure in runnable files enables precise updates and collaboration, akin to source code management.
- 🔧 IaC can quickly restore service after an accidental change or outage by reapplying the last known good configuration.
- 👥 IaC supports a collaborative environment where infrastructure changes are reviewed and approved through pull requests.
- 🚫 Imperative scripts, like shell scripts with `gcloud` commands, are less safe and flexible than declarative IaC languages.
- 🔄 The ability to recreate environments rapidly with IaC is crucial for scenarios like developer onboarding or setting up new test environments.
- 🗃️ Data is more complex than infrastructure settings, so backups and restores should be handled with dedicated database tools rather than with IaC.
- 🔄 A reconciliation loop is an advanced IaC practice that detects and responds to changes made outside the IaC system, preventing infrastructure drift.
- 🔄 IaC can be integrated with CI/CD pipelines in various ways, with dry runs being a best practice to avoid unintended consequences.
Q & A
What issue did Martin Omander encounter with his Google Cloud project?
-Martin Omander faced a problem where tuning the permissions in his Google Cloud project broke his Cloud Run service, and he couldn't revert to a working state due to the absence of an undo button.
What is Steve McGhee's background in the field of site reliability?
-Steve McGhee has over a decade of experience as a site reliability engineer (SRE) at Google, where he worked on Search, Android, YouTube, and Cloud. He now helps developers build reliable systems in the Cloud.
What is Infrastructure as Code (IAC) and how does it relate to version control?
-Infrastructure as Code (IAC) is the practice of managing and provisioning infrastructure through machine-readable files kept under version control, just as you would application code. It allows for safer and more systematic management of infrastructure changes.
How can IAC help in restoring a broken application?
-IAC can help by maintaining a record of the last known good configuration in the form of runnable files. In case of a failure, these files can be used to quickly restore the service to its previous working state without the need for manual troubleshooting.
What is the difference between using an imperative language and a declarative language in IAC?
-An imperative language specifies how to achieve a task through a series of steps, while a declarative language specifies what the end state should be, and the tool figures out the steps to reach that state. Declarative languages are often used in IAC for their idempotent nature, ensuring no unintended side effects when run multiple times.
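The contrast can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `imperative_script`, `declarative_apply`, and the dictionary-based "project" are hypothetical names invented here, not how Terraform or any real tool is implemented.

```python
# Toy model: a "project" is plain Python data, not a real cloud API.

def imperative_script(vms: list[str]) -> None:
    """Runs the same step every time, regardless of current state."""
    vms.append("vm")  # creates another VM on every run


def declarative_apply(project: dict[str, dict], desired: dict[str, dict]) -> list[str]:
    """Compare desired state with actual state and change only what differs."""
    actions = []
    for name, settings in desired.items():
        if name not in project:
            actions.append(f"create {name}")
        elif project[name] != settings:
            actions.append(f"update {name}")
        project[name] = dict(settings)
    for name in list(project):
        if name not in desired:
            actions.append(f"delete {name}")
            del project[name]
    return actions
```

Running `imperative_script` twice leaves two VMs behind, while a second call to `declarative_apply` with the same desired state returns an empty action list, which is the idempotent behavior described above.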
How can IAC assist in setting up new environments for developers or testing?
-IAC allows for the quick creation of new environments by using the infrastructure files to replicate the necessary settings. This can be done in minutes and is especially useful when a new developer joins a team or when setting up temporary test environments.
What is a reconciliation loop in the context of IAC?
-A reconciliation loop is a process that regularly checks the current state of the infrastructure against the desired state defined in the IAC files. It helps detect and address any discrepancies that may have been made outside of the IAC process, thus preventing infrastructure drift.
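One pass of such a loop might look like the following minimal sketch. It assumes the desired and actual states have already been loaded into plain dictionaries; `diff_state`, `reconcile_once`, and the `overwrite` flag are hypothetical names used for illustration only.

```python
def diff_state(desired: dict, actual: dict) -> list[str]:
    """Return one readable line per setting that differs."""
    lines = []
    for key in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            lines.append(f"{key}: desired={want!r} actual={have!r}")
    return lines


def reconcile_once(desired: dict, actual: dict, overwrite: bool = False) -> list[str]:
    """Report drift; optionally overwrite the actual state with the desired one."""
    drift = diff_state(desired, actual)
    if drift and overwrite:
        actual.clear()
        actual.update(desired)
    return drift
```

With `overwrite=False` the function only reports, which matches the advice to start with reporting; a scheduler (for example a nightly job) would call it and send the returned lines as an alert.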
How does IAC integrate with a CI/CD pipeline?
-IAC can be integrated into a CI/CD pipeline by either running it as part of the deployment process or managing infrastructure updates separately. It's important to perform a dry run to evaluate and approve proposed changes before they are applied.
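The dry-run-then-approve flow can be sketched as below. This is a simplified model, not a real pipeline integration: `plan`, `pipeline_step`, and the `approve` callback are hypothetical names, and the state is assumed to be a flat dictionary of string settings.

```python
from typing import Callable

State = dict[str, str]


def plan(desired: State, actual: State) -> list[str]:
    """Dry run: list proposed changes without touching `actual`."""
    changes = []
    for key in sorted(desired.keys() | actual.keys()):
        if key not in actual:
            changes.append(f"+ {key} = {desired[key]}")
        elif key not in desired:
            changes.append(f"- {key}")
        elif desired[key] != actual[key]:
            changes.append(f"~ {key}: {actual[key]} -> {desired[key]}")
    return changes


def pipeline_step(desired: State, actual: State,
                  approve: Callable[[list[str]], bool]) -> bool:
    """Show the plan to a reviewer and apply it only if they approve."""
    proposed = plan(desired, actual)
    if not proposed or not approve(proposed):
        return False
    actual.clear()
    actual.update(desired)
    return True
```

The `approve` callback stands in for whatever review step the team uses, such as a manual confirmation in the pipeline; nothing is changed unless it returns `True`.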
What is the significance of using a declarative language like Terraform for IAC?
-Using a declarative language like Terraform for IAC allows for specifying the desired state of the infrastructure. Terraform then determines the necessary changes to achieve this state, making the process safer and more efficient than imperative scripts.
How can someone with an existing project that has evolved over time start using IAC?
-One can start using IAC by using tools that can inspect the current GCP project and create IAC descriptions of the current state. They can then test these files in a controlled environment and gradually expand their IAC implementation as they gain confidence.
What are the key takeaways from the discussion on IAC in the video script?
-The key takeaways are to describe infrastructure in runnable files checked into source control, work towards being able to recover from changes in minutes, and eventually set up a reconciliation loop to prevent infrastructure drift.
Outlines
🛠️ Infrastructure as Code for Reliability
In this segment, Martin Omander discusses the importance of Infrastructure as Code (IaC) with Steve McGhee, a former Google SRE. Steve explains how IaC, similar to version control for infrastructure, can prevent and resolve issues like the one Martin experienced with his broken Cloud Run service. He demonstrates using Terraform to revert to a previous configuration, emphasizing the benefits of IaC for maintaining and updating cloud infrastructure efficiently and safely. The conversation highlights the process of applying Terraform to restore service accounts and other components, showcasing a practical approach to infrastructure management.
🔄 The Power of Idempotency and IaC
This paragraph delves into the concept of idempotency in infrastructure management, where changes can be made repeatedly without causing unintended effects. Steve McGhee discusses the advantages of using declarative languages like Terraform over imperative scripts, which can lead to errors if not carefully managed. The discussion moves to the practical aspects of IaC, such as quickly setting up new environments for developers or testing, and the importance of handling data with dedicated tools. The paragraph also introduces the concept of a reconciliation loop, a proactive measure to detect and address changes made outside of IaC, ensuring infrastructure remains in the desired state.
🌐 Implementing IaC and Integrating with CI/CD
The final paragraph provides actionable advice on how to start implementing IaC. Steve McGhee suggests beginning with a service you're familiar with and gradually expanding as confidence grows. He also addresses how IaC can be integrated into existing CI/CD pipelines, offering two approaches: running IaC as part of the pipeline with careful consideration of potential consequences, or managing infrastructure updates separately. The paragraph concludes with considerations for organizations with a dedicated infrastructure team and how they can leverage IaC to standardize processes and reduce the need for reactive maintenance.
Keywords
💡Google Cloud
💡Infrastructure as Code (IaC)
💡Cloud Run
💡Service Accounts
💡Terraform
💡Version Control
💡Reconciliation Loop
💡Idempotency
💡CI/CD Pipeline
💡Infrastructure Drift
💡Declarative Language
Highlights
Steve McGhee, a former Google SRE, shares insights on building reliable systems in the cloud.
Infrastructure as Code (IaC) is introduced as a solution to manage cloud infrastructure effectively.
IaC allows for version control of infrastructure, similar to application code.
Terraform is used to demonstrate how to restore a broken Cloud Run service to a previous working state.
The importance of having a last known good configuration for quick recovery is emphasized.
Alternatives to Terraform, like Pulumi, are mentioned for infrastructure management.
Advice on starting with IaC by describing infrastructure in runnable files checked into source control.
The concept of setting up a system to recover from infrastructure changes within minutes is discussed.
The idea of a reconciliation loop to prevent infrastructure drift is introduced.
Steve explains the difference between imperative and declarative languages in IaC tools.
The benefits of idempotent changes in IaC for safety and reducing side effects are highlighted.
The ability to recreate deleted projects quickly using IaC is explored.
Steve suggests using IaC to create new environments for developers or testing.
The complexity of managing database data in IaC and the need for dedicated tools is mentioned.
How to handle changes made outside of IaC tools and the importance of maintaining infrastructure integrity.
The relationship between IaC and CI/CD pipelines, and the options for integrating IaC into development workflows.
The role of a separate infrastructure team in organizations and how they can leverage IaC.
Tools that can generate IaC descriptions from existing GCP projects to help get started with IaC.
Key takeaways on implementing IaC in projects to ensure reliability and quick recovery from infrastructure issues.
Transcripts
MARTIN OMANDER: Steve, I was tuning the permissions
in my Google Cloud project, and that broke my Cloud Run service.
There's no undo button, so I can't get back
to a working state.
STEVE MCGHEE: Well, infrastructure as code
might have prevented that.
Let's see how.
[MUSIC PLAYING]
MARTIN OMANDER: Welcome back to the show, Steve.
I'm so happy you're here to tell us about reliability.
You have some solid experience.
STEVE MCGHEE: Thanks, Martin.
Yeah.
I was a site reliability engineer, or SRE, inside
Google for over a decade.
I worked on Search, Android, YouTube, and Cloud.
Now my job is helping developers understand
how to build reliable systems in the Cloud.
MARTIN OMANDER: And you said infrastructure as code
could help me with my project settings?
STEVE MCGHEE: Yeah.
Infrastructure as code or IAC is like using version control
for your infrastructure.
We all know it's safer to keep the application
code in source control.
It's a good idea to do the same for your infrastructure.
For example, I have a web app here
that uses Cloud Run, a Firestore database,
and a few other components.
Let's say I'm playing around with the service account
settings and I accidentally remove a role from one of them.
Now my users will get an error message, the web app is broken.
MARTIN OMANDER: Yeah, that's exactly what
happened to my application.
STEVE MCGHEE: Right.
The application is broken now, and we're losing money.
Everyone is stressed, and the application
needs to come back online ASAP.
It's hard to do troubleshooting in this kind of environment.
MARTIN OMANDER: Yep.
Been there, done that.
STEVE MCGHEE: So instead of troubleshooting,
let's run the Terraform file that
describes our last known good configuration.
I run `terraform apply` in this terminal, then I confirm,
and off it goes.
This may take a few minutes.
MARTIN OMANDER: And now it's restoring the service accounts
in your project?
STEVE MCGHEE: Yeah.
It's restoring the service accounts, the Cloud Run
services, my database settings, any virtual machines
I might have, and so on.
And now it's done.
Let's try accessing the web app again.
The app is working again.
We didn't have to do any troubleshooting
in a stressful situation.
MARTIN OMANDER: Ah, excellent.
And you use Terraform for this?
STEVE MCGHEE: Yeah.
I use Terraform, but there are alternatives
like Pulumi and others.
Some Cloud providers also offer their own tools.
The one thing they have in common
is that they update the infrastructure in a project,
like you might do in the Cloud Console or on the command line
with the gcloud command.
MARTIN OMANDER: I don't have infrastructure
as code in my project now, and I'm a little overwhelmed.
How can I get started?
STEVE MCGHEE: Sure.
First, describe your infrastructure as runnable files
that you can check into source control.
It's fine to do this for part of your system
and expand as you grow more comfortable.
Then your goal is to set things up so that if someone changes
the infrastructure in your project,
you'd be able to recover in minutes.
Finally, and this may be weeks or months down the road,
set up a reconciliation loop to prevent infrastructure drift.
MARTIN OMANDER: All right.
Tell me more about that first item on your list.
STEVE MCGHEE: OK.
By having your infrastructure described in runnable files,
you can see the desired state of your infrastructure.
For example, if your project includes Cloud Run services,
this desired state would describe things
like in which region the service is deployed, if each service is
open to anonymous users or if it's authenticated,
and what service accounts are used.
Now that these settings are in runnable files,
you can update them with precision
and collaborate around them, just like you
would with source code.
MARTIN OMANDER: Could you give us an example of that?
STEVE MCGHEE: Sure.
Let's say you have a Cloud Run service that is public
and you described it using Terraform.
Here's what it would look like.
Now, let's say in the next release,
this Cloud Run service should be an authenticated service that
can only be called by other services, not by end users.
A developer would update the Terraform file accordingly
and submit it to source control.
MARTIN OMANDER: And then that change
would use the same approval process
as any other code change?
STEVE MCGHEE: Yes.
The developer would create a pull request,
and it would be reviewed by another developer who might ask
questions or propose changes.
When the change has been approved
and it's time to release the next version of the system,
the Terraform file would be run and change
the authentication setting for the Cloud Run service.
MARTIN OMANDER: Back when I worked in startups,
we used to document the production environment
very carefully.
Is that the same thing?
STEVE MCGHEE: Not really.
A document still has to be interpreted
by a person who might make mistakes or misunderstand.
This is less ambiguous and less prone to errors
if you specify the infrastructure
in these runnable files.
MARTIN OMANDER: I have a bunch of gcloud commands
in a runnable shell script.
Is that infrastructure as code?
STEVE MCGHEE: Kind of.
But that shell script does the same set of steps
every time, regardless of the current state of the running
system.
We call that an imperative language.
Terraform, Pulumi, and some other tools like them
use declarative languages.
That means that you specify what your environment should
look like.
The tool then figures out what needs to change, if anything.
MARTIN OMANDER: That sounds like idempotency.
STEVE MCGHEE: That's right.
These changes are idempotent.
That is, they can be run again and again
without any unintended side effects.
That makes them safer than a simple imperative script which
might go off and make the same VM over
and over every time you ran it if you weren't careful.
MARTIN OMANDER: Ah.
Very good.
Item number two on your list has to do with recreating
a deleted project.
STEVE MCGHEE: Well, hopefully you
have some safeguards in place so that your projects can't
be easily deleted by mistake, but it's a good litmus test.
Ask yourself, how long would it take
to recreate your production application
if its project was deleted?
You can reduce that time significantly with IAC.
MARTIN OMANDER: Got it.
STEVE MCGHEE: So once you have those runnable infrastructure
files, you can use them to create new environments.
For example, when a new developer joins your team,
you may want to set up a new development project for them.
If you have runnable infrastructure files,
you can do that in minutes.
MARTIN OMANDER: I guess the same goes for new test environments?
STEVE MCGHEE: You're right, it does.
For example, you might want to set up a short-lived test
environment for a major new feature of your application.
IAC makes that easy.
MARTIN OMANDER: What about data in a database?
That would be needed too to set up a new project, right?
STEVE MCGHEE: That's right.
Data is more complex, so it's best
to use dedicated tools for that.
For example, you may use database tools
to back up your production data at regular intervals,
and you probably have a different set
of data for development that doesn't include
sensitive production data.
You'd use IAC to create your infrastructure,
then you'd use a database tool to restore data into it.
MARTIN OMANDER: Makes sense.
Let's say I have created an infrastructure file so I can
restore my system in minutes.
Then the last item on your list mentions a reconciliation loop.
What's that?
STEVE MCGHEE: So this is an extension of IAC,
but many teams find it useful.
Here's how it works.
People might make changes to your production or test
environments outside of IAC.
For example, someone in your organization
might notice that a service account has too many privileges.
They might change that manually in the Cloud Console.
MARTIN OMANDER: And that could break the application?
STEVE MCGHEE: Yes, it could break the application.
But even if it doesn't, we can no longer
be sure what our infrastructure looks like or when it changed.
We need a way to notice changes made outside of our IAC tool.
MARTIN OMANDER: And what should we do if we notice a change?
STEVE MCGHEE: Well, you could report the change
or you could overwrite it.
It's up to you which you prefer.
You may want to start with just reporting,
by sending a diff between the desired
environment and the actual environment. It's like an alert.
MARTIN OMANDER: And how would I get started
with this reconciliation loop?
STEVE MCGHEE: Well, you'd set up a nightly check, like in a cron job.
Then you would run the check more often
to see if you see some benefits.
One option is to run it against a subset of your infrastructure
if that subset is extra important.
MARTIN OMANDER: All right.
I have some questions for you, Steve.
STEVE MCGHEE: Go ahead, Martin.
MARTIN OMANDER: So I already have a CI/CD pipeline
based on your recommendations from my previous video.
How does infrastructure as code relate to that CI/CD pipeline?
STEVE MCGHEE: OK.
There are two approaches.
Either run IAC as part of the CI/CD pipeline.
There would be two passes, one to update
the infrastructure, another to deploy applications to it.
But you can imagine there could be unintended consequences
this way.
So you may want to run the dry run command every time
and provide the developer an opportunity
to evaluate if they want the proposed infrastructure
changes to be applied or not.
MARTIN OMANDER: Got it.
And what's the other approach?
STEVE MCGHEE: The other approach is
to update infrastructure separately, like just
on initial creation and then periodically after that.
These might be suggested by security or compliance teams,
and they could even be managed separately.
Again, the best practice here would
be to do the dry run first, like `terraform plan`,
and inform the app owners of the intended changes
before they are applied.
You have to find the approach that's
right for you, your application, and your team.
MARTIN OMANDER: And in some organizations,
there is a separate infrastructure team.
Does that change things?
STEVE MCGHEE: Then IAC may run outside of that CI/CD pipeline.
The infrastructure team could choose to release updates
on a different cycle, but they'd still use IAC when they do that.
The infrastructure experts can deliver templates
to the developers so the developers
don't have to become infrastructure
experts themselves.
For example, they may create a Terraform template
for an internal-only Cloud Run service.
That saves time for the developers
and it means that the infrastructure is standardized
across applications.
The infrastructure team will do less firefighting and more
designing of fire trucks.
MARTIN OMANDER: I love that analogy, Steve.
So my Google Cloud project has evolved over time
as I've made dozens and dozens of tweaks over the years.
How do I even get started with infrastructure as code?
STEVE MCGHEE: Well, there are tools
that can inspect your GCP project
and create IAC descriptions of your current state.
You can get started by running this gcloud command.
There are also third-party tools which are more specialized
and may be a better fit for you.
Use one of these tools to get your current state.
Focus on one service that you know well, like Cloud Run,
and make sure the IAC files work really
well for that service in a test environment,
then expand to other parts of your system
as you gain confidence.
MARTIN OMANDER: All right, Steve.
That was a lot of information.
What are the takeaways?
STEVE MCGHEE: First, describe your infrastructure
as runnable files that you can check into source control.
It's fine to do this for part of your system
and then expand as you grow more comfortable.
Then keep working on this until you've set things
up so that if someone changes the infrastructure
in your project, you'd be able to recover in minutes.
Finally, and this may be weeks or months down the road,
set up a reconciliation loop to prevent infrastructure drift.
MARTIN OMANDER: Sounds good, Steve.
Thanks for sharing this with us.
STEVE MCGHEE: Thanks for having me, Martin.
MARTIN OMANDER: And thank you, everyone, for watching.
If you have questions for Steve or me,
please enter them in the comments below.
Also, let me know if there are other serverless topics you'd
like to see in future episodes.
I read every single comment.
Until next time.
[MUSIC PLAYING]