That time Google Cloud Platform bricked the Internet…
Summary
TL;DR: Last week, a major outage occurred when a bug in Google Cloud's code caused massive disruptions across the internet. Affected services included Snapchat, Spotify, Discord, and even Google's own products such as Gmail and Drive. The root cause was a new feature in Google Cloud's API management service that shipped without proper error handling and was never exercised in staging. Once triggered, the bug caused widespread failures that took hours to fully resolve. The incident not only exposed Google to financial penalties but also tarnished Google Cloud's reputation in a competitive market. The video emphasizes the importance of thorough testing and error handling in software development to prevent such disasters.
Takeaways
- 😀 Google Cloud caused a significant internet outage last week, affecting services like Snapchat, Spotify, and Discord.
- 😀 The outage traced back to buggy code added to Google Cloud's API management system on May 29th, 2025, which sat dormant until a later policy change triggered it.
- 😀 Cloudflare's KV service saw nearly 100% error rates for over two hours, and the failure cascaded, bringing down numerous dependent websites.
- 😀 Google Cloud's services, including Gmail, Google Calendar, and Drive, were also affected, showing the scale of the failure.
- 😀 Outages of this scale can cost companies millions in damages and trigger financial compensation through SLA credits.
- 😀 Google Cloud’s failure severely impacts its reputation, particularly since it is already third in market share, behind AWS and Azure.
- 😀 Sundar Pichai mentioned that AI now writes over 30% of Google’s code, leading to speculation about AI’s role in the failure, though human error is more likely.
- 😀 The dormant bug was triggered by a global policy change inserted on June 12th, 2025, which sent the API management binary into a crash loop (see the sketch after this list).
- 😀 The faulty code path had never been exercised before that policy change, highlighting the lack of proper error handling and of testing in staging.
- 😀 Google had a rollback mechanism in place, but it took 40 minutes to begin the rollback and roughly four hours for the system to fully stabilize.
- 😀 PostHog, the video sponsor, is a platform for building better products with features like analytics, AI-powered product assistants, and more.
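To make the failure mode concrete, here is a minimal sketch of how a dormant, unguarded code path can turn a routine policy push into a crash loop. All names (QuotaPolicy, evaluate, and so on) are illustrative assumptions, not Google's actual code.

```go
// Sketch only: illustrative names, not Google's actual API management code.
package main

import "fmt"

type QuotaCheck struct {
	Limit int
}

// QuotaPolicy mimics a replicated policy record; the new feature reads
// QuotaCheck, which older policies never populated.
type QuotaPolicy struct {
	Service    string
	QuotaCheck *QuotaCheck // nil for policies written before the new feature
}

// evaluate is the "dormant" path: it dereferences QuotaCheck without a
// nil guard, so it only blows up once a bad policy actually reaches it.
func evaluate(p QuotaPolicy) int {
	return p.QuotaCheck.Limit // panics if QuotaCheck is nil
}

func main() {
	// A June 12th-style global policy push: a record with blank fields is
	// replicated everywhere, so every instance reads the same bad data.
	bad := QuotaPolicy{Service: "example.googleapis.com"}

	for restart := 1; restart <= 3; restart++ {
		func() {
			defer func() {
				if r := recover(); r != nil {
					fmt.Printf("restart %d: binary crashed: %v\n", restart, r)
				}
			}()
			evaluate(bad) // same bad policy on every restart -> crash loop
		}()
	}
}
```

Each simulated restart reads the same replicated bad policy, which is why restarting alone cannot break the loop; only removing or fixing the policy (or the code) does.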
Q & A
What caused the massive outage in June 2025?
-The outage was caused by a bug in code deployed to Google Cloud's API management service. A newly added quota policy check crashed the API management binary once a global policy change exercised it.
Which major services were affected by the Google Cloud outage?
-The outage affected several major services including Gmail, Google Calendar, Google Drive, and Google Meet. It also impacted third-party platforms like Snapchat, Spotify, and Discord.
How long did the outage last, and how did Google respond?
-The outage lasted for over four hours. Google was able to initiate a rollback after 40 minutes, but it took around four hours for the systems to fully stabilize.
What is a Service Level Agreement (SLA), and how does it relate to this outage?
-An SLA is a contract that guarantees a certain level of uptime for cloud services. In this case, Google Cloud violated its SLA, which could entitle affected customers to financial compensation, but the real damage was to its reputation.
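As a rough illustration of how an outage of this length interacts with an uptime SLA, here is the basic arithmetic; the credit tiers below are purely hypothetical and are not Google Cloud's actual terms.

```go
// Hypothetical SLA arithmetic, not Google Cloud's real credit schedule.
package main

import (
	"fmt"
	"time"
)

func main() {
	month := 30 * 24 * time.Hour // ~720 hours in a billing month
	outage := 4 * time.Hour      // roughly the window described in the video

	uptime := 100 * (1 - outage.Hours()/month.Hours())
	fmt.Printf("monthly uptime: %.2f%%\n", uptime) // ~99.44%

	// Hypothetical credit tiers, just to show the mechanism:
	switch {
	case uptime < 99.0:
		fmt.Println("SLA credit: 25% of the monthly bill")
	case uptime < 99.9:
		fmt.Println("SLA credit: 10% of the monthly bill")
	default:
		fmt.Println("SLA met, no credit")
	}
}
```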
What was the main technical flaw that caused the outage?
-The main technical flaw was the lack of error handling in the newly added quota policy check, which caused a null pointer exception that led to the API management service crashing.
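For contrast, here is a hedged sketch of what minimal error handling on that path could look like: validating the policy and failing safe instead of letting a null value crash the process. The names and the fallback value are assumptions for illustration.

```go
// Sketch only: one way to guard the quota check instead of crashing.
package main

import (
	"errors"
	"fmt"
)

type QuotaCheck struct{ Limit int }

type QuotaPolicy struct {
	Service    string
	QuotaCheck *QuotaCheck
}

var errMalformedPolicy = errors.New("quota policy missing required fields")

// evaluate validates its input and returns an error instead of
// dereferencing a nil pointer.
func evaluate(p QuotaPolicy) (int, error) {
	if p.QuotaCheck == nil {
		return 0, errMalformedPolicy
	}
	return p.QuotaCheck.Limit, nil
}

func main() {
	bad := QuotaPolicy{Service: "example.googleapis.com"}

	limit, err := evaluate(bad)
	if err != nil {
		// Fail safe: log the bad record and fall back to a default
		// instead of crash-looping on every replica.
		fmt.Println("skipping malformed policy:", err)
		limit = 1000 // hypothetical default limit
	}
	fmt.Println("effective limit:", limit)
}
```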
What is the role of the API management service in Google Cloud?
-The API management service handles incoming API requests, ensuring they are authorized and managing quota and policy information. It also ensures that this information is replicated across Google Cloud’s global data centers.
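A toy sketch of that request path, under the assumption that it boils down to "authorize the call, check quota against replicated policy data, then forward"; every name here is made up for illustration.

```go
// Sketch only: a toy version of an API-management check on each request.
package main

import (
	"errors"
	"fmt"
	"sync"
)

// policyStore stands in for the regionally replicated quota/policy data.
type policyStore struct {
	mu     sync.Mutex
	quotas map[string]int // service -> requests allowed per window
	used   map[string]int
}

func (s *policyStore) allow(service string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	limit, ok := s.quotas[service]
	if !ok {
		return errors.New("no policy for service")
	}
	if s.used[service] >= limit {
		return errors.New("quota exceeded")
	}
	s.used[service]++
	return nil
}

// handle is the front door: authorize, enforce quota, then forward.
func handle(store *policyStore, apiKey, service string) string {
	if apiKey == "" {
		return "401 unauthorized"
	}
	if err := store.allow(service); err != nil {
		return "429 " + err.Error()
	}
	return "200 forwarded to backend"
}

func main() {
	store := &policyStore{
		quotas: map[string]int{"example.googleapis.com": 2},
		used:   map[string]int{},
	}
	for i := 0; i < 3; i++ {
		fmt.Println(handle(store, "key-123", "example.googleapis.com"))
	}
}
```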
How did the code's dormant bug become active?
-The bug remained dormant until a global policy change was made on June 12th, which triggered the faulty code, leading to a crash loop in the API management binary.
How does Google Cloud’s market position relate to this incident?
-Google Cloud is in third place in the cloud services market, behind AWS and Azure. The outage significantly damaged its reputation, which could worsen its position in the competitive cloud market.
What is the importance of error handling in cloud infrastructure?
-Error handling is crucial in cloud infrastructure to prevent system crashes. In this case, the lack of proper error handling in the code led to a null pointer exception that caused widespread service disruptions.
What was the key lesson from this Google Cloud outage?
-The key lesson is the critical importance of thorough testing, proper error handling, and ensuring that all code paths are adequately checked before deployment to prevent such large-scale outages.
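One concrete form that lesson can take is a table-driven test that exercises the malformed-policy path before deployment. This reuses the hypothetical types from the sketches above and is not Google's actual test code.

```go
// policy_test.go - sketch of a test covering the "blank fields" code path
// that staging reportedly never exercised. In a real package, the types and
// evaluate would live in the code under test, not in the test file.
package policy

import "testing"

type QuotaCheck struct{ Limit int }

type QuotaPolicy struct {
	Service    string
	QuotaCheck *QuotaCheck
}

func evaluate(p QuotaPolicy) (int, bool) {
	if p.QuotaCheck == nil {
		return 0, false
	}
	return p.QuotaCheck.Limit, true
}

func TestEvaluate(t *testing.T) {
	cases := []struct {
		name   string
		policy QuotaPolicy
		wantOK bool
	}{
		{"well-formed policy", QuotaPolicy{Service: "svc", QuotaCheck: &QuotaCheck{Limit: 10}}, true},
		{"blank fields, like the June 12th push", QuotaPolicy{Service: "svc"}, false},
	}
	for _, c := range cases {
		if _, ok := evaluate(c.policy); ok != c.wantOK {
			t.Errorf("%s: got ok=%v, want %v", c.name, ok, c.wantOK)
		}
	}
}
```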