CrowdStrike Outage Explained by Keith Barker CCIE

CBT Nuggets

23 Jul 2024 · 10:52

Summary

TL;DR: In July 2024, a faulty CrowdStrike update caused over 8 million Windows computers to show the 'Blue Screen of Death' (BSOD), disrupting numerous services globally. Keith Barker uses the analogy of a castle's security areas to explain how the incident occurred: a faulty update to CrowdStrike's Falcon software, which runs in kernel mode, crashed the whole system. He outlines the resolution process involving safe mode and Microsoft's updated recovery tool, and emphasizes the importance of better Quality Assurance (QA) to prevent such widespread system failures.

Takeaways

😕 A CrowdStrike incident in July 2024 caused over 8 million Windows computers to display the 'Blue Screen of Death' (BSOD), impacting services worldwide.
🕵️‍♂️ The incident affected not only individual computers but also businesses, airlines, hospitals, and other critical services, causing widespread disruptions.
🏰 Keith Barker used a castle analogy to explain system security, with 'Area Zero' being the most secure part, analogous to 'Ring Zero' in computer systems.
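🛡 In computer systems, 'Ring Zero' (kernel mode) is the most privileged and protected area, where the operating system and critical functions run, while 'Ring One' in the video's simplified model is the less privileged user mode where most applications operate (on x86 hardware, Windows user mode actually runs in Ring 3).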
🔍 The CrowdStrike Falcon software, designed as an advanced anti-malware program, runs in 'Ring Zero' (kernel mode) for direct access to system resources.
🚫 The BSOD occurred due to an update in the Falcon software that introduced a faulty file, causing system failures when running in kernel mode.
🔄 The faulty update was identified as the file 'C-00000291*.sys', which, when deployed, led to system crashes and the BSOD because it was processed in a critical system area.
🛠️ To resolve the issue, users are advised to boot into safe mode, remove the problematic Falcon update files, and then reboot the system.
🔒 Recovery is more involved on systems using BitLocker, which may need the BitLocker recovery key before the faulty files can be removed.
🛑 Microsoft updated their recovery tool on July 22nd to assist IT admins with two repair options: booting from WinPE or recovering from safe mode.
🔄 The incident could have been avoided with better Quality Assurance (QA) on software updates, or by running Falcon as a user-mode application instead of in kernel mode, so that a crash would affect only the application and not the whole system.

Q & A

What is the main topic of the video by Keith Barker?
-The main topic of the video is the CrowdStrike incident that occurred in July 2024, which affected over 8 million Windows computers and caused a widespread 'Blue Screen of Death'.
What is the 'Blue Screen of Death' (BSOD)?
-The 'Blue Screen of Death' (BSOD) is an error screen displayed on Windows computers when a critical system error occurs, often resulting in a system crash and requiring a reboot.
How did the CrowdStrike incident impact people who did not personally experience a BSOD?
-The incident impacted people indirectly by causing disruptions to services such as businesses, airlines, hospitals, and other critical services around the world, leading to missed flights and appointments due to system downtimes.
What is the analogy used by Keith Barker to explain how the CrowdStrike incident happened?
-Keith Barker uses the analogy of a castle with different security areas. The incident was a faulty update, not a security breach, and the analogy shows why a failure in the innermost area is so damaging. In this analogy, 'Area Zero' represents the most secure area (ring zero in computer systems), while 'Area One' represents the outer perimeter (ring one in computer systems).
What are 'ring zero' and 'ring one' in the context of computer systems?
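-In computer systems, 'ring zero' refers to the most privileged area, also known as kernel mode, where the operating system and critical functions run. 'Ring one', as the video uses the term, is the less privileged area, also known as user mode, where most applications run. On x86 hardware, Windows user mode actually runs in ring 3; the video simplifies this to two areas.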
What is the role of the CrowdStrike Falcon software?
-CrowdStrike Falcon is an anti-malware program that runs on Windows computers to help identify and prevent malware. It is designed to be highly efficient in catching and preventing malicious activities.
Why did the Falcon software cause the 'Blue Screen of Death'?
-The Falcon software caused the 'Blue Screen of Death' because it was running in kernel mode (ring zero), and an update to the software introduced an incorrect file that caused the application to fail, leading to a system crash.
What is Windows Hardware Quality Labs (WHQL) and its significance in the CrowdStrike incident?
-WHQL is a certification process in which Microsoft tests and approves third-party drivers. The Falcon software was certified through WHQL, indicating that it was tested and approved by Microsoft, but the incident was caused by an update delivered after certification, which that testing did not cover.
How can a computer affected by the CrowdStrike incident be resolved?
-To resolve the issue, a computer can be rebooted into safe mode, where the updated files causing the problem can be identified and deleted. Afterward, the system can be rebooted normally.
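The file-removal step above can be sketched in a few lines of Python. This is only an illustrative sketch, not CrowdStrike's or Microsoft's tooling: the directory C:\Windows\System32\drivers\CrowdStrike and the pattern C-00000291*.sys are assumptions taken from CrowdStrike's public remediation guidance rather than from this summary, and the function names are hypothetical. It defaults to a dry run so that nothing is deleted unless explicitly requested.

```python
from pathlib import Path

# Assumptions (not stated in this summary): directory and file pattern as
# given in CrowdStrike's public remediation guidance.
DEFAULT_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"


def find_faulty_channel_files(driver_dir=DEFAULT_DRIVER_DIR, pattern=FAULTY_PATTERN):
    """Return the files in driver_dir whose names match the faulty pattern."""
    driver_dir = Path(driver_dir)
    if not driver_dir.is_dir():
        # Directory missing: Falcon is not installed here, nothing to do.
        return []
    return sorted(p for p in driver_dir.glob(pattern) if p.is_file())


def remove_faulty_channel_files(driver_dir=DEFAULT_DRIVER_DIR,
                                pattern=FAULTY_PATTERN,
                                dry_run=True):
    """List matching files and delete them only when dry_run is False."""
    matches = find_faulty_channel_files(driver_dir, pattern)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Dry run by default: only report what would be removed.
    for path in remove_faulty_channel_files():
        print(f"would remove: {path}")
```

On an affected machine this would have to run from safe mode or a recovery environment with administrator rights, followed by a normal reboot, as described in the answer above.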
What complications might arise during the recovery process for servers or systems using BitLocker?
-For servers without a GUI or systems using BitLocker, additional steps are required for recovery. Servers may require scripting to make changes, and BitLocker systems may need the BitLocker recovery key before the encrypted drive can be accessed and the files removed.
How could the CrowdStrike incident have been avoided?
-The incident could have been avoided with better Quality Assurance (QA) on the updates for the Falcon software or by running the Falcon software as an application in user mode (ring one) instead of kernel mode (ring zero), which would limit the impact of a crash to the application itself rather than the entire system.
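The isolation argument in this answer can be illustrated with a small Python analogy. It is a sketch under stated assumptions, not how Falcon or Windows works internally: the "updates" are hypothetical one-line programs, and run_isolated is an illustrative helper. Each update runs in its own user-mode process, so a failing one ends only that process while the supervising program keeps running.

```python
import subprocess
import sys

# Hypothetical "updates": one behaves, one fails on invalid data.
GOOD_UPDATE = "data = [1]; data[0]"
BAD_UPDATE = "data = None; data[0]"  # raises TypeError inside the child


def run_isolated(code: str) -> int:
    """Run code in a separate Python process and return its exit status."""
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode


def supervisor() -> dict:
    """Run both updates; the supervisor survives even if a child fails."""
    return {
        "good": run_isolated(GOOD_UPDATE),
        "bad": run_isolated(BAD_UPDATE),
    }


if __name__ == "__main__":
    statuses = supervisor()
    # A non-zero status means that child failed; this process is unaffected.
    print(statuses)
```

Code running in kernel mode has no such boundary: an unhandled fault there stops the whole operating system, which is what the BSOD represents.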