CrowdStrike Outage Explained by Keith Barker CCIE
Summary
TLDR: In July 2024, a CrowdStrike incident caused over 8 million Windows computers to show the 'Blue Screen of Death' (BSOD), disrupting numerous services globally. Keith Barker uses the analogy of a castle's security areas to explain how the incident occurred: a faulty update to CrowdStrike's Falcon software, which runs in kernel mode, crashed the whole system. He outlines the resolution process involving safe mode and Microsoft's updated recovery tool, and emphasizes the importance of better Quality Assurance (QA) to prevent such widespread system failures.
Takeaways
- 😕 A CrowdStrike incident in July 2024 caused over 8 million Windows computers to display the 'Blue Screen of Death' (BSOD), impacting services worldwide.
- 🕵️‍♂️ The incident affected not only individual computers but also businesses, airlines, hospitals, and other critical services, causing widespread disruptions.
- 🏰 Keith Barker used a castle analogy to explain system security, with 'Area Zero' being the most secure part, analogous to 'Ring Zero' in computer systems.
- 🛡 In computer systems, 'Ring Zero' is the most secure area, where the operating system and critical functions run, while 'Ring One' is less secure and where most applications operate.
- 🔍 The CrowdStrike Falcon software, designed as an advanced anti-malware program, runs in 'Ring Zero' (kernel mode) for direct access to system resources.
- 🚫 The BSOD occurred due to an update in the Falcon software that introduced a faulty file, causing system failures when running in kernel mode.
- 🔄 The faulty update files can be identified by names beginning with 'C-00000291-'. Because they are loaded by a driver running in kernel mode, deploying them led to system crashes and the BSOD.
- 🛠️ To resolve the issue, users are advised to boot into safe mode, remove the problematic Falcon update files, and then reboot the system.
- 🔒 Additional complications in recovery may arise for systems using BitLocker, requiring extra steps for recovery.
- 🛑 Microsoft updated their recovery tool on July 22nd to assist IT admins with two repair options: booting from WinPE or recovering from safe mode.
- 🔄 The incident could have been avoided with better Quality Assurance (QA) on software updates, or by running Falcon not in kernel mode but as a user application to prevent system-wide crashes.
Q & A
What is the main topic of the video by Keith Barker?
-The main topic of the video is the CrowdStrike incident that occurred in July 2024, which affected over 8 million Windows computers and caused a widespread 'Blue Screen of Death'.
What is the 'Blue Screen of Death' (BSOD)?
-The 'Blue Screen of Death' (BSOD) is an error screen displayed on Windows computers when a critical system error occurs, often resulting in a system crash and requiring a reboot.
How did the CrowdStrike incident impact people who did not personally experience a BSOD?
-The incident impacted people indirectly by causing disruptions to services such as businesses, airlines, hospitals, and other critical services around the world, leading to missed flights and appointments due to system downtimes.
What analogy does Keith Barker use to explain how the CrowdStrike incident happened?
-Keith Barker uses the analogy of a castle with different security areas. The incident was not a security breach; the analogy explains why a failure in the most protected area is so damaging. 'Area Zero' represents the most secure area (ring zero in computer systems), while 'Area One' represents the outer perimeter (ring one in computer systems).
What are 'ring zero' and 'ring one' in the context of computer systems?
-In computer systems, 'ring zero' refers to the most secure and most privileged area, also known as kernel mode, where the operating system and critical functions run. 'Ring one' is Barker's label for the less privileged area, also known as user mode, where most applications run (on x86 hardware, user mode actually corresponds to ring 3).
What is the role of the CrowdStrike Falcon software?
-CrowdStrike Falcon is an anti-malware program that runs on Windows computers to help identify and prevent malware. It is designed to be highly efficient in catching and preventing malicious activities.
Why did the Falcon software cause the 'Blue Screen of Death'?
-The Falcon software caused the 'Blue Screen of Death' because it was running in kernel mode (ring zero), and an update to the software introduced an incorrect file that caused the application to fail, leading to a system crash.
What is Windows Hardware Quality Labs (WHQL) and what is its significance in the CrowdStrike incident?
-WHQL is a certification process in which Microsoft tests and signs third-party drivers. The Falcon driver was certified through WHQL, indicating that it was tested and approved by Microsoft, but the incident was caused by an update delivered after certification.
How can a computer affected by the CrowdStrike incident be recovered?
-To resolve the issue, a computer can be rebooted into safe mode, where the updated files causing the problem can be identified and deleted. Afterward, the system can be rebooted normally.
What complications might arise during the recovery process for servers or systems using BitLocker?
-For servers without a GUI or systems using BitLocker, additional steps are required for recovery. Servers may require scripting to make changes, and BitLocker systems may need the decryption keys to proceed with the recovery process.
How could the CrowdStrike incident have been avoided?
-The incident could have been avoided with better Quality Assurance (QA) on the updates for the Falcon software or by running the Falcon software as an application in user mode (ring one) instead of kernel mode (ring zero), which would limit the impact of a crash to the application itself rather than the entire system.
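The "better QA" point can be illustrated with a staged rollout gate: an update goes to a small group first and only proceeds while the observed crash rate stays under a threshold. This is a minimal sketch, not CrowdStrike's actual process; `StageResult`, `stages_cleared`, the stage names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    """Observed outcome after deploying an update to one rollout stage (hypothetical)."""
    name: str
    hosts: int
    crashed: int


def stages_cleared(results: list[StageResult], max_crash_rate: float = 0.0) -> list[str]:
    """Return the stages that passed, stopping at the first stage that crashes too often."""
    cleared = []
    for stage in results:
        # Treat an empty stage as a failure: no hosts means no evidence of safety.
        rate = stage.crashed / stage.hosts if stage.hosts else 1.0
        if rate > max_crash_rate:
            break  # halt the rollout; later stages never receive the update
        cleared.append(stage.name)
    return cleared


# Illustrative numbers only: the canary stage crashes, so the broad stage is never reached.
stages = [
    StageResult("internal lab", 50, 0),
    StageResult("canary", 1000, 12),
    StageResult("broad", 100000, 0),
]
print(stages_cleared(stages))
```

The design choice is to fail closed: any stage above the threshold stops everything after it, which limits the blast radius of a bad file to the earliest stage that sees it.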
Outlines
😲 The CrowdStrike Incident Overview
Keith Barker introduces the CrowdStrike incident that occurred in July 2024, affecting over 8 million Windows computers worldwide. He outlines the three main questions to be addressed: what happened, why it happened, and how to resolve it. Barker explains the widespread impact, including service interruptions in businesses, airlines, hospitals, and more, highlighting the 'Blue Screen of Death' (BSOD) as the primary symptom. He uses the analogy of a castle to describe the secure areas within a computer system, with 'Area Zero' being the most secure, akin to 'Ring Zero' in computer terms, where critical functions operate.
🛡️ The Technical Explanation of the Incident
Barker delves into the technical aspects of the CrowdStrike incident, explaining the roles of 'Ring Zero' and 'Ring One' in a computer system's architecture. He describes how applications operate in 'User Mode' or 'Ring One' and how core system functions, including CrowdStrike's Falcon driver, operate in 'Kernel Mode' or 'Ring Zero'. The incident occurred because an update to the Falcon software, which runs in kernel mode, caused a system-wide halt when it failed, leading to the BSOD. Barker also discusses the certification of the Falcon software through Windows Hardware Quality Labs (WHQL) and the consequences of the update that went wrong on July 19th.
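The isolation that user mode provides can be seen with ordinary processes. The sketch below is an illustration only: it starts two child processes, one of which dies abruptly, and shows that the parent keeps running. The helper name `run_isolated` is hypothetical, and the example demonstrates process isolation in user mode, not anything about Falcon itself.

```python
import subprocess
import sys


def run_isolated(code: str) -> int:
    """Run a Python snippet in a separate user-mode process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode


# The first child terminates abnormally; only that process dies.
crashed = run_isolated("import os; os._exit(3)")
healthy = run_isolated("print('ok')")

print(f"crashed child exit code: {crashed}")
print(f"healthy child exit code: {healthy}")
print("parent still running")
```

A failing kernel-mode driver has no such boundary around it, which is why Windows halts with a BSOD instead of carrying on.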
🔄 The Resolution and Prevention of Future Incidents
The resolution of the incident involves rebooting affected computers into 'Safe Mode' to remove the problematic update files from the Falcon software, which were identified by a specific file signature. Barker acknowledges the complexity of this process, especially for servers without a graphical user interface (GUI) attached or systems using BitLocker. He also mentions the updated Microsoft recovery tool released on July 22nd, which aids in the repair process. To prevent such incidents, Barker suggests better quality assurance (QA) for software updates and the consideration of running security software like Falcon in 'User Mode' instead of 'Kernel Mode' to avoid system-wide crashes.
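The file-removal step can be sketched as a small script. This assumes details that are not in the video: the directory `%WINDIR%\System32\drivers\CrowdStrike` and the pattern `C-00000291*.sys` are taken from CrowdStrike's public remediation guidance, and the function name `remove_faulty_channel_files` is hypothetical. It defaults to a dry run so nothing is deleted until the matches have been reviewed.

```python
import os
from pathlib import Path

# Assumption: location and pattern come from CrowdStrike's public guidance, not from the video.
DEFAULT_DIR = Path(os.environ.get("WINDIR", r"C:\Windows")) / "System32" / "drivers" / "CrowdStrike"
PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(directory: Path = DEFAULT_DIR, dry_run: bool = True) -> list[Path]:
    """Find files matching PATTERN in directory; delete them unless dry_run is True."""
    if not directory.is_dir():
        return []  # Falcon not installed here, or the path is wrong
    matches = sorted(directory.glob(PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Dry run: list what would be removed without touching anything.
    for path in remove_faulty_channel_files(dry_run=True):
        print(f"would delete: {path}")
```

On a machine stuck at the BSOD, a script like this could only run from safe mode or a recovery environment, and on BitLocker-protected volumes only after the drive has been unlocked with its recovery key.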
Keywords
💡CrowdStrike
💡Blue Screen of Death (BSOD)
💡Ring Zero and Ring One
💡Kernel Mode
💡User Mode
💡Falcon Software
💡WHQL
💡Dynamic Updates
💡Safe Mode
💡BitLocker
💡QA (Quality Assurance)
Highlights
Introduction to the CrowdStrike incident that affected over 8 million Windows computers in July 2024.
Impact of the incident extended beyond direct blue screen of death occurrences, affecting services like airlines and hospitals worldwide.
An analogy of a castle is used to explain the secure areas within a computer system, comparing outer perimeters to 'ring one' and the innermost secure area to 'ring zero'.
Explanation of the difference between kernel mode (ring zero) and user mode (ring one) in operating systems.
The CrowdStrike Falcon software, designed as an advanced anti-malware program, inadvertently caused the blue screen of death due to its operation in kernel mode.
Falcon software's certification through Windows Hardware Quality Labs (WHQL) signifies Microsoft's approval but does not prevent issues with updates.
The specific update on July 19th introduced a faulty file, causing the Falcon driver to fail and result in the blue screen of death.
Resolution involves booting into safe mode to delete the problematic update files, which can be challenging for servers without GUIs or systems using BitLocker.
Microsoft's update to their recovery tool on July 22nd provided additional support for IT admins to expedite the repair process.
The incident could have been avoided with better Quality Assurance (QA) on software updates or by not running Falcon in kernel mode.
The importance of thorough testing in QA to prevent system-wide failures like the CrowdStrike incident.
The potential trade-off between running security software in kernel mode for enhanced functionality and the increased risk of system crashes.
The incident's impact on hundreds of millions of people globally, highlighting the interconnectedness of digital systems in daily life.
The role of dynamic updates in software, which can introduce risks if not properly tested, as seen in the CrowdStrike incident.
The complexity of resolving system-wide issues when they occur, especially in environments without direct user interfaces.
The significance of the CrowdStrike incident as a case study in the balance between security software effectiveness and system stability.
Transcripts
hello and welcome my name is Keith
Barker and I'd like to give you a high
level overview of what you need to know
about the CrowdStrike incident that
happened in July 2024 and these are the
three main questions I'd like to cover
with you right now first of all what
happened secondly why did it happen and
third and fairly important to people who are
still cleaning up after this how do we
resolve it so let's begin with question
number one what exactly happened to over
8 million Windows computers and are you
impacted first of all it would look like
the result looks like this this is a
representation of the blue screen of
death or as his friends call it BSOD
for the acronym for blue screen of death
and for the question of are you impacted
the answer is yes even if you personally
didn't have a blue screen of death on
your Windows computer it's very likely
there was interruptions to other
services that you very likely were
attempting or were going to use things
like businesses and Airlines and
hospitals and various critical services
around the world so I have friends who
missed flights I had children that missed
doctor's appointments all because those
systems were down due to their blue
screen of death so again whether this
happened to you personally or you were
impacted by somebody else's computer
systems having blue screens of death
this impacted hundreds of millions of
people all over the planet in one way or
another so when something like that
happens one of the questions might be
well how in the world did this occur so
let's tackle that next and a great way
to understand why this happened would be
to use the analogy of a castle so let's
imagine we have a castle right here so
this would be the perimeter and then
within the castle there's other secure
areas it's very likely that's where we
keep the king or the royalty whoever it
happens to be in the most secure area
so we'll call this area zero the most
secure area and the outer perimeter here
we'll go ahead and call that area one so
we can think of this area zero here in
the castle the innermost part of it as
the most secure area now if somebody
needs access to the king or to the staff
of the king or to the royalty what can
happen is a request can be made for that
access maybe we need to get a decision
from the king or some other request
needs to be made that request is made
and then inside this most secure area
the decisions are made and then the
results are handed back and the concept
is if something negative happens out
here in area one let's say we have I'm
going to go ahead and draw an X to
represent something negative happening
hopefully that's not going to impact the
area zero because there's extra security
to get to area zero the goal is to not
let those negative events impact the
area zero the most secure area where the
king and the treasures are all kept or
in the case of a queen where the queen
and all the treasures are kept so how
does this story of the castle with a
most secure area and a less secure area
apply to why this blue screen of death
happened as part of the CrowdStrike
incident in July of 2024 and here's how
it applies I'm going to draw the same
diagram again except this time I'm going
to call this ring one and think of it
like area one for the castle the outside
perimeter but in a computer system it's
referred to as ring one and then the
operating system and the most critical
functions are going to be running in a
separate and more secure area called
ring zero and just like our analogy of
the castle and area one and area zero
inside of a computer system we have
these similar areas except they're
referred to as the most secure area here's
ring zero and the less secure area or
the outside perimeter here is referred
to as ring one and here at ring zero the
operating system is handling core
functions for the operating system
itself so when we have other
applications such as Microsoft Office
applications or other programs or
browsers that we're running they're
running in ring one it's also referred
to as user mode so these little red
boxes I'm going to refer to as
applications think of them like user
applications that are running in ring
one now for those applications to work
they need resources for example they
need to get to memory or they may need
to write to the disk or they may need to
write to the network or make requests
from the network so when those apps need
resources they're interacting and making
this request to the ring zero components
and right here at ring zero we have
things like memory management and access
to the hardware and all the super secure
services that the operating system is in
charge of and that would also include
things such as drivers so when an
application needs resources it's making
those requests and then hopefully those
requests are being granted back to the
applications from the operating system
another common term for ring zero and
ring one are kernel mode and I'll match
set the same color here and just think
of kernel mode as the most secure area
of the operating system that gets direct
access to the hardware and resources and
again that's where the operating system
runs and drivers run in ring zero or in
kernel mode and ring one where most
applications run that's referred to as
user mode so in parentheses I'll just
shout out ring zero for kernel mode and
next to user mode I'll go ahead and put
ring one just as a reminder now what's
the benefit of having two of these
different modes kernel mode at ring zero
and user mode well the benefit is if we
have an application that goes sideways
it has a problem or an issue hopefully
we just want that application to die by
itself and not take down the entire
system and that's normally the case with
applications that are run in user mode
for example let's imagine we're running
our favorite network-based game on our
computer it's running in user mode in
ring level one and it crashes the intent
is for just that application to crash
because of its problem and not to take
down the entire operating system and get
a blue screen of death so let me clean
this up just a teeny bit and let's talk
about why
the Falcon application let's take a look
at what that is why that from CrowdStrike
caused the blue screen of death
CrowdStrike makes some software that
runs on a Windows computer that helps to
identify and prevent malware so think of
it like an anti-malware program on
steroids it's really really efficient
until it brings down the computer but
for the moment let's just go ahead and
label out that their product called
Falcon again think of Falcon like an
antivirus or antimalware software and as
far as the impact goes the blue screen
of death happened to individuals and
computer systems where they were running
the Falcon service from CrowdStrike now
the reason that the Falcon software
caused the blue screen of death was
because it had two things that were
currently working at the same time
number one the Falcon software is
running in kernel mode here at ring zero
so if there's a problem with the
software and it ruins effectively the
kernel that's going to cause the blue
screen of death instead of the computer
still trying to continue which if
there's problems at ring zero with
access to memory for example two
different programs writing into the same
memory space or walking on top of each
other the operating system is designed
to halt instead of continuing which
could lead to further data corruption so
their code the Falcon code is being run
as a device driver here at ring zero in
kernel mode and there's probably some
great reasons why they're doing that one
would be they want more access and
direct access to make sure that their
antimalware and antivirus etc. software
is currently working correctly and able
to go ahead and really catch everything
that's trying to happen and it's not
like they just walked up and said hey
can we do this with Microsoft they went
through a process called
WHQL which is an acronym for Windows
Hardware Quality Labs and they were
certified which effectively means that
Microsoft tested and worked with their
software validated it and said yep
you're good to go we approve of this and
here's the rub even though the Falcon
software was certified through WHQL so
let's go ahead and imagine this is Falcon
right here so it's been certified it's
been signed by Microsoft that it's good
to go the underlying components of
Falcon periodically get updates and
that's where the whole thing went south
so let's imagine the Falcon software
itself has some subordinate files think
of them like files that the Falcon
software itself uses as part of its
operation and let's go ahead and put
file one and file two and file three and
so if they need to update some
components of the Falcon driver their
software effectively what they can do is
update those files and then they have an
updated component as part of the Falcon
software and that's deployed to clients
that are using this system from CrowdStrike
called Falcon through dynamic updates
well even though the Falcon software was
certified by Microsoft in the event that
they do an update and they have a
corrupt or incorrect file that they're
using as part of the Falcon software
because it's running in kernel mode at
ring zero if there's a problem with one
of those supporting files that's being
used by Falcon that could and did cause
a problem as part of the update that
happened on July 19th with these files
that were being used by the Falcon
driver which again is their code running
at ring zero in kernel mode it can be
identified as C- then five zeros
291 dash and then some extension which
won't matter too much because it's this
update that caused the problem so to
answer the question why did this happen
customer systems that were using
CrowdStrike's Falcon software when they got
the dynamic update the update had an
incorrect file and as a result the
incorrect file caused the application to
fail and because it was at ring zero or
kernel mode that caused the blue screen
of death so next let's turn our
attention to how it is being resolved
now they can push out a new update and
they have since the actual incident
happened but if a computer boots up to
the blue screen of death it's not going
to be able to continue with any other
kind of update because at that point
it's halted and the process for
correcting this if you're sitting at the
computer that has a blue screen of death
would be to reboot the computer into
what's known as safe mode and then in
safe mode you drill down to the file
structure find any of the updated files
from Falcon with this
00000291 and then delete that file
or any files with that 291 update and
then go ahead and reboot and that works
great most of the time except for the
fact that a lot of servers don't have a
GUI connected to them for example a GUI
is a graphical user interface a lot of
servers don't have a screen attached to
them all the time and as a result if you
have hundreds or thousands of servers
that are running in a Data Center and
they don't have screens that you
can just walk up to and work with it may
take a little more time or scripting to
actually make that change another
complication comes in if we're using a
security feature on the file system
called BitLocker so if a system is
using BitLocker there are some
additional steps and depending if you
have the keys or not additional steps
beyond that to get fully recovered just
be aware there is some additional work
to do if you're currently using BitLocker
also as part of the recovery
process to aid that Microsoft updated
their Microsoft recovery tool on July
22nd and that recovery tool provides two
repair options to help IT admins
expedite the repair process so one of
those options is booting from WinPE to
facilitate the repair and the other
option is doing the recover from safe
mode and I guess the final thing we
should chat about is how could this be
avoided and the answer is QA so even
though the Falcon driver or the Falcon
code itself which was running at ring
zero is certified via WHQL from
Microsoft the updates that they were
doing obviously were not thoroughly
tested enough to prevent the blue screen
of death so the two solutions would be
one is better QA on the updates for the
Falcon software or secondly they could
run the Falcon software not in ring zero
in kernel mode but rather run it as an
application which might degrade some of
its efficiencies in identifying malware
but at least if it crashed it would only
crash itself and it wouldn't cause the
entire system because it's not a kernel
zero application it wouldn't cause the
entire system to crash so thanks for
joining me in this video as we've
addressed three elements number one what
happened number two why it happened and
third how it's being resolved and until
next time I'm Keith Barker and stay safe