CrowdStrike IT Outage Explained by a Windows Developer
Summary
TLDR: Dave Plummer, a retired Microsoft software engineer, digs into the recent global Windows blue screens caused by a faulty CrowdStrike update. He explains the difference between kernel mode and user mode, why security software runs in kernel mode, and the risks of executing unsigned code there. He also walks through a practical fix for affected machines: boot into safe mode and remove the problematic update file. Along the way he addresses why Windows is not more resilient to this kind of failure.
Takeaways
- 👋 Dave introduces himself as a retired software engineer from Microsoft with experience dating back to MS-DOS and Windows 95.
- 💻 He explains the CrowdStrike issue, focusing on the differences between kernel mode and user mode, and why the machines are blue screening.
- 🔍 The CrowdStrike blue screens stem from a bad update to CrowdStrike's software; when a kernel driver like CrowdStrike's fails, the whole system goes down with it.
- 🚨 Kernel mode is critical because it controls hardware interaction, memory management, and core functionalities of the OS.
- 🛠️ Dave shares his experience debugging blue screens, explaining how bugs in kernel mode can crash the entire system, unlike user mode crashes which only affect the application.
- 🧪 At Microsoft, stress tests were run nightly to catch bugs early, with test engineers writing tests to expose weaknesses in the system.
- 🔄 CrowdStrike's Falcon sensor operates in kernel mode to monitor application behavior for security threats, requiring robust and thorough testing.
- ❌ The recent issue was caused by a dynamic definition file that was supposed to update the CrowdStrike driver but instead contained invalid data, leading to system crashes.
- 🔧 Fixing the issue involves booting into safe mode and deleting the problematic update file from the system's drivers folder.
- 📚 Dave concludes by promoting his book about living a successful life on the autism spectrum and encourages viewers to subscribe to his channel for more content.
Q & A
Who is Dave and what is his background?
-Dave Plummer is a retired software engineer from Microsoft whose experience dates back to the MS-DOS and Windows 95 days. Although retired, he still has a deep understanding of Windows development and debugging.
What issue is Dave discussing in the video?
-Dave is discussing the CrowdStrike issue, which has been causing blue screens on Windows machines worldwide due to a bad update to CrowdStrike's software.
What is the main difference between kernel mode and user mode in operating systems?
-Kernel mode is a more privileged mode where the operating system and device drivers run, having access to the entire system memory map and hardware. User mode is where applications run with limited access to system resources, ensuring that application crashes do not affect the entire system.
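The isolation described in this answer can be observed from user mode. The sketch below is only an illustration, not something shown in the video: it starts a child process that deliberately aborts and shows that the parent carries on. The function name `run_crashing_child` is made up for this example.

```python
import subprocess
import sys


def run_crashing_child() -> int:
    """Start a child process that aborts itself and return its exit status."""
    # os.abort() terminates only the child process; the parent keeps running.
    child = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return child.returncode


if __name__ == "__main__":
    status = run_crashing_child()
    print(f"child ended abnormally with status {status}; parent is still running")
```

The exact status value differs between operating systems, so the example only relies on it being non-zero.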
Why is running code in kernel mode considered risky?
-Running code in kernel mode is risky because if there is a bug in the kernel code, it can cause the entire system to crash, as it has access to all system resources and data structures.
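The transcript illustrates this with a double free: when the kernel notices it is about to free memory that is already freed, it halts rather than continue with corrupt state. A toy model of that policy, with hypothetical names (`ToyAllocator`, `KernelPanic`) and no claim to resemble real kernel code, might look like this:

```python
class KernelPanic(Exception):
    """Stands in for a bug check: the point where a real kernel would halt."""


class ToyAllocator:
    """Hands out integer handles and tracks which ones are currently live."""

    def __init__(self):
        self._next_handle = 1
        self._live = set()

    def alloc(self) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._live.add(handle)
        return handle

    def free(self, handle: int) -> None:
        # Freeing a handle that is not live means the bookkeeping is already
        # wrong, so stop here instead of carrying on with corrupt state.
        if handle not in self._live:
            raise KernelPanic(f"double free or invalid free of handle {handle}")
        self._live.remove(handle)
```

In user mode an exception like this ends one program. The kernel has nothing above it to hand the failure to, which is why the equivalent condition stops the whole machine.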
What is the role of the WHQL certification in ensuring the robustness of drivers?
-The WHQL (Windows Hardware Quality Labs) certification ensures that drivers have been thoroughly tested by the vendor, passed the Windows Hardware Lab Kit testing on various platforms, and are digitally signed by Microsoft as being compatible with the Windows operating system.
Why might CrowdStrike choose not to go through the WHQL certification for every update?
-CrowdStrike might choose not to go through the WHQL certification for every update to ensure that their customers get the latest protection as soon as new threats emerge, avoiding the delay that comes with the certification process.
What is the CrowdStrike Falcon sensor, and why does it need to run in kernel mode?
-The CrowdStrike Falcon sensor is a security product that analyzes a wide range of application behavior to proactively detect new attacks. It needs to run in kernel mode to have complete and unfettered access to system data structures and services to perform its job effectively.
What is the problem with executing unsigned code in kernel mode?
-Executing unsigned code in kernel mode is problematic because it can lead to system crashes if there is a bug. Unsigned code has not been verified for stability and security, increasing the risk of system instability or security vulnerabilities.
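In the video Dave speculates that the definition files might contain p-code that the signed driver interprets. The sketch below is a purely hypothetical illustration of that idea: a tiny stack machine plays the role of the driver, and a hash of its bytecode stands in for the signature. None of the opcodes or names come from CrowdStrike.

```python
import hashlib
from typing import List, Tuple

Instruction = Tuple[str, int]


def run_definition(program: List[Instruction]) -> int:
    """Tiny stack machine: the 'engine' that interprets a definition payload."""
    stack: List[int] = []
    for opcode, operand in program:
        if opcode == "push":
            stack.append(operand)
        elif opcode == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")
    return stack.pop()


def engine_fingerprint() -> str:
    """Hash of the engine's compiled bytecode, standing in for a code signature."""
    return hashlib.sha256(run_definition.__code__.co_code).hexdigest()


# Two payloads for the same, unchanged engine.
old_payload = [("push", 2), ("push", 3), ("add", 0)]
new_payload = [("push", 2), ("push", 3), ("mul", 0)]
```

`run_definition(old_payload)` returns 5 and `run_definition(new_payload)` returns 6, while `engine_fingerprint()` is identical before and after: the payload changed the behavior without touching the thing that was fingerprinted. A malformed payload such as `[("add", 0)]` raises `IndexError` in this toy engine; in kernel mode an unhandled failure of that kind would take the system down.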
How can one access a crash dump report to analyze the cause of a system crash?
-A crash dump report can be accessed by configuring the system to generate crash dump files. These files can provide detailed information about the state of the system at the time of the crash, including the offending instruction and the values of system registers.
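In the dump examined in the video, the faulting pointer holds the small value 9C hex, which Dave reads as a null base pointer plus a field offset. The arithmetic can be reproduced with a made-up structure; `Record` and its layout are assumptions chosen only so that one field lands at offset 0x9C.

```python
import ctypes


class Record(ctypes.Structure):
    """Hypothetical structure laid out so that `value` sits at offset 0x9C."""

    _fields_ = [
        ("header", ctypes.c_uint8 * 0x9C),  # 156 bytes of earlier fields
        ("value", ctypes.c_uint32),         # the field the code wants to read
    ]


def field_address(base: int, offset: int) -> int:
    """Address a load would target: structure base plus field offset."""
    return base + offset


null_base = 0
faulting_address = field_address(null_base, Record.value.offset)
print(hex(faulting_address))  # prints 0x9c
```

A faulting address that is a small number rather than a plausible pointer is a common sign that a null pointer was used as the base of a structure.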
What steps can be taken to fix a machine that has crashed due to the CrowdStrike issue?
-To fix a machine that has crashed due to the CrowdStrike issue, boot it into safe mode, navigate to the Windows\System32\drivers\CrowdStrike directory, find and delete the problematic update file (its name matches the pattern 'C-' followed by a series of zeros, then '291', with a .sys extension), and then reboot the system.
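The deletion step can be sketched as a small script for cases where a Python interpreter happens to be available; in practice the same steps are usually done by hand from the console. The folder path and the exact glob pattern below are assumptions derived from the description above, and the function name is invented for this example. It defaults to a dry run that only lists matches.

```python
from pathlib import Path
from typing import List

# Assumed values based on the steps described above; confirm them against
# the vendor's own guidance before deleting anything.
DEFAULT_FOLDER = Path(r"C:\Windows\System32\drivers\CrowdStrike")
DEFAULT_PATTERN = "C-00000291*.sys"


def remove_update_files(folder: Path = DEFAULT_FOLDER,
                        pattern: str = DEFAULT_PATTERN,
                        dry_run: bool = True) -> List[Path]:
    """Return files in `folder` matching `pattern`; delete them unless dry_run."""
    matches = sorted(p for p in folder.glob(pattern) if p.is_file())
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches
```

Calling `remove_update_files()` first shows what would be removed; `remove_update_files(dry_run=False)` performs the deletion, after which the machine is rebooted normally.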
What is the significance of marking a driver as a 'boot driver' in Windows?
-Marking a driver as a 'boot driver' in Windows signifies that the driver is essential for the startup of the Windows operating system. If such a driver crashes, the system may not boot properly, as it is considered a critical component of the boot process.
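For context, Windows records how a driver is started as a small integer, the `Start` value under the driver's service key in the registry, where 0 means boot start. The enum below simply names those values; the helper function is hypothetical, and nothing here inspects a real system or any particular vendor's driver.

```python
from enum import IntEnum


class StartType(IntEnum):
    """Values of the `Start` entry in a driver's service registry key."""

    BOOT = 0      # loaded by the boot loader, before the kernel initializes
    SYSTEM = 1    # loaded during kernel initialization
    AUTO = 2      # started automatically during system startup
    DEMAND = 3    # started only on request
    DISABLED = 4  # never started


def is_boot_start(start_value: int) -> bool:
    """True when the value marks a driver that is loaded as part of boot."""
    return StartType(start_value) == StartType.BOOT
```

A driver with the `BOOT` start type is loaded on every start, so a fault in it recurs on each reboot.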
Outlines
👨💻 Dave's Introduction to CrowdStrike Blue Screen Issue
Dave, a retired Microsoft software engineer, introduces himself and sets the stage for a discussion on the CrowdStrike blue screen issue. He provides a brief background on his experience with Windows development and blue screen errors. Dave outlines his intention to explain the CrowdStrike issue, the difference between kernel mode and user mode, and the implications of a failed kernel driver. He also shares his unique perspective, gained from being stranded in New York City, on the global impact of the CrowdStrike software update that led to widespread blue screens.
🔍 Deep Dive into Kernel Mode and CrowdStrike's Role
This paragraph delves into the technical aspects of kernel mode versus user mode, explaining the fundamental differences and the critical nature of kernel mode operations. Dave discusses the role of CrowdStrike's Falcon sensor, a security product that operates in kernel mode to monitor application behavior for potential threats. He highlights the importance of the Windows Hardware Quality Labs (WHQL) certification for device drivers and the risks associated with running unsigned code in kernel mode. The paragraph also covers the likely cause of the CrowdStrike blue screens: a dynamic definition file that contained only zeros instead of valid data, which the driver then failed to handle.
🛠️ Troubleshooting and Resolving the CrowdStrike Blue Screen
Dave provides a practical guide to troubleshooting and resolving the CrowdStrike blue screen issue. He explains the role of boot drivers and how CrowdStrike's driver was marked as essential for the Windows operating system to start. He offers a step-by-step solution for users to fix their machines by booting into safe mode and manually removing the problematic CrowdStrike update file. Dave also touches on the broader question of why Windows isn't more resilient to such issues and the limitations of the system's recovery options.
Keywords
💡CrowdStrike
💡Blue Screen
💡Kernel Mode
💡User Mode
💡Device Driver
💡Ring Zero
💡WHQL Certification
💡Boot Driver
💡Parameter Validation
💡Crash Dump
💡Safe Mode
Highlights
Dave, a retired software engineer from Microsoft, explains the CrowdStrike issue and its impact on machines.
The CrowdStrike blue screens are a result of a bad update to the CrowdStrike software.
Understanding the key difference between kernel mode and user mode is crucial for grasping the issue.
Kernel mode executes at a higher privilege level, managing core system functions like hardware interaction and memory management.
User mode is where applications run, with limited access to system resources, ensuring stability and security.
A kernel driver failure, like the one in CrowdStrike, can lead to a system-wide crash, unlike user mode crashes.
CrowdStrike's Falcon sensor operates in kernel mode to detect and prevent malware attacks proactively.
CrowdStrike's approach involves running code in kernel mode, which is risky due to the potential for system-wide impact.
The WHQL certification process ensures that drivers are robust and trustworthy, but CrowdStrike updates may bypass this for agility.
CrowdStrike's dynamic definition files, which should contain malware definitions, were found to be all zeros, causing the crash.
The lack of resilience and inadequate error checking in CrowdStrike's driver is a significant issue.
CrowdStrike marked their driver as a boot driver, making it essential for the system to start, which exacerbates the problem.
Fixing the issue involves booting into safe mode and manually removing the problematic CrowdStrike update file.
The absence of the problematic update file resolves the issue without causing additional problems.
Dave's experience at Microsoft involved handling similar issues, providing valuable insights into the current CrowdStrike situation.
The importance of robust testing and certification for kernel mode drivers is highlighted by the CrowdStrike incident.
Dave's video offers a detailed explanation and practical solution for dealing with the CrowdStrike blue screen issue.
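The highlights about the all-zero definition file and the missing error checking come down to parameter validation: reject bad input and fail the call instead of crashing. A minimal sketch of that idea follows. The `MAGIC` header and the function name are invented for illustration, since the real file format is not described in the video.

```python
from typing import Optional

MAGIC = b"DEFS"  # hypothetical 4-byte header; not the real file format


def parse_definition(data: bytes) -> Optional[bytes]:
    """Return the payload if the blob looks sane, otherwise None."""
    if len(data) <= len(MAGIC):
        return None  # too short to hold a header plus any payload
    if not any(data):
        return None  # every byte is zero
    if not data.startswith(MAGIC):
        return None  # header does not match the expected marker
    return data[len(MAGIC):]
```

A caller that receives `None` can skip the file and keep running with its previous definitions, which is the "fail the function call" behavior described in the transcript.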
Transcripts
hey I'm Dave welcome to my shop I'm Dave
Plummer a retired software engineer from
Microsoft going back to the MS-DOS and
Windows 95 days and thanks to my time as
a Windows developer today I'm going to
explain what the CrowdStrike issue
actually is the key difference in kernel
mode and why these machines are blue
screening as well as how to fix it if
you come across one now I've got a lot
of experience waking up to blue screens
and having them set the tempo of my day
but this Friday was a little different
however first off I'm retired now so I
don't debug a lot of blue screens and
second I was traveling in New York City
which left me temporarily stranded as
the airlines sorted out the digital
carnage but that downtime gave me plenty
of time to pull out the old MacBook and
figure out what was happening to all the
Windows machines around the world as far
as we know the CrowdStrike blue screens
that we've been seeing around the world
for the last several days are the result
of a bad update to the CrowdStrike
software but why so today I want to help
you understand three key things first
why the CrowdStrike software is on the
machines at all and second what happens
when a kernel driver like CrowdStrike
fails and finally we'll look at
precisely why the CrowdStrike code
faults and brings the machines down and
how and why this update caused so much
havoc as systems developers at Microsoft
in the 1990s handling crashes like this
was part of our normal bread and butter
every Dev at Microsoft at least in my
area had two machines for example when I
started in Windows NT I had a Gateway
486 DX2/50 as my main dev machine and
then some old 386 box as a debug machine
normally you'd run your test or debug
bits on the debug machine while
connected to it as the debugger from
your good machine on nights and weekends
however we did something far more
interesting we ran a process called
NT stress now NT stress was a bundle
of tests that would automatically
download to the test machines and run
under the debugger and so every night
every test machine along with all the
machines in the various labs around
campus would run NT stress and put it
through the gauntlet the stress tests
were normally written by our test
Engineers who were software developers
specially employed back in those days to
find and catch bugs in the system so as
an example they might write a test to
Simply allocate and use as many GDI
brush handles as possible if doing so
causes the drawing subsystem to become
unstable or causes some other program to
crash then it would be caught and
stopped in the debugger immediately the
following day all of the crashes and
assertions would be tabulated and
assigned to an individual developer
based on the area of code in which the
problem occurred as the developer
responsible you would then use
something like telnet to connect to the
target machine debug it and sort out
what went wrong all this debugging was
done in assembly language whether it was
Alpha MIPS PowerPC or x86 and with
minimal symbol table information so it's
not like we had Visual Studio connected
still it was enough information to sort
out most crashes find the code
responsible and either fix it or at
least enter a bug to track it in our
database the hardest issues to sort out
were the ones that took place deep
inside the operating system kernel which
executes at ring zero on the CPU you see
the operating system uses a ring system
to bifurcate code into two distinct
types kernel mode for the operating
system itself and user mode where your
applications run kernel mode does tasks
such as talking to the hardware and the
devices managing memory scheduling
threads and all of the really core
functionality that the operating system
provides application code never runs in
kernel mode and kernel code never runs
in user mode kernel mode is more
privileged meaning it can see the entire
system memory map and what's in memory
at any physical page in any instance
user mode only sees the memory map pages
that the kernel wants you to see so if
you're getting the sense that the kernel
is very much in control that's an
accurate picture even if your
application needs a service provided by
the kernel it won't be allowed to just
run down inside the kernel and execute
it instead your user thread will reach
the kernel boundary and then raise an
exception and wait a kernel thread on
the Kernel side then looks at the
specified arguments fully validates
everything and then runs the required
kernel code when it's done the kernel
thread returns the results to the user
thread and lets it continue on its merry
way there is one other substantive
difference between kernel mode and user
mode when application code crashes the
application crashes when kernel mode
crashes the system crashes it crashes
because it has to imagine a case where
you had a really simple bug in the
kernel that freed memory twice when the
kernel code detects that it's about to
free already freed memory it can just
detect that this is a critical failure
and when it does it blue screens the
system because the alternatives could be
worse consider a scenario where this
double freed code is allowed to continue
maybe with an error message maybe even
allowing you to save your work the
problem is that things are so corrupted
at this point that saving your work
could do more damage erasing or
corrupting the file beyond repair worse
since it's the kernel system that's
experiencing the issue application
programs are not protected from one
another in the same way the last thing
you want is Solitaire triggering a kernel
bug that damages your Git enlistment and
that's why when an unexpected condition
occurs in the kernel the system is just
halted this is not a Windows thing by
any stretch it is true for all modern
operating systems like Linux and macOS
as well in fact the biggest difference
is the color of the screen when the
system goes down on Windows it's blue
but on Linux it's black and on macOS
it's usually pink but as on all systems
a kernel issue is a reboot at a minimum
now that we know a bit about kernel mode
versus user mode Let's talk about what
specifically runs in kernel mode
and the answer is very very little the
only things that go in the kernel mode
are things that have to like the thread
scheduler and the heap manager and
functionality that must access the
hardware such as the device driver that
talks to a GPU across the PCIe bus and
so the totality of what you run in
kernel mode really comes down to the
operating system itself and device
drivers and that's where CrowdStrike
enters the picture with their Falcon
sensor Falcon is a security product and
while it's not just simply an antivirus
it's not that far off the mark to
look at it as though it's really anti-
malware for the server but rather than
just looking for file definitions it
analyzes a wide range of application
behavior so that it can try to
proactively detect new attacks before
they're categorized and listed in a
formal definition and to be able to see
that application behavior from a clear
vantage point that code needed to be
down in the kernel without getting too
far into the weeds of what CrowdStrike
Falcon actually does suffice it to say
that it has to be in the kernel to do it
and so CrowdStrike wrote a device
driver even though there's no hardware
device that it's really talking to but
by writing their code as a device driver
it lives down with the kernel in ring
zero and has complete and unfettered
access to the system data structures and
the services that they believe it needs
to do its job now everybody at Microsoft
and probably at CrowdStrike is aware of
the stakes when you run code in kernel
mode and that's why Microsoft offers the
WHQL certification which stands for
Windows Hardware Quality Labs drivers
labeled as WHQL certified have been
thoroughly tested by the vendor and then
have passed the Windows Hardware Lab Kit
testing on various platforms and
configurations and are signed digitally
by Microsoft as being compatible with
the Windows operating system by the time
a driver makes it through the WHQL lab
test and certifications you can be
reasonably assured that the driver is
robust and trustworthy and when it's
determined to be so Microsoft issues
that digital certificate for that driver
as long as the driver itself never
changes the certificate remains
valid but what if you're CrowdStrike
and you're agile ambitious and
aggressive and you want to ensure that
your customers get the latest protection
as soon as new threats emerge every time
something new pops up on the radar you
could make a new driver and put it
through the Hardware Quality Labs get it
certified signed and release the updated
driver and for things like video cards
that's a fine process I don't actually
know what the WHQL turnaround time is
like whether that's measured in days or
weeks but it's not instant and so you'd
have a time window where a zero day
could propagate and spread simply
because of the delay in getting an
updated CrowdStrike driver built and
signed what CrowdStrike opted to do
instead was to include definition files
that are processed by the driver but not
actually included with it so when the
CrowdStrike driver wakes up it
enumerates a folder on the machine
looking for these dynamic definition
files and it does whatever it is that it
needs to do with them but you can
already perhaps see the problem let's
speculate for a moment that the
CrowdStrike dynamic definition files are not
merely
malware definitions but complete
programs in their own right written in a
p-code that the driver can then execute
in a very real sense then the driver
could take the update and actually
execute the p-code within it in kernel
mode even though that update itself has
never been signed the driver becomes the
engine that runs the code and since the
driver hasn't changed the cert is still
valid for the driver but the update
changes the way the driver operates by
virtue of the p-code that's contained in
the definitions and what you've got then
is unsigned code of unknown provenance
running in full kernel mode all it would
take is a single little bug like a null
pointer dereference and the entire temple
would be torn down around us put more
simply while we don't yet know the
precise cause of the bug executing
untrusted p-code in the kernel is risky
business at best and could be asking for
trouble we can get a better sense of
what went wrong by doing a little
postmortem debugging of our own first we
need to access a crash dump report the
kind you used to get in the good old NT
days but are now hidden behind the happy
face blue screen
depending on how your system is
configured though you can still get the
crash dump info and so there was no real
shortage of dumps around to look at
here's an example from Twitter so let's
take a look about a third of the way
down you can see the offending
instruction that caused the crash it's
an attempt to move data to register nine
by loading it from a memory pointer in
register 8 couldn't be simpler the only
problem is that the pointer in register
8 is garbage it's not a memory address
at all but a small integer of 9C hex
which is likely the offset of the field
they're actually interested in within
the data structure but they almost
certainly started with a null pointer
then added 9C to it and then just
dereferenced it now debugging something
like this is often an incremental
process where you wind up establishing
okay so this bad thing happened but what
happened upstream beforehand to cause
the bad thing and in this case it
appears that the cause is the dynamic
data file downloaded as a .sys file
instead of containing p-code or a malware
definition or whatever was supposed to
be in the file it was all just zeros we
don't know yet how or why this happened
as CrowdStrike hasn't publicly released
that information yet what we do know to
an almost certainty at this point
however is that the CrowdStrike driver
that processes and handles these updates
is not very resilient and appears to
have inadequate error checking and
parameter
validation parameter validation means
checking to ensure that the data and
arguments being passed to a function and
in particular to a kernel function are
valid and good if they're not it should
fail the function call not cause the
entire system to crash but in the
CrowdStrike case they've got a bug they
don't protect against and because their
code lives in ring zero with the kernel
a bug in CrowdStrike will necessarily
bug check the entire machine and deposit
you into the very dreaded recovery blue
screen now even though this isn't a
Windows issue or a fault with Windows
itself many people have asked me why
Windows itself isn't just more resilient
to this type of issue for example if a
driver fails during boot why not try to
boot next time without it and see if
that helps and Windows in fact does
offer a number of facilities like that
going back as far as booting NT with the Last
Known Good registry hive but there's a
catch and that catch is that
CrowdStrike marked their driver as what's
known as a boot driver a boot driver is
a device driver that must be installed
to start the Windows operating system
most boot drivers are included in driver
packages that are in the box with
Windows and Windows automatically
installs these boot start drivers during
the first boot of the system my guess
is that CrowdStrike decided they didn't
want you booting at all without their
protection provided by their system but
when it crashes as it does now your
system is completely borked fixing a
machine with this issue is fortunately
not a great deal of work but it does
require physical access to the machine
to fix a machine that's crashed due to
this issue you need to boot it into safe
mode because safe mode only loads a
limited set of drivers that mercifully
can still contend without this boot
driver you'll still be able to get into
at least a limited system then to fix
the machine use the console or the file
manager and go to the path
Windows and then System32
drivers CrowdStrike in that folder find
the file matching the pattern C and then
a bunch of zeros 291 .sys and delete
that file or anything that's got the 291
in it with a bunch of zeros when you
reboot your system should come up
completely normal and operational the
absence of the update file fixes the
issue and does not cause any additional
ones it's a fair bet that the update 291
won't ever be needed or used again so
you're fine to nuke it if you found
today's episode to be any combination of
informative or entertaining remember I'm
mostly in this for the subs and likes so
I'd be honored if you consider
subscribing to my channel and leaving a
like on this video and if you're already
subscribed thank you please consider
sending this video to a friend if you
think it covered the subject well and
please do check out the free sample of
my new book on Amazon The Nonvisible
Part of the Autism Spectrum it's
intended for folks that don't have ASD
but who suspect they might have a few
characteristics that put them somewhere
on the autism spectrum it's everything I
know now about living a successful life
on the spectrum that I wish I'd known
long ago check it out at the link in the
video description in the meantime and in
between time hope to see you next time
right here in Dave's Garage