CrowdStrike IT Outage Explained by a Windows Developer
Summary
TL;DR: Dave, a retired Microsoft software engineer, explains in his video what caused the worldwide "CrowdStrike blue screens". He covers why CrowdStrike software is on the machines in the first place, what happens when a kernel driver like CrowdStrike's fails, and why this particular update caused so much havoc. He also shows how to recover an affected machine and explains kernel mode versus user mode to make the problem easier to understand.
Takeaways
- 👋 Dave is a retired Microsoft software engineer who gives an overview of the CrowdStrike problem in his video.
- 🛠️ CrowdStrike is security software that is installed on many machines to detect and prevent malware attacks.
- 💻 The problem was caused by a bad update to the CrowdStrike software, which led to blue screens around the world.
- 🔍 Dave explains the differences between kernel mode and user mode and why a bug in kernel mode crashes the entire system.
- 🛡️ Kernel-mode code has direct access to system resources and hardware.
- 🚫 Application (user-mode) code never runs in kernel mode; when it needs a kernel service, the kernel validates the request and runs the code on its behalf.
- 🔧 CrowdStrike Falcon is a product that runs in kernel mode to analyze application behavior and detect attacks.
- 📄 Microsoft offers WHQL certification for drivers; certified drivers have passed testing and are digitally signed by Microsoft as robust and trustworthy.
- 🚨 Dave speculates that CrowdStrike's update files may contain unsigned p-code that the driver executes in kernel mode, which can lead to instability.
- 🤔 One likely reason for the crash is inadequate error checking and parameter validation in the CrowdStrike driver code.
- 🔄 To fix an affected machine, boot into Safe Mode, delete the faulty CrowdStrike update file from the System32\drivers\CrowdStrike folder, and reboot.
Q & A
What is the main topic of the video?
-The video is about the CrowdStrike incident: why the CrowdStrike software is installed on these machines, what happens when a kernel driver like CrowdStrike's fails, and why the CrowdStrike code faults and brings the machines down.
Who is Dave, and what is his connection to the topic?
-Dave is a retired Microsoft software engineer with Windows development experience going back to the MS-DOS and Windows 95 days. In the video he explains what the CrowdStrike blue screen actually is and how to fix it.
What is the difference between kernel mode and user mode?
-Kernel mode is where the operating system itself and device drivers run, with direct access to hardware and all system memory. User mode is where applications run; application code never runs in kernel mode. A crash in kernel mode takes down the whole system, whereas an application crash affects only that application.
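The user-mode half of that contrast can be seen with a small Python sketch. It starts a separate process that dies from an unhandled error and shows that the parent keeps running. The function name and the simulated bug are made up for illustration; the kernel-mode half cannot be demonstrated safely, since a kernel fault halts the machine.

```python
import subprocess
import sys


def run_crashing_child() -> int:
    """Start a separate user-mode process that dies from an unhandled error."""
    # Hypothetical "application bug": the child raises and never handles it.
    child_code = "raise RuntimeError('simulated application bug')"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,  # keep the child's traceback off our console
        text=True,
    )
    return result.returncode


if __name__ == "__main__":
    code = run_crashing_child()
    # The child is gone, but this process (and the rest of the system) is fine.
    print(f"child exited with code {code}; parent is still running")
```

A non-zero exit code here only tells the parent that the child failed. Nothing else on the machine is affected, which is exactly the isolation that kernel-mode code does not get.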
What is CrowdStrike Falcon, and why does it run in kernel mode?
-CrowdStrike Falcon is a security product that goes beyond classic antivirus: rather than only matching file definitions, it analyzes application behavior. To observe that behavior it runs in kernel mode as a device driver, which gives it unrestricted access to system data structures.
What does WHQL certification mean for drivers?
-WHQL (Windows Hardware Quality Labs) certification means that a driver has been tested by its vendor, has passed Windows Hardware Lab Kit testing on various platforms and configurations, and has been digitally signed by Microsoft as compatible with Windows. A certified driver can reasonably be assumed to be robust and trustworthy.
Why is it risky to execute p-code in kernel mode?
-Dave speculates that the dynamic definition files may contain p-code, that is, small programs that the driver interprets. If so, that code runs in kernel mode even though the update itself was never signed or put through certification. A single small bug can then crash the entire system.
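To make the "driver as engine" idea concrete, here is a minimal Python sketch of a p-code interpreter. The opcodes (PUSH, ADD, HALT) and the `run` function are invented for illustration and say nothing about CrowdStrike's real file format. The sketch only shows that a fixed, unchanged engine behaves differently depending on the data it is fed, and that the engine has to reject malformed input itself.

```python
from typing import List

# Hypothetical opcodes for a toy stack machine. Opcode 0 is deliberately
# undefined, so a file full of zeros is not a valid program.
PUSH, ADD, HALT = 1, 2, 3


def run(program: bytes) -> int:
    """Interpret a toy p-code program, validating every step."""
    stack: List[int] = []
    pc = 0  # index of the next opcode to execute
    while pc < len(program):
        op = program[pc]
        if op == PUSH:
            if pc + 1 >= len(program):
                raise ValueError("PUSH is missing its operand")
            stack.append(program[pc + 1])
            pc += 2
        elif op == ADD:
            if len(stack) < 2:
                raise ValueError("ADD needs two values on the stack")
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
            pc += 1
        elif op == HALT:
            if not stack:
                raise ValueError("HALT with an empty stack")
            return stack[-1]
        else:
            raise ValueError(f"unknown opcode {op} at offset {pc}")
    raise ValueError("program ended without HALT")


if __name__ == "__main__":
    print(run(bytes([PUSH, 2, PUSH, 3, ADD, HALT])))  # prints 5
    try:
        run(bytes(16))  # sixteen zero bytes
    except ValueError as err:
        print("rejected:", err)
```

In user mode a rejected program is just an exception. In a kernel driver the same checks are what stand between a bad update file and a bug check.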
What happens when a kernel driver like CrowdStrike's fails?
-A failing kernel driver crashes the whole system, because the driver runs in kernel mode alongside the operating system's critical functions. When the kernel detects an unexpected condition it halts the machine rather than risk further corruption.
How can you repair a computer that crashes because of the CrowdStrike problem?
-Boot the computer into Safe Mode, go to the Windows\System32\drivers\CrowdStrike folder, and delete the faulty CrowdStrike update file (the one with 291 in its name). After a reboot the system should work normally.
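The deletion step can be sketched in Python as follows. This is an illustration only, under a few assumptions: the glob pattern `C-00000291*.sys` is one reading of the "C, a bunch of zeros, 291, .sys" description in the video, the `C:` drive letter is assumed, and the helper names are made up. The default is a dry run that lists matches without deleting anything.

```python
from pathlib import Path
from typing import List

# Assumed pattern: "C-", a run of zeros, "291", anything, then ".sys".
BAD_FILE_PATTERN = "C-00000291*.sys"


def find_bad_channel_files(driver_dir: Path) -> List[Path]:
    """Return the files in driver_dir that match the faulty-update pattern."""
    if not driver_dir.is_dir():
        return []
    return sorted(driver_dir.glob(BAD_FILE_PATTERN))


def remove_bad_channel_files(driver_dir: Path, dry_run: bool = True) -> List[Path]:
    """List matching files and delete them unless dry_run is set."""
    matches = find_bad_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Assumed location; the drive letter may differ on a given machine.
    folder = Path(r"C:\Windows\System32\drivers\CrowdStrike")
    for path in remove_bad_channel_files(folder, dry_run=True):
        print("would delete:", path)
```

On a machine stuck in a boot loop there is usually no Python available in Safe Mode, so in practice the same steps are done by hand in the console or file manager, as the video describes.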
What is the difference between a boot driver and other device drivers?
-A boot driver is a device driver that must be installed to start the Windows operating system. Most boot drivers ship with Windows and are installed automatically during the first boot. CrowdStrike marked its driver as a boot driver, so Windows does not try to start without it, and a crash in that driver leaves the machine unable to boot normally.
What is the goal of Dave's video?
-The goal is to explain the CrowdStrike problem so that people understand why their computers are crashing and how to fix them.
Outlines
😀 Introduction to the CrowdStrike problem
Dave Plummer, a retired Microsoft software engineer, introduces himself and the CrowdStrike blue screen problem. Drawing on his experience as a Windows developer, he explains that the blue screens were caused by a bad update to the CrowdStrike software. He sets out to explain why the software is on the machines, how kernel mode differs from user mode, and what caused the crashes. He also recalls his time at Microsoft in the 1990s, when handling crashes was routine work: nightly stress tests ran under a debugger, and developers debugged the resulting crashes remotely in assembly language.
🔧 Kernel mode versus user mode
Dave explains the basic differences between kernel mode and user mode. Kernel mode handles core functions such as hardware access, memory management, and thread scheduling, while user mode is where applications run. He stresses that application code never runs in kernel mode and kernel code never runs in user mode, and that a crash in kernel mode brings down the whole system. He also discusses the role of drivers in kernel mode and how CrowdStrike's Falcon sensor, which can be thought of as anti-malware for servers, runs in kernel mode to analyze application behavior and detect attacks.
🛠 Troubleshooting the CrowdStrike failure
Dave then turns to troubleshooting. The crash appears to have been triggered by a dynamic definition file, downloaded as a .sys file, that contained nothing but zeros instead of valid content. The CrowdStrike driver that processes these updates crashed on it, apparently because of inadequate error checking and parameter validation. Dave recommends booting into Safe Mode and deleting the affected update file. He expects this to cause no further problems, since that update will probably never be needed again. He closes by promoting his new book and thanking his subscribers.
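The kind of check that was apparently missing can be sketched in a few lines of Python. Everything here is hypothetical: the real definition file format is not described in the video, so the sketch only tests two properties that any such loader could check before touching the content, namely a minimum size (the 16-byte threshold is arbitrary) and "not all zeros".

```python
def is_plausible_definition(data: bytes, min_size: int = 16) -> bool:
    """Cheap sanity checks to run before parsing a definition file."""
    if len(data) < min_size:
        return False  # empty or truncated download
    if not any(data):
        return False  # every byte is zero, so there is nothing to parse
    return True


def load_definition(data: bytes) -> bytes:
    """Fail the call, not the whole system, when the input is bad."""
    if not is_plausible_definition(data):
        raise ValueError("definition file rejected: empty, truncated, or all zeros")
    return data


if __name__ == "__main__":
    try:
        load_definition(bytes(1024))  # 1024 zero bytes
    except ValueError as err:
        print(err)
```

The point is the shape of the failure: bad input makes one function call fail with an error that the caller can handle, instead of being dereferenced blindly.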
Keywords
💡CrowdStrike
💡Blue Screen
💡Kernel Mode
💡Device Driver
💡Boot Driver
💡WHQL Certification
💡Parameter Validation
💡Null Pointer
💡Safe Mode
💡Postmortem Debugging
Highlights
Dave, a retired software engineer from Microsoft, explains the CrowdStrike blue screen issue.
CrowdStrike blue screens are a result of a bad update to CrowdStrike software.
Dave's experience with debugging blue screens at Microsoft in the 1990s.
Explanation of kernel mode and user mode, and their significance in system stability.
Kernel mode runs at a higher privilege level and a crash in this mode leads to a system crash.
The difference between kernel mode and user mode in terms of memory access and control.
The role of the CrowdStrike Falcon sensor and its necessity to run in kernel mode.
The process of WHQL certification for Windows drivers and its importance for stability.
CrowdStrike's approach of shipping dynamic definition files outside the WHQL-certified driver so it can react quickly to new threats.
The risk of running unsigned code in kernel mode and its potential to cause system instability.
A postmortem debugging approach to understand the cause of the CrowdStrike issue.
The discovery that the CrowdStrike dynamic definition file was corrupted with zeros.
Lack of resilience and inadequate error checking in the CrowdStrike driver.
The designation of the CrowdStrike driver as a boot driver, causing system startup issues.
A practical guide to fixing a machine affected by the CrowdStrike issue.
Dave's new book on living a successful life on the autism spectrum.
Invitation to subscribe to Dave's channel and leave a like for the informative content.
Transcripts
hey I'm Dave welcome to my shop I'm Dave
Plummer a retired software engineer from
Microsoft going back to the MS-DOS and
Windows 95 days and thanks to my time as
a Windows developer today I'm going to
explain what the crowd strike issue
actually is the key difference in kernel
mode and why these machines are blue
screening as well as how to fix it if
you come across one now I've got a lot
of experience waking up to blue screens
and having them set the tempo of my day
but this Friday was a little different
however first off I'm retired now so I
don't debug a lot of blue screens and
second I was traveling in New York City
which left me temporarily stranded as
the airlines sorted out the digital
Carnage but that downtime gave me plenty
of time to pull out the old MacBook and
figure out what was happening to all the
windows machines around the world as far
as we know the crowd strike blue screens
that we've been seeing around the world
for the last several days are the result
of a bad update to the crowd strike
software but why so today I want to help
you understand three key things first
why the crowd strike software is on the
machines at all and second what happens
when a kernel driver like crowd strike
fails and finally we'll look at
precisely why the crowd strike code
faults and brings the machines down and
how and why this update caused so much
Havoc as systems developers at Microsoft
in the 1990s handling crashes like this
was part of our normal bread and butter
every Dev at Microsoft at least in my
area had two machines for example when I
started in Windows NT I had a Gateway
486 DX2-50 as my main Dev machine and
then some old 386 box as a debug machine
normally you'd run your test or debug
bits on the debug machine while
connected to it as the debugger from
your good machine on nights and weekends
however we did something far more
interesting we ran a process called
NT stress now NT stress was a bundle
of tests that would automatically
download to the test machines and run
under the debugger and so every night
every test machine along with all the
machines in the various labs around
campus would run NT stress and put it
through the gauntlet the stress tests
were normally written by our test
Engineers who were software developers
specially employed back in those days to
find and catch bugs in the system so as
an example they might write a test to
Simply allocate and use as many GDI
brush handles as possible if doing so
causes the drawing subsystem to become
unstable or causes some other program to
crash then it would be caught and
stopped in the debugger immediately the
following day all of the crashes and
assertions will be tabulated and
assigned to an individual developer
based on the area of code in which the
problem occurred as the developer
responsible you would then use
something like telnet to connect to the
Target machine debug it and sort out
what went wrong all this debugging was
done in Assembly Language whether it was
Alpha MIPS PowerPC or x86 and with
minimal symbol table information so it's
not like we had Visual Studio connected
still it was enough information to sort
out most crashes find the code
responsible and either fix it or at
least enter a bug to track it in our
database the hardest issues to sort out
were the ones that took place deep
inside the operating system kernel which
executes at ring zero on the CPU you see
the operating system uses a ring system
to bifurcate code into two distinct
types kernel mode for the operating
system itself and user mode where your
applications run kernel mode does tasks
such as talking to the hardware and the
devices managing memory scheduling
threads and all of the really core
functionality that the operating system
provides application code never runs in
kernel mode and kernel code never runs
in user mode kernel mode is more
privileged meaning it can see the entire
system memory map and what's in memory
at any physical page in any instance
user mode only sees the memory map pages
that the kernel wants you to see so if
you're getting the sense that the kernel
is very much in control that's an
accurate picture even if your
application needs a service provided by
the kernel it won't be allowed to just
run down inside the kernel and execute
it instead your user thread will reach
the kernel boundary and then raise an
exception and wait a kernel thread on
the Kernel side then looks at the
specified arguments fully validates
everything and then runs the required
kernel code when it's done the kernel
thread Returns the results to the user
thread and lets it continue on its merry
way there is one other substantive
difference between kernel mode and user
mode when application code crashes the
application crashes when kernel mode
crashes the system crashes it crashes
because it has to imagine a case where
you had a really simple bug in the
kernel that freed memory twice when the
kernel code detects that it's about to
free already freed memory it can just
detect that this is a critical failure
and when it does it blue screens the
system because the Alternatives could be
worse consider a scenario where this
double freed code is allowed to continue
maybe with an error message maybe even
allowing you to save your work the
problem is that things are so corrupted
at this point that saving your work
could do more damage erasing or
corrupting the file Beyond repair worse
since it's the kernel system that's
experiencing the issue application
programs are not protected from one
another in the same way the last thing
you want is Solitaire triggering a kernel
bug that damages your Git enlistment and
that's why when an unexpected condition
occurs in the kernel the system is just
halted this is not a Windows Thing by
any stretch it is true for all modern
operating systems like Linux and Mac OS
as well in fact the biggest difference
is the color of the screen when the
system goes down on Windows it's blue
but on Linux it's black and on Mac OS
it's usually pink but as on all systems
a kernel issue is a reboot at a minimum
now that we know a bit about kernel mode
versus user mode Let's talk about what
specifically runs in kernel mode
and the answer is very very little the
only things that go in the kernel mode
are things that have to like the thread
scheduler and the heap manager and
functionality that must access the
hardware such as the device driver that
talks to a GPU across the pcie bus and
so the totality of what you run in
kernel mode really comes down to the
operating system itself and device
drivers and that's where crowd strike
enters the picture with their Falcon
sensor Falcon is a security product and
while it's not just simply an antivirus
it's not that far off the mark to
look at it as though it's really
anti-malware for the server but rather than
just looking for file definitions it
analyzes a wide range of application
Behavior so that it can try to
proactively detect new attacks before
they're categorized and listed in a
formal definition and to be able to see
that application behavior from a clear
vantage point that code needed to be
down in the kernel without getting too
far into the weeds of what crowd strike
Falcon actually does suffice it to say
that it has to be in the kernel to do it
and so crowd strike wrote a device
driver even though there's no Hardware
device that it's really talking to but
by writing their code as a device driver
it lives down with the kernel in ring
zero and has complete and unfettered
access to the system data structures and
the services that they believe it needs
to do its job now everybody at Microsoft
and probably at crowd strike is aware of
the stakes when you run code in kernel
mode and that's why Microsoft offers the
whql certification which stands for
Windows Hardware quality Labs drivers
labeled as whql certified have been
thoroughly tested by the vendor and then
have passed the windows Hardware lab kit
testing on various platforms and
configurations and are signed digitally
by Microsoft as being compatible with
the Windows operating system by the time
a driver makes it through the whql lab
test and certifications you can be
reasonably assured that the driver is
robust and trustworthy and when it's
determined to be so Microsoft issues
that digital certificate for that driver
as long as the driver itself never
changes the certificate remains
valid but what if you're crowd strike
and you're agile ambitious and
aggressive and you want to ensure that
your customers get the latest protection
as soon as new threats emerge every time
something new pops up on the radar you
could make a new driver and put it
through the hardware quality Labs get it
certified signed and release the updated
driver and for things like video cards
that's a fine process I don't actually
know what the whql turnaround time is
like whether that's measured in days or
weeks but it's not instant and so you'd
have a Time window where a zero day
could propagate and spread simply
because of the delay in getting an
updated crowd strike driver built and
signed what crowd strike opted to do
instead was to include definition files
that are processed by the driver but not
actually included with it so when the
crowd strike driver wakes up it
enumerates a folder on the machine
looking for these dynamic definition
files and it does whatever it is that it
needs to do with them but you can
already perhaps see the problem let's
speculate for a moment that the crowd
strike dynamic definition files are not
merely
malware definitions but complete
programs in their own right written in a
p-code that the driver can then execute
in a very real sense then the driver
could take the update and actually
execute the p-code within it in kernel
mode even though that update itself has
never been signed the driver becomes the
engine that runs the code and since the
driver hasn't changed the cert is still
valid for the driver but the update
changes the way the driver operates by
virtue of the p-code that's contained in
the definitions and what you've got then
is unsigned code of unknown provenance
running in full kernel mode all it would
take is a single little bug like a null
pointer dereference and the entire temple
would be torn down around us put more
simply while we don't yet know the
precise cause of the bug executing
untrusted p-code in the kernel is risky
business at best and could be asking for
trouble we can get a better sense of
what went wrong by doing a little
postmortem debugging of our own first we
need to access a crash dump report the
kind you used to get in the good old NT
days but are now hidden behind the happy
face blue screen
depending on how your system is
configured though you can still get the
crash dump info and so there was no real
shortage of dumps around to look at
here's an example from Twitter so let's
take a look about a third of the way
down you can see the offending
instruction that caused the crash it's
an attempt to move data to register nine
by loading it from a memory pointer in
register 8 couldn't be simpler the only
problem is that the pointer in register
8 is garbage it's not a memory address
at all but a small integer of 9C hex
which is likely the offset of the field
they're actually interested in within
the data structure but they almost
certainly started with a null pointer
then added 9C to it and then just
dereferenced it now debugging something
like this is often an incremental
process where you wind up establishing
okay so this bad thing happened but what
happened Upstream beforehand to cause
the bad thing and in this case it
appears that the cause is the dynamic
data file downloaded as a .sys file
instead of containing p-code or a malware
definition or whatever was supposed to
be in the file it was all just zeros we
don't know yet how or why this happened
as crowd strike hasn't publicly released
that information yet what we do know to
an almost certainty at this point
however is that the crowd strike driver
that processes and handles these updates
is not very resilient and appears to
have inadequate error checking and
parameter
validation parameter validation means
checking to ensure that the data and
arguments being passed to a function and
in particular to a kernel function are
valid and good if they're not it should
fail the function call not cause the
entire system to crash but in the
crowd strike case they've got a bug they
don't protect against and because their
code lives in ring zero with the kernel
a bug in crowd strike will necessarily
bug check the entire machine and deposit
you into the very dreaded recovery blue
screen now even though this isn't a
Windows issue or a fault with Windows
itself many people have asked me why
Windows itself isn't just more resilient
to this type of issue for example if a
driver fails during boot why not try to
boot next time without it and see if
that helps and windows in fact does
offer a number of facilities like that
going back as far as booting NT with the last
known good registry hive but there's a
catch and that catch is that crowd
strike marked their driver as what's
known as a boot driver a boot driver is
a device driver that must be installed
to start the Windows operating system
most boot drivers are included in driver
packages that are in the box with
Windows and windows automatically
installs these boot start drivers during
the first boot of the system my guess
is that crowd strike decided they didn't
want you booting at all without their
protection provided by their system but
when it crashes as it does now your
system is completely borked fixing a
machine with this issue is fortunately
not a great deal of work but it does
require physical access to the machine
to fix a machine that's crashed due to
this issue you need to boot it into safe
mode because safe mode only loads a
limited set of drivers that mercifully
can still contend without this boot
driver you'll still be able to get into
at least a limited system then to fix
the machine use the console or the file
manager and go to the path
Windows and then System32
drivers CrowdStrike in that folder find
the file matching the pattern C and then
a bunch of zeros 291 .sys and delete
that file or anything that's got the 291
in it with a bunch of zeros when you
reboot your system should come up
completely normal and operational the
absence of the update file fixes the
issue and does not cause any additional
ones it's a fair bet that the update 291
won't ever be needed or used again so
you're fine to Nuke it if you found
today's episode to be any combination of
informative or entertaining remember I'm
mostly in this for the subs and likes so
I'd be honored if you consider
subscribing to my channel and leaving a
like on this video and if you're already
subscribed thank you please consider
sending this video to a friend if you
think it covered the subject well and
please do check out the free sample of
my new book on Amazon the non-visible
part of the autism spectrum it's
intended for folks that don't have ASD
but who suspect they might have a few
characteristics that put them somewhere
on the autism spectrum it's everything I
know now about living a successful life
on the spectrum that I wish I'd known
long ago check it out at the link in the
video description in the meantime and in
between time hope to see you next time
right here in Dave's Garage
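The crash analysis in the transcript (a null pointer with 9C hex added to it) can be illustrated with a short Python sketch using `ctypes`. The record layout is entirely hypothetical: the padding is sized only so that the field of interest lands at offset 0x9C, matching the value seen in register 8. The real data structure is not known from the video.

```python
import ctypes


class HypotheticalRecord(ctypes.Structure):
    """Made-up layout whose interesting field sits at offset 0x9C."""

    _fields_ = [
        ("padding", ctypes.c_ubyte * 0x9C),      # stand-in for earlier fields
        ("field_of_interest", ctypes.c_uint32),  # the field the code wants
    ]


def field_address(base_address: int) -> int:
    """Address the code would read: base pointer plus the field offset."""
    return base_address + HypotheticalRecord.field_of_interest.offset


def checked_field_address(base_address: int) -> int:
    """Same computation, but refusing to work from a null base pointer."""
    if base_address == 0:
        raise ValueError("null base pointer: refusing to compute field address")
    return field_address(base_address)


if __name__ == "__main__":
    # A null base pointer yields the tiny "address" seen in the crash dump.
    print(hex(field_address(0)))  # prints 0x9c
```

Reading from address 0x9C faults because nothing is mapped there. The checked variant shows where a single null test would have turned the crash into an ordinary error.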