CrowdStrike Outage Explained by Keith Barker CCIE

CBT Nuggets
23 Jul 2024 · 10:52

Summary

TLDR: In July 2024, a CrowdStrike incident led to over 8 million Windows computers experiencing the 'Blue Screen of Death' (BSOD), affecting numerous services globally. Keith Barker uses the analogy of a castle's security zones to explain how a faulty update to CrowdStrike's Falcon software, which runs in kernel mode, caused the crashes. He outlines the resolution process involving safe mode and Microsoft's updated recovery tool, and emphasizes the importance of better Quality Assurance (QA) to prevent such widespread system failures.

Takeaways

  • 😕 A CrowdStrike incident in July 2024 caused over 8 million Windows computers to display the 'Blue Screen of Death' (BSOD), impacting services worldwide.
  • 🕵️‍♂️ The incident affected not only individual computers but also critical services such as businesses, airlines, and hospitals, causing widespread disruptions.
  • 🏰 Keith Barker used a castle analogy to explain system security, with 'Area Zero' being the most secure part, analogous to 'Ring Zero' in computer systems.
  • 🛡 In computer systems, 'Ring Zero' is the most secure area, where the operating system and critical functions run, while 'Ring One' is less secure and where most applications operate.
  • 🔍 The CrowdStrike Falcon software, designed as an advanced anti-malware program, runs in 'Ring Zero' (kernel mode) for direct access to system resources.
  • 🚫 The BSOD occurred due to an update in the Falcon software that introduced a faulty file, causing system failures when running in kernel mode.
  • 🔄 The faulty update file was identified by a name beginning 'C-00000291-' ("C-", five zeros, "291", then a dash); once deployed, it led to system crashes and the BSOD because it was used in a critical system area.
  • 🛠️ To resolve the issue, users are advised to boot into safe mode, remove the problematic Falcon update files, and then reboot the system.
  • 🔒 Additional complications in recovery may arise for systems using BitLocker, requiring extra steps for recovery.
  • 🛑 Microsoft updated their recovery tool on July 22nd to assist IT admins with two repair options: booting from WinPE or recovering from safe mode.
  • 🔄 The incident could have been avoided with better Quality Assurance (QA) on software updates, or by running Falcon not in kernel mode but as a user application to prevent system-wide crashes.

Q & A

  • What is the main topic of the video by Keith Barker?

    -The main topic of the video is the CrowdStrike incident that occurred in July 2024, which affected over 8 million Windows computers and caused a widespread 'Blue Screen of Death'.

  • What is the 'Blue Screen of Death' (BSOD)?

    -The 'Blue Screen of Death' (BSOD) is an error screen displayed on Windows computers when a critical system error occurs, often resulting in a system crash and requiring a reboot.

  • How did the CrowdStrike incident impact people who did not personally experience a BSOD?

    -The incident impacted people indirectly by causing disruptions to services such as businesses, airlines, hospitals, and other critical services around the world, leading to missed flights and appointments due to system downtimes.

  • What is the analogy used by Keith Barker to explain how the CrowdStrike incident occurred?

    -Keith Barker uses the analogy of a castle with different security areas. In this analogy, 'Area Zero' represents the most secure area (ring zero in computer systems), while 'Area One' represents the outer perimeter (ring one in computer systems).

  • What are 'ring zero' and 'ring one' in the context of computer systems?

    -In computer systems, 'ring zero' refers to the most privileged area, also known as kernel mode, where the operating system and critical functions run. 'Ring one' is the video's name for the less privileged area, also known as user mode, where most applications run (on x86 hardware, user mode is technically ring 3).

  • What is the role of the CrowdStrike Falcon software?

    -CrowdStrike Falcon is an anti-malware program that runs on Windows computers to help identify and prevent malware. It is designed to be highly efficient in catching and preventing malicious activities.

  • Why did the Falcon software cause the 'Blue Screen of Death'?

    -The Falcon software caused the 'Blue Screen of Death' because it was running in kernel mode (ring zero), and an update to the software introduced an incorrect file that caused the application to fail, leading to a system crash.

  • What is Windows Hardware Quality Labs (WHQL) and its significance in the CrowdStrike incident?

    -WHQL is a certification process where Microsoft tests and approves third-party software and drivers. The Falcon software was certified through WHQL, indicating that it was tested and approved by Microsoft, but the incident was caused by an update delivered after certification.

  • How can a computer affected by the CrowdStrike incident be resolved?

    -To resolve the issue, a computer can be rebooted into safe mode, where the updated files causing the problem can be identified and deleted. Afterward, the system can be rebooted normally.

  • What complications might arise during the recovery process for servers or systems using BitLocker?

    -For servers without a GUI or systems using BitLocker, additional steps are required for recovery. Servers may require scripting to make changes, and BitLocker systems may need the decryption keys to proceed with the recovery process.

  • How could the CrowdStrike incident have been avoided?

    -The incident could have been avoided with better Quality Assurance (QA) on the updates for the Falcon software or by running the Falcon software as an application in user mode (ring one) instead of kernel mode (ring zero), which would limit the impact of a crash to the application itself rather than the entire system.

Outlines

00:00

😲 The CrowdStrike Incident Overview

Keith Barker introduces the CrowdStrike incident that occurred in July 2024, affecting over 8 million Windows computers worldwide. He outlines the three main questions to be addressed: what happened, why it happened, and how to resolve it. Barker explains the widespread impact, including service interruptions in businesses, airlines, hospitals, and more, highlighting the 'Blue Screen of Death' (BSOD) as the primary symptom. He uses the analogy of a castle to describe the secure areas within a computer system, with 'Area Zero' being the most secure, akin to 'Ring Zero' in computer terms, where critical functions operate.

05:00

🛡️ The Technical Explanation of the Incident

Barker delves into the technical aspects of the CrowdStrike incident, explaining the roles of 'Ring Zero' and 'Ring One' in a computer system's architecture. He describes how applications operate in 'User Mode' or 'Ring One' and how core system functions, including those of CrowdStrike's Falcon software, operate in 'Kernel Mode' or 'Ring Zero'. The incident occurred due to an update in the Falcon software, which, running in kernel mode, caused a system-wide halt when it failed, leading to the BSOD. Barker also discusses the certification of the Falcon software through Windows Hardware Quality Labs (WHQL) and the consequences of the update that went wrong on July 19th.

10:03

🔄 The Resolution and Prevention of Future Incidents

The resolution of the incident involves rebooting affected computers into 'Safe Mode' and deleting the problematic Falcon update files, which are identified by the '291' in their file names. Barker acknowledges the complexity of this process, especially for servers without a screen or graphical user interface (GUI) attached and for systems using BitLocker. He also mentions the updated Microsoft recovery tool released on July 22nd, which aids in the repair process. To prevent such incidents, Barker suggests better quality assurance (QA) for software updates and the consideration of running security software like Falcon in 'User Mode' instead of 'Kernel Mode' to avoid system-wide crashes.
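The manual cleanup step described above can be sketched in Python. This is only an illustration, not CrowdStrike's or Microsoft's tooling: the directory is passed in by the caller (the video does not name it), the pattern `C-00000291*` is an assumption derived from the file name read out in the video, and the function names are hypothetical. The default is a dry run that only lists matches.

```python
from pathlib import Path

# Assumed pattern: "C-", five zeros, "291", then anything (per the video's description).
FAULTY_PATTERN = "C-00000291*"


def find_faulty_files(directory: Path, pattern: str = FAULTY_PATTERN) -> list[Path]:
    """Return the files in `directory` whose names match the faulty-update pattern."""
    return sorted(p for p in directory.glob(pattern) if p.is_file())


def remove_faulty_files(directory: Path, dry_run: bool = True) -> list[Path]:
    """List the matching files and, unless `dry_run` is set, delete them."""
    matches = find_faulty_files(directory)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_faulty_files(some_dir)` first and reviewing the returned list before repeating the call with `dry_run=False` keeps the destructive step explicit.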

Keywords

💡CrowdStrike

CrowdStrike is a cybersecurity technology company that specializes in cloud-delivered protection of endpoints, identity, and workloads. In the context of the video, it is the company whose Falcon software was involved in the incident that caused the blue screen of death on numerous Windows computers in July 2024.

💡Blue Screen of Death (BSOD)

The Blue Screen of Death is an error screen displayed on Windows computers when a critical system error occurs, causing the system to crash. In the video, it is the main issue that affected over 8 million Windows computers, leading to widespread disruption in various services such as businesses, airlines, and hospitals.

💡Ring Zero and Ring One

In the video, Ring Zero refers to the most privileged level, where the kernel and critical system code run, while Ring One is the label used for the less privileged level where user applications operate (on x86 Windows, user-mode code actually runs in ring 3). The video uses these terms to explain how the CrowdStrike Falcon software, running in Ring Zero, caused the BSOD when an update led to a critical error.

💡Kernel Mode

Kernel mode is a privileged mode of operation in computer systems where the processor executes kernel-level code, which has unrestricted access to system resources. The video discusses how the CrowdStrike software, running in kernel mode, impacted the system's stability when it encountered an error.

💡User Mode

User mode is a level of operation in computer systems where applications run with limited access to system resources and cannot directly access hardware. The video contrasts user mode with kernel mode to illustrate why the CrowdStrike software's error in kernel mode led to a system-wide crash rather than just an application crash.
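The isolation that user mode provides can be illustrated with ordinary processes. The sketch below is an analogy only, not how Windows implements protection rings: it starts two child Python processes, lets one of them fail, and shows that the failure is reported as an exit status while the parent and the other child carry on. The helper name `run_app` is hypothetical.

```python
import subprocess
import sys


def run_app(code: str) -> int:
    """Run `code` in a separate Python process and return its exit status."""
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode


# A deliberately failing "application": it exits with a non-zero status.
crashed = run_app("raise SystemExit(3)")
# A second "application" started afterwards is unaffected by the first one failing.
healthy = run_app("print('still running')")

print("crashed app exit status:", crashed)
print("healthy app exit status:", healthy)
```

A fault in kernel-mode code has no such boundary around it, which is why the same kind of failure there halts the whole system.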

💡Falcon Software

Falcon is the product of CrowdStrike, described as an advanced antimalware program. The video explains that the software's update in July 2024 contained an error that, due to its operation in kernel mode, led to the widespread BSOD incident.

💡WHQL

WHQL stands for Windows Hardware Quality Labs, a program by Microsoft that tests and certifies hardware and software for compatibility with Windows. The video mentions that the CrowdStrike Falcon software was WHQL certified, indicating it passed Microsoft's testing before the incident occurred.

💡Dynamic Updates

Dynamic updates refer to the process where software components are updated without the need for a full application update. In the video, it is mentioned that the CrowdStrike Falcon software used dynamic updates to deploy changes, which unfortunately included an error that caused the BSOD.
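One way to limit the damage of a bad dynamic update is to validate the new file before using it and to keep the last known-good content otherwise. The sketch below assumes a made-up JSON format with `version` and `rules` fields; it does not reflect Falcon's real file format or update mechanism, and all names are hypothetical.

```python
import json

# Hypothetical schema for an update file; not Falcon's actual format.
REQUIRED_KEYS = {"version", "rules"}


def parse_update(raw: bytes) -> dict:
    """Parse and validate an update; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"unreadable update: {exc}") from exc
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        raise ValueError("update is missing required fields")
    if not isinstance(data["rules"], list):
        raise ValueError("'rules' must be a list")
    return data


def apply_update(current: dict, raw: bytes) -> dict:
    """Return the new content if it validates, otherwise keep `current`."""
    try:
        return parse_update(raw)
    except ValueError:
        return current  # fall back to the last known-good content
```

With this shape, a corrupt or incomplete file is rejected at the boundary instead of being handed to the component that depends on it.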

💡Safe Mode

Safe mode is a diagnostic mode of operation for Windows systems that starts the system with a minimal set of drivers and services. The video describes using safe mode as part of the resolution process to remove the problematic update from the Falcon software that caused the BSOD.

💡BitLocker

BitLocker is a data protection feature in Windows that encrypts the contents of the system drive to protect against data theft. The video mentions that if a system is using BitLocker, additional steps may be required to recover from the BSOD caused by the CrowdStrike incident.

💡QA (Quality Assurance)

Quality Assurance is the process of ensuring that products or services meet certain quality standards before they are released. The video suggests that better QA on the updates for the Falcon software could have prevented the BSOD incident, highlighting the importance of thorough testing in software development.
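Beyond pre-release testing, one common QA safeguard is a staged rollout, where an update reaches a few machines first and stops spreading if any of them becomes unhealthy. The sketch below is a generic illustration under assumed names (`staged_rollout`, `deploy`, `stage_sizes`); it is not a description of CrowdStrike's deployment process.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_sizes: Sequence[int] = (1, 10, 100),
) -> list[str]:
    """Deploy in progressively larger batches and stop at the first failure.

    `deploy(host)` is a caller-supplied function that returns True if the
    host is healthy after the update. The return value lists every host
    that received the update, including the one that failed, if any.
    """
    updated: list[str] = []
    sizes = list(stage_sizes)
    start = 0
    while start < len(hosts):
        # Use the next configured stage size, then everything that remains.
        size = sizes.pop(0) if sizes else len(hosts) - start
        batch = hosts[start:start + size]
        for host in batch:
            updated.append(host)
            if not deploy(host):
                return updated  # halt the rollout; later hosts are untouched
        start += len(batch)
    return updated
```

With a faulty update, `staged_rollout(hosts, deploy)` touches only the first host before stopping, rather than every machine at once.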

Highlights

Introduction to the CrowdStrike incident that affected over 8 million Windows computers in July 2024.

Impact of the incident extended beyond direct blue screen of death occurrences, affecting services like airlines and hospitals worldwide.

An analogy of a castle is used to explain the secure areas within a computer system, comparing outer perimeters to 'ring one' and the innermost secure area to 'ring zero'.

Explanation of the difference between kernel mode (ring zero) and user mode (ring one) in operating systems.

The CrowdStrike Falcon software, designed as an advanced anti-malware program, inadvertently caused the blue screen of death due to its operation in kernel mode.

Falcon software's certification through Windows Hardware Quality Labs (WHQL) signifies Microsoft's approval but does not prevent issues with later updates.

The specific update on July 19th introduced a faulty file, causing the Falcon driver to fail and result in the blue screen of death.

Resolution involves booting into safe mode to delete the problematic update files, which can be challenging for servers without GUIs or systems using BitLocker.

Microsoft's update to their recovery tool on July 22nd provided additional support for IT admins to expedite the repair process.

The incident could have been avoided with better Quality Assurance (QA) on software updates or by not running Falcon in kernel mode.

The importance of thorough testing in QA to prevent system-wide failures like the CrowdStrike incident.

The potential trade-off between running security software in kernel mode for enhanced functionality and the increased risk of system crashes.

The incident's impact on hundreds of millions of people globally, highlighting the interconnectedness of digital systems in daily life.

The role of dynamic updates in software, which can introduce risks if not properly tested, as seen in the CrowdStrike incident.

The complexity of resolving system-wide issues when they occur, especially in environments without direct user interfaces.

The significance of the CrowdStrike incident as a case study in the balance between security software effectiveness and system stability.

Transcripts

play00:00

hello and welcome my name is Keith

play00:02

Barker and I'd like to give you a

play00:03

high-level overview of what you need to know

play00:05

about the CrowdStrike incident that

play00:07

happened in July 2024 and these are the

play00:10

three main questions I'd like to cover

play00:11

with you right now first of all what

play00:13

happened secondly why did it happen and

play00:14

third and fairly important to people are

play00:16

still cleaning up after this how do we

play00:18

resolve it so let's begin with question

play00:20

number one what exactly happened to over

play00:22

8 million Windows computers and are you

play00:24

impacted first of all it would look like

play00:26

the result looks like this this is a

play00:27

representation of the blue screen of

play00:30

death or as his friends call it BSOD

play00:33

for the acronym for blue screen of death

play00:35

and for the question of are you impacted

play00:37

the answer is yes even if you personally

play00:39

didn't have a blue screen of death on

play00:40

your Windows computer it's very likely

play00:42

there was interruptions to other

play00:43

services that you very likely were

play00:45

attempting or were going to use things

play00:47

like businesses and Airlines and

play00:49

hospitals and various critical services

play00:51

around the world so I have friends who

play00:53

missed flights I had children that missed

play00:55

doctor's appointments all because those

play00:57

systems were down due to their blue

play00:58

screen of death so again whether this

play01:00

happened to you personally or you were

play01:01

impacted by somebody else's computer

play01:03

systems having blue screens of death

play01:05

this impacted hundreds of millions of

play01:06

people all over the planet in one way or

play01:09

another so when something like that

play01:10

happens one of the questions might be

play01:12

well how in the world did this occur so

play01:14

let's tackle that next and a great way

play01:16

to understand why this happened would be

play01:18

to use the analogy of a castle so let's

play01:21

imagine we have a castle right here so

play01:22

this would be the perimeter and then

play01:24

within the castle there's other secure

play01:26

areas it's very likely that's where we

play01:27

keep the king or the royalty whoever it

play01:29

happens to be in the most secure area

play01:31

so we'll call this area zero the most

play01:33

secure area and the outer perimeter here

play01:35

we'll go ahead and call that area one so

play01:37

we can think of this area zero here in

play01:39

the castle the innermost part of it as

play01:40

the most secure area now if somebody

play01:42

needs access to the king or to the staff

play01:44

of the king or to the royalty what can

play01:46

happen is a request can be made for that

play01:49

access maybe we need to get a decision

play01:50

from the king or some other request

play01:52

needs to be made that request is made

play01:54

and then inside this most secure area

play01:56

the decisions are made and then the

play01:58

results are handed back and the concept

play02:00

is if something negative happens out

play02:01

here in area one let's say we have I'm

play02:03

going to go ahead and draw an X to

play02:04

represent something negative happening

play02:05

hopefully that's not going to impact the

play02:07

area zero because there's extra security

play02:09

to get to area zero the goal is to not

play02:12

let those negative events impact the

play02:14

area zero the most secure area where the

play02:16

king and the treasures are all kept or

play02:18

in the case of a queen where the queen

play02:19

and all the treasures are kept so how

play02:21

does this story of the castle with a

play02:23

most secure area and a less secure area

play02:26

apply to why this blue screen of death

play02:28

happened as part of the CrowdStrike

play02:30

incident in July of 2024 and here's how

play02:33

it applies I'm going to draw the same

play02:34

diagram again except this time I'm going

play02:36

to call this ring one and think of it

play02:39

like area one for the castle the outside

play02:41

perimeter but in a computer system it's

play02:42

referred to as ring one and then the

play02:44

operating system and the most critical

play02:46

functions are going to be running in a

play02:47

separate and more secure area called

play02:49

ring zero and just like our analogy of

play02:52

the castle and area one and area zero

play02:55

inside of a computer system we have

play02:57

these similar areas except they're

play02:58

referred to as rings the most secure area here is

play03:00

ring zero and the less secure area or

play03:02

the outside perimeter here is referred

play03:04

to as ring one and here at ring zero the

play03:06

operating system is handling core

play03:08

functions for the operating system

play03:10

itself so when we have other

play03:11

applications such as Microsoft Office

play03:14

applications or other programs or

play03:15

browsers that we're running they're

play03:16

running in ring one it's also referred

play03:18

to as user mode so these little red

play03:21

boxes I'm going to refer to as

play03:22

applications think of them like user

play03:24

applications that are running in ring

play03:25

one now for those applications to work

play03:27

they need resources for example they

play03:29

need to get to memory or they may need

play03:30

to write to the disk or they may need to

play03:32

write to the network or make requests

play03:33

from the network so when those apps need

play03:35

resources they're interacting and making

play03:36

this request to the ring zero components

play03:38

and right here at ring zero we have

play03:40

things like memory management and access

play03:42

to the hardware and all the super secure

play03:44

services that the operating system is in

play03:46

charge of and that would also include

play03:47

things such as drivers so when an

play03:49

application needs resources it's making

play03:51

those requests and then hopefully those

play03:52

requests are being granted back to the

play03:54

applications from the operating system

play03:56

another common term for ring zero and

play03:58

ring one are kernel mode and I'll match

play03:59

set the same color here and just think

play04:01

of kernel mode as the most secure area

play04:03

of the operating system that gets direct

play04:05

access to the hardware and resources and

play04:06

again that's where the operating system

play04:07

runs and drivers run in ring zero or in

play04:10

kernel mode and ring one where most

play04:12

applications run that's referred to as

play04:14

user mode some parentheses I'll just

play04:16

shout out ring zero for kernel mode and

play04:19

next to user mode I'll go ahead and put

play04:20

ring one just as a reminder now what's

play04:23

the benefit of having two of these

play04:24

different modes kernel mode at ring zero

play04:26

and user mode well the benefit is if we

play04:28

have an application that goes sideways

play04:31

it has a problem or an issue hopefully

play04:33

we just want that application to die by

play04:35

itself and not take down the entire

play04:37

system and that's normally the case with

play04:39

applications that are run in user mode

play04:41

for example let's imagine we're running

play04:43

our favorite network-based game on our

play04:45

computer it's running in user mode in

play04:48

ring level one and it crashes the intent

play04:50

is for just that application to crash

play04:52

because of its problem and not to take

play04:54

down the entire operating system and get

play04:55

a blue screen of death so let me clean

play04:57

this up just a teeny bit and let's talk

play04:59

about why

play05:00

the Falcon application let's take a look

play05:01

at what that is why that from CrowdStrike

play05:04

caused the blue screen of death

play05:06

CrowdStrike makes some software that

play05:07

runs on a Windows computer that helps to

play05:10

identify and prevent malware so think of

play05:12

it like an anti-malware program on

play05:15

steroids it's really really efficient

play05:17

until it brings down the computer but

play05:18

for the moment let's just go ahead and

play05:20

label out that their product called

play05:22

Falcon again think of Falcon like an

play05:23

antivirus or antimalware software and as

play05:26

far as the impact goes the blue screen

play05:28

of death happened to individuals and

play05:30

computer systems where they were running

play05:32

the Falcon service from CrowdStrike now

play05:34

the reason that the Falcon software

play05:36

caused the blue screen of death was

play05:38

because it had two things that were

play05:39

currently working at the same time

play05:41

number one the Falcon software is

play05:42

running in kernel mode here at ring zero

play05:46

so if there's a problem with the

play05:47

software and it ruins effectively the

play05:49

kernel that's going to cause the blue

play05:50

screen of death instead of the computer

play05:52

still trying to continue which if

play05:54

there's problems at ring zero with

play05:55

access to memory for example two

play05:57

different programs writing into the same

play05:58

memory space or walking on top of each

play06:00

other the operating system is designed

play06:02

to halt instead of continuing which

play06:04

could lead to further data corruption so

play06:06

their code the Falcon code is being run

play06:08

as a device driver here at ring zero in

play06:11

kernel mode and there's probably some

play06:12

great reasons why they're doing that one

play06:14

would be they want more access and

play06:16

direct access to make sure that their

play06:18

antimalware and antivirus etc. software

play06:21

is currently working correctly and able

play06:22

to go ahead and really catch everything

play06:24

that's trying to happen and it's not

play06:25

like they just walked up and said hey

play06:27

can we do this with Microsoft they went

play06:28

through a process called

play06:32

WHQL which is an acronym for Windows

play06:34

Hardware Quality Labs and they were

play06:37

certified which effectively means that

play06:39

Microsoft tested and worked with their

play06:40

software validated it and said yep

play06:42

you're good to go we approve of this and

play06:45

here's the rub even though the Falcon

play06:47

software was certified through WHQL so

play06:49

let's go ahead and imagine this is Falcon

play06:51

right here so it's been certified it's

play06:52

been signed by Microsoft that it's good

play06:54

to go the underlying components of

play06:56

Falcon periodically get updates and

play06:59

that's where the whole thing went south

play07:00

so let's imagine the Falcon software

play07:02

itself has some subordinate files think

play07:04

of them like files that the Falcon

play07:06

software itself uses as part of its

play07:08

operation and let's go ahead and put

play07:09

file one and file two and file three and

play07:12

so if they need to update some

play07:13

components of the Falcon driver their

play07:15

software effectively what they can do is

play07:17

update those files and then they have an

play07:18

updated component as part of the Falcon

play07:21

software and that's deployed to clients

play07:23

that are using this system from CrowdStrike

play07:25

called Falcon through dynamic updates

play07:27

well even though the Falcon software was

play07:30

certified by Microsoft in the event that

play07:33

they do an update and they have a

play07:34

corrupt or incorrect file that they're

play07:36

using as part of the Falcon software

play07:38

because it's running in kernel mode at

play07:40

ring zero if there's a problem with one

play07:41

of those supporting files that's being

play07:42

used by Falcon that could and did cause

play07:45

a problem as part of the update that

play07:47

happened on July 19th with these files

play07:49

that were being used by the Falcon

play07:51

driver which again is our code running

play07:52

at ring zero in kernel mode it can be

play07:54

identified as C- then five zeros

play07:58

291 dash and then some extension which

play08:00

won't matter too much because it's this

play08:02

update that caused the problem so to

play08:04

answer the question why did this happen

play08:05

customer systems that were using

play08:08

CrowdStrike's Falcon software when they got

play08:09

the dynamic update the update had an

play08:11

incorrect file and as a result the

play08:13

incorrect file caused the application to

play08:15

fail and because it was at ring zero or

play08:17

kernel mode that caused the blue screen

play08:19

of death so next let's turn our

play08:20

attention to how it is being resolved

play08:22

now they can push out a new update and

play08:24

they have since the actual incident

play08:26

happened but if a computer boots up to

play08:28

the blue screen of death it's not going

play08:30

to be able to continue with any other

play08:31

kind of update because at that point

play08:33

it's halted and the process for

play08:35

correcting this if you're sitting at the

play08:36

computer that has a blue screen of death

play08:38

would be to reboot the computer into

play08:40

what's known as safe mode and then in

play08:42

safe mode you drill down to the file

play08:44

structure find any of the updated files

play08:46

from Falcon with this

play08:48

00000291 and then delete that file

play08:51

or any files with that 291 update and

play08:54

then go ahead and reboot and that works

play08:56

great most of the time except for the

play08:57

fact that a lot of servers don't have a

play09:00

GUI connected to them for example a GUI

play09:02

is a graphical user interface a lot of

play09:04

servers don't have a screen attached to

play09:06

them all the time and as a result if you

play09:08

have hundreds or thousands of servers

play09:10

that are running in a data center and

play09:12

they don't have screens that you

play09:13

can just walk up to and work with it may

play09:15

take a little more time or scripting to

play09:17

actually make that change another

play09:18

complication comes in if we're using a

play09:21

security feature on the file system

play09:22

called BitLocker so if a system is

play09:24

using BitLocker there are some

play09:26

additional steps and depending if you

play09:27

have the keys or not additional steps

play09:29

beyond that to get fully recovered just

play09:32

be aware there is some additional work

play09:33

to do if you're currently using

play09:35

BitLocker also as part of the recovery

play09:37

process to aid that Microsoft updated

play09:39

their Microsoft recovery tool on July

play09:43

22nd and that recovery tool provides two

play09:45

repair options to help IT admins

play09:47

expedite the repair process so one of

play09:49

those options is booting from WinPE to

play09:52

facilitate the repair and the other

play09:54

option is doing the recover from safe

play09:56

mode and I guess the final thing we

play09:57

should chat about is how could this be

play09:59

avoided and the answer is QA so even

play10:03

though the Falcon driver or the Falcon

play10:05

code itself which was running at ring

play10:07

zero is certified via WHQL from

play10:09

Microsoft the updates that they were

play10:12

doing obviously were not thoroughly

play10:14

tested enough to prevent the blue screen

play10:16

of death so the two solutions would be

play10:18

one is better QA on the updates for the

play10:20

Falcon software or secondly they could

play10:23

run the Falcon software not in ring zero

play10:25

in kernel mode but rather run it as an

play10:27

application which might degrade some of

play10:29

its efficiencies in identifying malware

play10:31

but at least if it crashed it would only

play10:33

crash itself and it wouldn't cause the

play10:35

entire system because it's not a kernel

play10:37

zero application it wouldn't cause the

play10:38

entire system to crash so thanks for

play10:40

joining me in this video as we've

play10:41

addressed three elements number one what

play10:43

happened number two why it happened and

play10:46

third how it's being resolved and until

play10:48

next time I'm Keith Barker and stay safe
