CrowdStrike Update: Latest News, Lessons Learned from a Retired Microsoft Engineer

Dave's Garage
24 Jul 2024 · 17:25

Summary

TL;DR: In this video, Dave, a retired Microsoft software engineer, discusses the recent CrowdStrike Falcon cybersecurity platform outage caused by a faulty sensor configuration update. He provides technical details, updates on conspiracy theories, and broader lessons learned from the incident, emphasizing the need for better security practices and communication.

Takeaways

  • 👋 Introduction: Dave, a retired Microsoft software engineer, discusses the recent CrowdStrike Falcon cybersecurity platform outage.
  • 🔧 Technical Details: The outage was caused by a faulty sensor configuration update in the Falcon platform, specifically a malformed 'Channel file 291'.
  • 💥 Impact: Approximately 8.5 million Windows devices were affected, leading to significant disruptions across various industries, including banking, airlines, and emergency services.
  • 🛠️ Quick Fix: CrowdStrike identified the issue and deployed a fix to prevent further machines from being affected but did not automatically fix the already impacted systems.
  • 👨‍💻 Manual Intervention: System administrators and IT professionals worldwide had to manually boot affected machines into safe mode to remove the corrupted update file and reboot.
  • 🤔 Microsoft's Role: Despite the issue being primarily with CrowdStrike, the reliance on kernel drivers in Windows raises questions about Microsoft's platform design.
  • 🔄 Past Incidents: CrowdStrike has had similar issues affecting Debian Linux and Rocky Linux, indicating a pattern of problems with their updates.
  • 🍎 Cross-Platform: CrowdStrike also provides security solutions for macOS, but the Falcon sensor for macOS does not install kernel extensions due to Apple's deprecation of them.
  • 🛡️ Microsoft's Challenges: The Windows platform requires deep integration for security functionalities, which currently necessitates kernel-side code, posing stability risks.
  • 🏛️ Regulatory Hurdles: According to sources Dave cites, Microsoft developed an advanced API for security applications like CrowdStrike, but EU regulators deemed it anti-competitive and prohibited its implementation.
  • 📚 Lessons Learned: The incident highlights the need for better communication, crisis management, and possibly reconsidering the reliance on kernel mode code for security solutions.

Q & A

  • Who is Dave and what is his background?

    -Dave is a retired Microsoft software engineer who started working on Windows back in the early 1990s. He now runs a shop and creates content, including updates on the latest news and speculations, particularly focusing on cybersecurity issues.

  • What was the cause of the recent CrowdStrike IT outage?

    -The recent CrowdStrike IT outage was caused by a faulty sensor configuration update in their Falcon cybersecurity platform. The update involved a malformed configuration file known as Channel file 291, which triggered a logic error in the CrowdStrike kernel driver, resulting in system crashes.

  • How many devices were impacted by the CrowdStrike IT outage?

    -Approximately 8.5 million devices worldwide were impacted by the CrowdStrike IT outage, causing significant disruptions across various industries.

  • What was the nature of the 'fix' CrowdStrike deployed after identifying the issue?

    -The 'fix' CrowdStrike deployed was to prevent more machines from being affected by the faulty update. However, for the machines that had already taken the update, the fix did not automatically resolve the issue; it required manual intervention by system administrators or users to boot into safe mode, delete the corrupted update file, and reboot.

  • Why is it ironic that the IT outage is often associated with Microsoft despite it being a CrowdStrike issue?

    -It is ironic because the issue primarily lies with CrowdStrike's platform and not specifically with Windows itself. However, the perception might be due to the fact that the impact manifested on the Windows platform, which is developed by Microsoft.

  • What similar issues did CrowdStrike face with non-Windows operating systems?

    -CrowdStrike faced a similar issue with Debian Linux on April 19th, when a flawed update caused systems to crash and prevented normal reboots. Another issue occurred on May 13th affecting Rocky Linux servers, which experienced freezes after upgrading to Rocky Linux 9.4; it was linked to a Linux sensor operating in user mode combined with specific 6.x kernel versions.

  • Why doesn't the CrowdStrike sensor for macOS install kernel extensions?

    -The CrowdStrike sensor for macOS does not install kernel extensions because, starting with macOS Big Sur and later versions, Apple deprecated the use of kernel extensions entirely. Instead, CrowdStrike has rearchitected its sensor to use system extensions provided by Apple.

  • What is the role of a kernel driver and why is it considered risky?

    -A kernel driver has very intimate access to the system's most inner workings, allowing for low-level system access necessary for certain security functionalities. However, it is risky because if anything goes wrong with the kernel driver, the system must blue screen to prevent further damage to user settings, files, and security.

  • What was the impact of the regulatory body's decision on Microsoft's advanced API for security applications?

    -The regulatory body, concerned with fair competition, deemed the advanced API anti-competitive and prohibited its implementation. This decision was based on the fear that the API could create a dependency on Microsoft's ecosystem, effectively locking out competitors who couldn't leverage the same level of access to the Windows core.

  • What are some of the lessons that can be learned from the CrowdStrike IT outage?

    -Lessons include the potential risks of relying on a single vendor for critical infrastructure, the need for critical systems like 911 to be on an N-1 or N-2 update schedule, and the importance of proper vetting and testing of software updates to prevent widespread impact from bugs.

  • What is the significance of the Tylenol crisis in the context of corporate crisis management?

    -The Tylenol crisis is significant as it set a new standard for corporate crisis management through transparency, decisiveness, and a focus on consumer safety. It demonstrated the power of ethical leadership and the importance of maintaining open communication during a crisis.

  • What are some of the conspiracy theories that emerged following the CrowdStrike outage?

    -Some conspiracy theories suggest that the outage was a deliberate cyber attack signaling the onset of World War III, while others propose it was orchestrated by political figures to influence geopolitical events. However, these theories lack evidence and are speculative.

  • Why is it important for a device driver to properly vet its input?

    -It is important for a device driver to properly vet its input to prevent access violations and system crashes. Even if the input files are signed, the code needs to sanity check the contents to ensure they are valid and not corrupted, which can help avoid reliance on luck and prevent potential system failures.
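
The last answer's point about sanity-checking input can be illustrated with a minimal Python sketch. The file layout here (a `CHNL` magic value, an entry count, fixed-size entries) and every name in the code are hypothetical, invented for illustration; the video does not describe the real channel file format, and a real driver would be written in C, not Python. The idea shown is only that each field is checked before it is used, and that an invalid file is rejected in favor of the last known-good rules.

```python
import struct

# Hypothetical layout, invented for illustration only:
#   4-byte magic b"CHNL", little-endian uint32 entry count,
#   then `count` fixed-size 16-byte entries.
MAGIC = b"CHNL"
HEADER = struct.Struct("<4sI")
ENTRY_SIZE = 16
MAX_ENTRIES = 4096


class InvalidChannelFile(ValueError):
    """Raised when a content file fails a sanity check."""


def parse_channel_file(data: bytes) -> list[bytes]:
    """Validate every field before using it; reject instead of crashing."""
    if len(data) < HEADER.size:
        raise InvalidChannelFile("file shorter than header")
    if not any(data):
        raise InvalidChannelFile("file is all zeros")
    magic, count = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise InvalidChannelFile("bad magic")
    if count > MAX_ENTRIES:
        raise InvalidChannelFile("implausible entry count")
    expected = HEADER.size + count * ENTRY_SIZE
    if len(data) != expected:
        raise InvalidChannelFile("length does not match entry count")
    # Every slice below is in bounds because the total length was checked.
    return [
        data[HEADER.size + i * ENTRY_SIZE : HEADER.size + (i + 1) * ENTRY_SIZE]
        for i in range(count)
    ]


def load_or_keep_previous(data: bytes, previous: list[bytes]) -> list[bytes]:
    """Fall back to the last known-good rules if the new file is invalid."""
    try:
        return parse_channel_file(data)
    except InvalidChannelFile:
        return previous
```

A signature on the file says who produced it, not that its contents are well formed, which is why the checks run regardless of signing.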

Outlines

00:00

🛠️ CrowdStrike Outage and Technical Details

Dave, a retired Microsoft software engineer, introduces the video by discussing the recent CrowdStrike outage. The incident was caused by a faulty sensor configuration update in their Falcon cybersecurity platform. The update involved a malformed configuration file, triggering a logic error in the CrowdStrike kernel driver, leading to system crashes and the infamous blue screen of death on impacted Windows systems. Approximately 8.5 million devices were affected, causing significant disruptions across various industries. CrowdStrike quickly identified the issue and deployed a fix, but the fix only prevented further machines from being affected, leaving the responsibility of fixing the already impacted machines to system administrators. The video also touches on similar issues that occurred on non-Windows platforms, highlighting that the problem is not specific to Windows.
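
The manual remediation Dave describes (boot into safe mode, delete the Channel File 291 update from the CrowdStrike folder, reboot) can be sketched in Python. This is an illustration, not a supported tool: the folder path and file-name pattern are assumptions taken from CrowdStrike's public remediation guidance rather than from the video, the function names are hypothetical, and the script would still have to be run from safe mode or a recovery environment.

```python
from pathlib import Path

# Assumption: default install path and file-name pattern from CrowdStrike's
# public remediation guidance; neither is stated in the video.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def find_channel_291_files(directory: Path = DEFAULT_DIR) -> list[Path]:
    """Return matching files, or an empty list if the folder is missing."""
    if not directory.is_dir():
        return []
    return sorted(directory.glob(PATTERN))


def remove_channel_291_files(
    directory: Path = DEFAULT_DIR, dry_run: bool = True
) -> list[Path]:
    """List the matches and delete them only when dry_run is False."""
    matches = find_channel_291_files(directory)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

The dry-run default is a deliberate choice: on a machine that is already in trouble, listing what would be deleted before deleting it costs one extra call.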

05:01

🔍 Deep Integration and Microsoft's Role

Dave delves into the technical aspects of why CrowdStrike's Falcon sensor operates in kernel mode, suggesting that it's necessary for deep integration with the operating system. He discusses Microsoft's efforts to provide security functionality through APIs like Windows Defender Application Control and Device Guard, which allow for application control and interaction with the operating system. Dave mentions a potential solution that Microsoft had developed to prevent such disasters, an advanced API for security applications. However, this API was deemed anti-competitive by European Union regulators and was not implemented. The video also contrasts Microsoft's approach with Apple's, highlighting the challenges of maintaining backward compatibility and the need for Microsoft to provide official APIs to ensure system stability.

10:02

🏥 Tylenol Crisis and Crisis Management

Dave compares the CrowdStrike outage to the Tylenol crisis of the 1980s, where Johnson & Johnson faced a major public relations challenge after their product was tampered with, leading to deaths. The company's CEO, James Burke, led a response characterized by transparency, decisiveness, and a focus on consumer safety. This included a nationwide recall and the introduction of tamper-evident packaging. The Tylenol crisis serves as an example of ethical leadership and effective crisis management, with the company's actions restoring consumer confidence and ultimately strengthening the brand's reputation. Dave suggests that both Microsoft and CrowdStrike could learn from this example, emphasizing the importance of transparency and consumer safety.

15:03

💡 Speculation and Conspiracy Theories

The video concludes with Dave speculating on what went wrong inside the CrowdStrike driver, suggesting that the issue might be related to null pointer dereferencing due to the all-zero update file. He criticizes the lack of input vetting in the CrowdStrike driver, emphasizing the importance of never trusting user input, especially in device drivers. Dave also addresses various conspiracy theories that have emerged following the outage, such as the idea that it was a deliberate cyber attack or a political maneuver. He prefers to attribute the incident to incompetence rather than malice, while also discussing the broader implications for critical infrastructure and the need for multiple update schedules for critical systems.
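
The idea of keeping critical systems one or two releases behind (the N-1 and N-2 schedules mentioned in the video) can be sketched as a small version-selection function. The policy names, the function, and the version strings are hypothetical and only illustrate the selection logic, not any vendor's actual update mechanism.

```python
# Hypothetical policy names, for illustration only.
POLICY_OFFSET = {"latest": 0, "n-1": 1, "n-2": 2}


def version_for_policy(released: list[str], policy: str) -> str:
    """Pick the version a host should run.

    `released` is ordered oldest to newest. A host on "n-1" runs the release
    before the newest; if there are too few releases, it gets the oldest.
    """
    if not released:
        raise ValueError("no releases available")
    offset = POLICY_OFFSET[policy]
    index = max(len(released) - 1 - offset, 0)
    return released[index]


releases = ["7.14", "7.15", "7.16"]  # made-up version numbers
dispatch_center = version_for_policy(releases, "n-2")
office_laptop = version_for_policy(releases, "latest")
```

Under this scheme a defect in the newest release reaches the hosts on `latest` first, which gives the hosts on `n-1` and `n-2` time to be held back.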

Keywords

💡CrowdStrike

CrowdStrike is a cybersecurity technology company that provides security solutions to prevent and detect cyber threats. In the video, it's central to the discussion as the source of a software update that went wrong, impacting millions of devices and causing significant disruptions across various industries.

💡Falcon Platform

The Falcon Platform is CrowdStrike's cybersecurity platform that offers a suite of services including endpoint protection. The video discusses an issue with the platform's sensor configuration update, which led to system crashes, illustrating the platform's critical role in security operations.

💡Blue Screen of Death (BSOD)

The 'Blue Screen of Death' is a colloquial term for an error screen displayed when Windows operating systems encounter a critical system error. The video refers to this phenomenon as a result of the faulty CrowdStrike update, indicating the severity of the issue.

💡Kernel Driver

A kernel driver is a software component that allows an operating system's kernel to interact directly with the hardware. The video discusses the implications of CrowdStrike's kernel driver being involved in the incident, highlighting the deep system access and potential risks associated with kernel-level code.

💡Malicious Named Pipes

In computing, named pipes are a method of inter-process communication. The video mentions that the faulty update was intended to target malicious named pipes used in command and control frameworks, indicating the update's purpose was to enhance security against such threats.

💡System Extensions

System extensions in the context of macOS are a way for apps to modify system behavior without kernel extensions. The video notes that CrowdStrike's approach on macOS uses system extensions, which is a safer alternative to kernel extensions deprecated by Apple.

💡Windows Filtering Platform (WFP)

The Windows Filtering Platform is a set of APIs that allows applications to interact with the network stack in Windows. The video suggests that WFP could be an alternative to kernel drivers for security solutions like CrowdStrike, offering a way to interact with network traffic without the risks of kernel-level code.

💡Tamper Evident Packaging

Tamper evident packaging is a design feature that makes it obvious if a product has been altered or opened without authorization. The video references this concept in the context of the Tylenol crisis, illustrating how companies can restore consumer trust after a crisis.

💡Regulatory Bodies

Regulatory bodies are organizations that govern and enforce rules within an industry to ensure fair competition and protect consumers. The video discusses how such a body in the European Union scrutinized and prohibited the implementation of a Microsoft API, which could have potentially prevented the CrowdStrike issue.

💡Trusted Platform Module (TPM)

A Trusted Platform Module is a specialized microcontroller designed to secure hardware by storing cryptographic keys and providing other security-related functions. The video mentions that even with TPM and secure boot mechanisms, the signed driver from CrowdStrike was still able to cause widespread issues, indicating the limitations of trusted computing in this scenario.

💡WHQL Lab

The Windows Hardware Quality Labs (WHQL) is a Microsoft program that tests and certifies hardware and drivers for Windows. The video points out that the CrowdStrike driver, which caused the issue, was fully tested, vetted, approved, and signed by Microsoft in the WHQL lab, raising questions about the thoroughness of such testing processes.

Highlights

Dave, a retired Microsoft software engineer, updates on the latest CrowdStrike Falcon news.

The recent CrowdStrike IT outage was caused by a faulty sensor configuration update in their Falcon cybersecurity platform.

Approximately 8.5 million devices worldwide were impacted, causing significant disruptions across various industries.

CrowdStrike quickly identified the issue and deployed a fix, but it only prevented further machines from being affected.

System administrators must manually boot affected machines into safe mode to fix the issue.

CrowdStrike previously issued a flawed update impacting Debian Linux systems.

CrowdStrike provides security solutions for macOS through its Falcon Plus platform.

Microsoft is not primarily at fault for how CrowdStrike's mistakes manifested on their platform.

Kernel drivers have intimate access to the system's inner workings but can cause system crashes if something goes wrong.

Microsoft has been working on an advanced API for security applications like CrowdStrike, but its implementation was prohibited due to regulatory concerns.

Microsoft's approach to backward compatibility is contrasted with Apple's ability to break driver models in new updates.

CrowdStrike's driver failed to properly vet its input, leading to system crashes.

CrowdStrike's issue could have been mitigated with better procedural and test layers.

Conspiracy theories about the outage include suggestions of a deliberate cyber attack or political manipulation.

Dave speculates on what went wrong inside the CrowdStrike driver, suggesting issues with null pointer dereferencing.

The importance of not relying on luck in software development is emphasized, with the need for robust input validation.

Dave recommends learning from James Burke's handling of the Tylenol crisis, emphasizing transparency and consumer safety.

The issue of code signing and trusted computing in the context of the CrowdStrike driver is discussed.

Dave questions the communication and messaging from top management regarding the incident.

Transcripts

00:01

Hey, I'm Dave. Welcome to my shop. I'm Dave Plummer, a retired Microsoft software engineer, starting on Windows back in the early 1990s, and today I'm going to update you on all the latest Falcon news, as well as some wanton speculation and even conspiracy theories on the CrowdStrike Falcon IT outage. If you watched my last video, then you already know the specific technical details of what precisely went wrong, so I'll only briefly update them here with some new info. Once we've done that, I'll update you on the latest conspiracy theories as well as consider what broader lessons can be learned from the whole debacle.

00:34

The recent CrowdStrike IT outage was caused by a faulty sensor configuration update in their Falcon cybersecurity platform. Here are the key technical details. The update involved a configuration file known as Channel File 291, designed to target newly observed malicious named pipes used in common command-and-control frameworks. The update appears to have been malformed. It then triggered a logic error in the CrowdStrike kernel driver that resulted in system crashes and the infamous blue screen of death on impacted Windows systems. Approximately 8.5 million devices worldwide were impacted, causing significant disruptions across various industries, including banks, airlines, and businesses. Even 911 service was disrupted in some areas.

01:16

CrowdStrike quickly identified the issue and deployed a fix within a few hours. They issued detailed technical guidance for affected customers, including mitigation steps and tools to identify impacted hosts. Now, I used air quotes around the word "fix" because in this case the fix only fixes the update and prevents more machines from being brought down. For the 8 million or so machines that already took the update, it does nothing at all to fix them. That's going to be up to the system administrators, office managers, and nerdy uncles around the world to fix, because each and every machine will require that a human manually boot the machine into safe mode. From there, you have to find the corrupted Channel 291 update file in the CrowdStrike folder, delete it, and reboot.

01:57

And so that's where we're at: a whole lot of techs standing around with their disk in their hand, waiting to safe boot 8 million blue-screened Windows machines. It doesn't look very good for Microsoft, which is ironic because it's primarily a CrowdStrike issue and not something specific to Windows itself. If you don't believe me, consider that on April 19th this year CrowdStrike issued a flawed update that impacted customers running Debian Linux. The update caused those systems to crash and prevented them from rebooting normally.

02:25

The issue was acknowledged by CrowdStrike the next day, but it took weeks to determine the exact cause and implement a fix. Another similar issue occurred a month later, on May 13th, this time affecting Rocky Linux. These servers experienced freezes after upgrading to Rocky Linux 9.4. This problem was linked to a Linux sensor operating in user mode combined with specific 6.x kernel versions.

02:46

Curiously absent from the list, though, is the Mac. Like a lot of folks, you might just assume that's because it's yet one more piece of software that doesn't even run on the Mac, but you'd be wrong. CrowdStrike does provide security solutions for macOS through its Falcon Plus platform. The Falcon sensor for macOS does not install kernel extensions, especially with the release of macOS Big Sur and later versions, where Apple deprecated the use of kernel extensions entirely. Instead, CrowdStrike has rearchitected its sensor to use a framework provided by Apple known as system extensions.

03:19

And while I generally hold Microsoft blameless in how CrowdStrike's mistakes manifested on their platform this time around, it all comes down to the fact that a kernel driver is involved at all. As I explained in the last video, a kernel driver has very intimate access to the system's most inner workings. As a cost, however, it brings with it the fact that if anything goes wrong with the kernel driver, the system must blue screen to prevent further damage to the user's settings, files, security, and so on.

03:45

CrowdStrike engages in the risky business of delivering kernel code to the critical path of millions of machines not because they are careless YOLO cowboys, or even in spite of that. They do it because it's the only way on Windows to get the low-level system access to do the security voodoo that they do. You see, code gets to walk on the wild side in the kernel usually for one of two reasons: either for performance reasons, or because it needs access to information or other kernel goings-on that it simply cannot get from user mode.

04:16

Back in the day when my beard was still dark red, as in the days of Windows NT 3.1, not even the video driver ran in kernel mode. It essentially ran entirely in user mode, and when it needed to access the hardware, that would be done by a proxy thread in the kernel on behalf of the video driver, and the parameters and results would be validated and marshaled back and forth between those threads. The problem is that with a Gen 4 x16 GPU connection, that's a metric crap ton of data to marshal, and it'd be a lot faster if the driver just had direct access to the hardware. And so, for performance reasons, video drivers got moved into kernel space. But the key point is that it was not a necessity; it was a performance decision made at the cost of potentially reduced reliability.

04:59

So over time, the decision has been made the other way, in favor of stability, too. The original printer subsystem for Windows used a kernel-mode driver model for printers, and while I would never dare to question the wisdom of printer designers, I'm not sure I want some intern at Brother writing my kernel code. And so, with a little wailing and gnashing of teeth, the printer driver model was moved to user mode to make Windows far more robust.

05:22

When it comes to something like CrowdStrike, the Falcon sensor is in kernel mode presumably because it needs to do things that can't be done from user mode, and to me that's where Microsoft could be responsible, because on the Windows platform, to the best of my knowledge, some of the CrowdStrike security functionality requires deep integration with the operating system that can currently only be achieved on the kernel side. That's not to say that Microsoft hasn't tried. There's WDAC, or the Windows Defender Application Control API. There's also the Windows Defender Device Guard. Together they provide mechanisms for controlling application execution and ensuring that only trusted code runs on a system. They also offer various APIs for antivirus and endpoint protection solutions to interact with the operating system. And I don't know to what extent CrowdStrike does active network filtering, but the Windows Filtering Platform, or WFP, allows applications to interact with the network stack without requiring kernel-level code.

06:18

The irony of all this is that at one point Microsoft actually tried to do the right thing. Behind the scenes, sources indicate that Microsoft had been working on a solution that could have potentially prevented such disasters.

06:30

The tech giant had developed an advanced API designed specifically for security applications like CrowdStrike's. This API promised deeper integration with the Windows operating system, offering enhanced stability, performance, and security. It was a proactive measure aimed at mitigating the risks associated with low-level system interactions, which are often fraught with complexities and potential vulnerabilities.

06:51

However, as Microsoft prepared to roll out this game-changing API, they encountered an unexpected obstacle. A regulatory body tasked with ensuring fair competition in the tech industry scrutinized the new API. The regulators in the European Union argued that providing such a powerful tool exclusively to certain applications could give Microsoft an unfair advantage, potentially stifling competition from smaller security firms that wouldn't have the same access. Now, despite Microsoft's assurances that the API would enhance security for all users, the regulators stood firm. They feared that integrating this API could create a dependency on Microsoft's ecosystem, effectively locking out competitors who couldn't leverage the same level of access to the Windows core. Consequently, the API was deemed anti-competitive and its implementation was prohibited. So allocating blame to Microsoft for inaction on an API is actually pretty unfair.

07:44

Microsoft is also in a very different position than Apple. Apple is somehow afforded the luxury of being able to do things like break an entire driver model in a new update that requires everything to be rewritten. Conversely, backwards compatibility is so deeply ingrained among Microsoft developers that it simply may not be an option. On my Mac, I've got a Universal Audio Apollo Twin X Thunderbolt sound device, and it requires that you disable all of Apple's driver signing and kernel extension security, and for weeks the machine would pink screen and reboot until they eventually got their driver more sorted.

08:15

Microsoft needs to support and export whatever functionality is needed as an official API so that security providers can build their product without putting the entire operating system at risk. Not because it's the right thing to do, but because the harsh reality is that they've got tens of millions of machines serving in mission-critical roles like 911 service that do run kernel-mode code. Those organizations deserve a system that doesn't need to run third-party kernel code to safely do its job, and only Microsoft can fix that, but only if the regulators would let them.

08:44

Now, I'm certainly not going to throw Satya under the bus for not throwing CrowdStrike and the EU under it first, but I question the communication and messaging that's coming from the top. The decision to not publicly note that this isn't a failure in Windows itself has led to the widespread misconception amongst my friends and relatives that it was a Windows update that went horribly wrong.

play09:04

I think it' be instructive to take a

play09:06

quick look at another PR nightmare that

play09:08

also wasn't the company's fault Tylenols

play09:10

crisis back in the 1980s now that might

play09:13

sound like a long time ago but keep in

play09:15

mind I'm almost 56 now

play09:18

damn I'm sorry anyway Johnson and

play09:21

Johnson faced a crisis that would become

play09:22

a defining moment in corporate crisis

play09:24

Management in September 1982 seven

play09:27

people in the Chicago area died after

play09:29

ingesting Tylenol capsules that had been

play09:31

laced with cyanide this event triggered

play09:34

Widespread Panic and could have easily

play09:35

destroyed the trust and credibility of

play09:37

the Tylenol brand entirely to say the

play09:39

least James Burke the CEO of Johnson and

play09:41

Johnson at the time spearheaded a

play09:43

response that would set a new standard

play09:45

for corporate crisis management his

play09:48

approach was characterized by

play09:49

transparency decisiveness and a focus on

play09:52

consumer safety as soon as the tampering

play09:54

was discovered Burke ordered a

play09:55

nationwide recall of Tylenol products

play09:57

totaling around 31 million bottles and

play09:59

costing the company over $100

play10:01

million this decisive action underscored

play10:04

Johnson and Johnson's commitment to

play10:06

Consumer safety over their short-term

play10:08

Financial losses Burke made it a

play10:10

priority to maintain open lines of

play10:12

communication with the public the media

play10:14

and Regulatory Agencies he ensured that

play10:16

the company was forthright about the

play10:18

risks and the steps being taken to

play10:19

address the situation this transparency

play10:22

helped to build trust with the public

play10:23

during a time of fear and uncertainty in

play10:26

the aftermath of the crisis Johnson and

play10:28

Johnson introduced tamper evident

play10:30

packaging which became an industry

play10:32

standard this move not only addressed

play10:34

immediate safety concerns but also

play10:36

restored consumer confidence in the

play10:38

product the company also launched a

play10:40

major public relations campaign to

play10:41

educate the public about new safety

play10:43

measures and reassure them about the

play10:45

product safety Burg's leadership during

play10:47

the Tylenol crisis was widely praised

play10:49

for its ethical focus he adhered to the

play10:51

company's Credo which emphasizes the

play10:53

importance of the company's

play10:54

responsibility to its consumers

play10:56

employees and community this ethical

play10:58

foundation guided all of Johnson and

play11:00

Johnson's decisions during the crisis

play11:03

the swift and responsible actions taken

play11:04

by Burke and his team not only helped

play11:06

Tylenol to recover from the crisis but

play11:08

also strengthened the brand's reputation

play11:11

Tylenol regained its market share within

play11:13

a year and the company's handling of the

play11:14

crisis became a case study in business

play11:16

schools around the world James Burke's

play11:19

masterful handling of the Tylenol crisis

play11:21

showcased the power of ethical

play11:22

leadership and set a new benchmark for

play11:24

crisis management by putting consumer

play11:27

safety first and maintaining transparent

play11:28

communication Burke was able to

play11:31

navigate one of the most challenging

play11:32

crises in the company's history and emerge

play11:34

stronger of course the Tylenol crisis

play11:37

and the CrowdStrike outage are very

play11:38

different events but I think both

play11:40

Microsoft and CrowdStrike would be wise

play11:42

to learn from James Burke's example and

play11:45

maybe it's time for a tamperproof

play11:46

kernel all this would require that the

play11:48

EU reweigh the greater public good in

play11:50

terms of critical infrastructure over

play11:52

competition in the security API business

play11:55

and speaking of trust what about code

play11:56

signing what went wrong here that a

play11:58

fully signed driver was able to bork 10

play12:00

million Windows machines remember that

play12:02

Microsoft fully tested and vetted and

play12:04

approved and signed the CrowdStrike

play12:06

driver in the WHQL lab and the driver

play12:10

didn't change just the channel update

play12:12

file did the channel files are used as

play12:15

input to the driver and we subsequently

play12:17

learned that the channel 291 update file

play12:19

was made up entirely of zeros and then

play12:22

when the driver ingested that update

play12:24

file it choked and because it was in

play12:25

kernel mode its only choice was to then

play12:27

turn blue and die that also means that all

play12:30

of the trusted platform modules and

play12:32

secure boots in the world wouldn't have

play12:33

saved you the driver was already fully

play12:36

trusted so even if you were running

play12:37

locked down to signed bits only the driver

play12:39

never changed data files like channel

play12:42

updates aren't signed as far as I know

play12:44

so a digital signature wouldn't have

play12:46

helped and even if they were signed an

play12:48

all-zero signed channel file would still

play12:50

likely have crashed the signed driver so

play12:53

in this case trusted computing was of

play12:54

little help since there have been very

play12:56

few specific technical details made

play12:58

public so far it's time to get a little

play13:00

further into the weeds with some

play13:01

speculation before moving on to some

play13:03

outright conspiracy theories my

play13:05

speculation begins with my assessment of

play13:07

what went wrong inside the CrowdStrike

play13:09

driver in the last episode we saw how

play13:11

the driver was access violating and

play13:13

crashing the system but why what caused

play13:15

it the best assessment I can come up

play13:17

with is that the code is dereferencing a

play13:19

null pointer plus an offset into a data

play13:21

structure that it is expecting to find in

play13:23

memory why their base pointer for the

play13:25

structure is null is harder to say but

play13:27

it's almost certainly tied to the fact

play13:29

that the channel update file was all

play13:31

zeros a few folks have written to ask me

play13:33

why such code can't just be placed in a

play13:35

try/except block so that if it access

play13:37

violates operation can continue and the

play13:40

answer is you can in theory and since

play13:42

the exception will be triggered on the

play13:44

attempt to write to illegal memory and

play13:46

not merely after the fact memory itself

play13:48

is protected and preserved that means

play13:50

that as long as the code with the

play13:52

exception handler can return gracefully

play13:54

and the callers upstream can in turn

play13:56

cope with the error being returned back to

play13:57

them all is well
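
To illustrate that point, here is a minimal sketch in Python rather than kernel C. It catches a fault and hands an error back up the call chain. The names `lookup_rule`, `safe_lookup`, and `evaluate` are hypothetical and are not taken from the CrowdStrike driver; a `None` table loosely stands in for the null base pointer described above.

```python
from typing import List, Optional


def lookup_rule(table: Optional[List[str]], index: int) -> str:
    """Unchecked access: raises if the table was never built (None)
    or if the index falls outside it."""
    return table[index]


def safe_lookup(table: Optional[List[str]], index: int) -> Optional[str]:
    """Wrap the risky access and turn a fault into an error value."""
    try:
        return lookup_rule(table, index)
    except (TypeError, IndexError):
        # The fault is raised before anything is modified, so it is
        # safe to report failure to the caller instead of crashing.
        return None


def evaluate(table: Optional[List[str]], index: int) -> str:
    """Upstream caller: it must cope with the error value it gets back."""
    rule = safe_lookup(table, index)
    if rule is None:
        return "skipped: no usable rule table"
    return f"applied {rule}"
```

The handler only helps because `evaluate` checks for `None`; if the caller ignored the error value, the failure would simply move one level up.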

play13:59

but I didn't want to give you the

play14:00

impression that you can just wrap

play14:01

suspect code in a try/except block and

play14:03

eat the exceptions there's a bit more to

play14:05

it than that I think the real failure

play14:07

here is on the part of the CrowdStrike

play14:09

driver in its lack of properly vetting

play14:11

its input they're not great about

play14:13

teaching it in college but one of the

play14:15

first things you learn as a real

play14:16

developer is never to trust user input

play14:19

and if you're a device driver and your

play14:21

input is a dynamically downloaded

play14:23

channel update file you can't just

play14:25

implicitly trust it even if the channel

play14:28

files were signed by CrowdStrike itself the code

play14:30

needs to sanity check the contents let's

play14:32

say you're writing a little app to read

play14:34

in a bitmap file and display it on the

play14:36

screen using the graphics card when you

play14:38

read that file into memory and pass it

play14:40

to the draw bitmap API the first thing

play14:42

that the API is going to do is to check

play14:44

the bitmap structure and header and make

play14:46

sure that it's all valid and if you pass

play14:48

that bitmap off to DirectX to render

play14:50

it with the GPU you can rest assured

play14:52

that the kernel side of the driver is

play14:54

going to carefully inspect the bitmap

play14:55

for validity in every possible sense

play14:57

before attempting to draw it and

play14:59

CrowdStrike man not so much looks like their

play15:01

code just kind of raw dogged it and

play15:03

hoped for the best but it is in life as

play15:05

it is in software you can be lucky

play15:07

sometimes but if you come to rely on

play15:09

luck it will eventually run out and

play15:11

CrowdStrike's appears to have run out

play15:13

when that channel file full of zeros

play15:14

brought down what must be a fairly

play15:16

fragile section of their code following

play15:19

the CrowdStrike outage various

play15:20

conspiracy theories have emerged on

play15:22

Twitter and Reddit one popular theory

play15:24

posits that the outage was a deliberate

play15:25

cyber attack signaling the onset of

play15:27

World War III with some connecting it to

play15:29

warnings from the World Economic Forum

play15:31

about potential global cyber threats

play15:33

another theory suggests that the outage

play15:35

was orchestrated by political figures to

play15:37

influence geopolitical events although

play15:39

there is no evidence supporting any of

play15:41

these claims as for me I try to never

play15:43

attribute to malice that which can be

play15:45

sufficiently explained by incompetence

play15:47

it's not as simple as one programmer error

play15:49

either though when I was at Microsoft I

play15:52

only wrote the odd bit of kernel code

play15:53

but the culture among the kernel guys

play15:55

was pretty hardcore the quality bar was

play15:57

extremely high as was the level of

play15:59

scrutiny that your code would receive

play16:01

from the kernel team if you wandered

play16:02

onto their turf and checked something

play16:04

into their source control even so I'm

play16:06

not going to just condemn the programmer

play16:07

especially based on the limited

play16:09

information that we have on the actual

play16:10

bug but regardless of how egregious the

play16:13

bug is or isn't there should be several

play16:15

procedural and test and review layers

play16:17

that would prevent this bug or any bug

play16:19

from having the impact that this one had
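
One such layer could be an automated sanity check that runs before a content file ships and again before the driver consumes it. The sketch below assumes a made-up header layout (magic value, version, payload length). The real channel file format is not described in this episode, so `MAGIC`, `HEADER`, and `validate_channel_file` are illustrative only.

```python
import struct
from typing import List

# Hypothetical layout, for illustration only: 4-byte magic value,
# 2-byte version, 4-byte payload length, all little-endian, no padding.
MAGIC = b"CHN1"
HEADER = struct.Struct("<4sHI")


def validate_channel_file(blob: bytes) -> List[str]:
    """Return a list of problems found; an empty list means the file
    passed these basic checks."""
    if len(blob) < HEADER.size:
        return ["file shorter than header"]

    problems = []
    if not any(blob):
        # Every byte is zero: the failure mode described in the episode.
        problems.append("file is all zeros")

    magic, version, payload_len = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        problems.append("bad magic")
    if version == 0:
        problems.append("unsupported version 0")
    if payload_len != len(blob) - HEADER.size:
        problems.append("payload length mismatch")
    return problems
```

A release pipeline could refuse to publish any file for which this returns a non-empty list, and the consumer could run the same check and skip the file rather than parse it.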

play16:21

there are a lot more lessons to consider

play16:23

here from whether or not seemingly the

play16:25

entire world's infrastructure should be

play16:27

dependent on a single vendor to whether

play16:29

critical systems like 911 need to be on

play16:31

an N-1 or an N-2 update

play16:33

schedule and what that all means and

play16:36

heaven help you if you are running

play16:37

BitLocker on the affected machine but all

play16:40

that will have to wait for a future

play16:41

episode so if you found today's episode

play16:43

to be any combination of entertaining or

play16:45

informative please remember that I'm

play16:46

mostly in this for the subs and likes

play16:48

and I'd be honored if you'd consider

play16:49

subscribing to my channel and leaving a

play16:51

like on the video if you're already

play16:53

subscribed thank you please consider

play16:55

sending this video to a friend if you

play16:57

think it's covered the subject well and

play16:59

please do check out the free sample of

play17:01

my new book on Amazon The Non-Visible

play17:03

Part of the Autism Spectrum it's

play17:05

intended for folks that don't have ASD

play17:07

but who suspect they might have a few

play17:09

characteristics that put them somewhere

play17:11

on the spectrum it's everything I know

play17:13

now about living a successful life on

play17:14

the spectrum that I wish I'd known

play17:16

long ago check it out at the link in the

play17:18

video description in the meantime and

play17:21

between time hope to see you next time

play17:23

right here in Dave's Garage
