Intel has a Pretty Big Problem

Level1Techs
10 Jul 2024
24:19

Summary

TL;DR: The video discusses the instability issues with Intel's 13900K and 14900K processors, suggesting a deeper problem than just motherboard voltage and clock settings. The speaker investigates game telemetry and crash databases, revealing a disproportionate number of decompression errors and IO errors on these Intel chips. The analysis implies that a microcode update might not fully resolve the problem, and raises concerns about Intel's messaging to customers and system integrators regarding the CPUs' reliability.

Takeaways

  • 🤔 Intel's 13900K and 14900K processors are experiencing instability issues, with users reporting intermittent problems that are difficult to troubleshoot.
  • 🕵️‍♂️ The speaker conducted an 'armchair diagnostic' using game telemetry data from two different game companies to investigate the nature of these crashes.
  • 📊 A significant number of crashes were attributed to decompression errors, which are unusually high for Intel's 13th and 14th generation CPUs, suggesting a potential hardware issue.
  • 📈 The crash rate seems to increase over time for the problematic CPUs, indicating a possible degradation or cumulative error effect.
  • 💻 Data from game servers using Intel's 13900K and 14900K CPUs revealed similar stability issues, even on more conservative W680 chipset motherboards.
  • 💾 The speaker found that IO errors were also disproportionately high for the affected Intel CPUs, suggesting a broader range of hardware issues.
  • 🛠 BIOS updates and adjustments, including disabling E-cores and reducing memory speed, were attempted as fixes but did not fully resolve the issues.
  • 💡 The possibility of a power or voltage issue within the CPUs is suggested, as the problems persist even on motherboards designed for stability.
  • 📉 Intel's messaging to customers and system integrators appears to be inconsistent, with some reports indicating a 10-25% failure rate for the CPUs.
  • 🔄 The data center provider is charging a premium for support on Intel systems due to the high number of incidents requiring intervention.
  • 🚫 The lack of clear communication from Intel and the ongoing uncertainty about the root cause of the issues are causing frustration among gamers and enthusiasts.

Q & A

  • What is the main issue discussed in the video script regarding Intel's 13900 K and 14900 K processors?

    -The main issue discussed is the instability of Intel's 13900K and 14900K processors, which has been ongoing for months and is suspected to be more than just a simple motherboard voltage or clock problem.

  • What does the speaker believe might be the deeper issue with Intel's chips based on their investigation?

    -The speaker suggests that there might be a deeper hardware issue with the chips themselves, as opposed to just a software or microcode issue, based on the high number of crashes and errors reported.

  • How does the speaker obtain data for their analysis?

    -The speaker obtains data by accessing crash databases from two different game developers who provided them with information on system configurations, play times, and crash rates.

  • What is the significance of the 'out of VRAM' error mentioned in the script?

    -The 'out of VRAM' error is significant because it is a common error reported with Intel CPUs that have the instability problem, even when the system is not actually out of VRAM, indicating a potential hardware issue.

  • What does the speaker find when analyzing decompression errors in game databases?

    -The speaker finds that there are 1,584 decompression errors logged in the past 90 days, with 1,431 of those being related to Intel's 13th or 14th generation CPUs, which is a disproportionately high number compared to other CPUs.

  • Why does the speaker believe that the issue might not be fully resolved even with a BIOS update or disabling E-cores?

    -The speaker believes this because the error rates and instability issues persist even in more conservatively configured systems, such as those using the W680 chipset in data centers, which suggests a deeper hardware problem.

  • What is the implication of the speaker's findings for game developers and data center providers?

    -The implication is that game developers and data center providers may experience higher support costs and system instability, leading some to consider alternative CPU options like AMD's 7950X for new server deployments.

  • What is the speaker's view on Intel's messaging to customers regarding the CPU issues?

    -The speaker criticizes Intel for not providing clear and concise messaging to customers, especially enthusiasts who are experiencing issues, and suggests that Intel should offer replacements for affected CPUs.

  • What steps did the speaker take to ensure the systems were configured correctly for testing?

    -The speaker updated the BIOS to the latest versions as of June 25th, 2024, and tested various configurations, including different DDR5 speeds and multipliers, to find the most stable settings.

  • What unusual phenomenon did the speaker observe in some systems prior to a hard crash?

    -The speaker observed that in some systems, the CPU would become unexplainably slow for up to a minute before a hard crash, with no clear correlation to thermal monitoring or power issues.
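The decompression-error counts quoted in the Q&A and transcript (1,584 total, 1,431 on 13th or 14th generation Intel, 11 on the i7-9750H, 4 on all AMD CPUs) are easier to judge as shares of the total. The sketch below is only arithmetic on those quoted figures; the "Other (not itemised)" bucket is an assumption that whatever remains belongs to CPUs the video does not break out.

```python
# Counts quoted in the video for the 90-day window.
total_errors = 1584
by_group = {
    "Intel 13th/14th gen": 1431,
    "Intel i7-9750H": 11,
    "All AMD CPUs": 4,
}

# Assumption: the remainder belongs to CPUs the video does not itemise.
by_group["Other (not itemised)"] = total_errors - sum(by_group.values())


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


shares = {name: share(count, total_errors) for name, count in by_group.items()}

for name, pct in shares.items():
    print(f"{name:22s} {by_group[name]:5d}  {pct:5.1f}%")
```

On these figures, roughly nine in ten logged decompression errors come from the two affected generations, which is the disproportion the speaker points to.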

Outlines

00:00

🤔 Investigating Intel CPU Instability

The script discusses the widespread instability issues with Intel's 13900K and 14900K processors, suggesting a deeper problem beyond a simple motherboard voltage or clock speed issue. The author embarks on an 'armchair diagnostic' journey, using game telemetry data from two different game developers to analyze crash reports. The data reveals inconsistencies in crashes, indicating a potential hardware issue. The author also explores the possibility of decompression errors being related to the CPUs and notes the high frequency of these errors in game databases, particularly with Intel's 13th and 14th generation chips.

05:00

🔍 Diving Deeper into CPU Crash Data

Continuing the investigation, the script examines the distribution of Intel and AMD CPUs in the crash database, finding a significant preference for Intel among the crashing systems. It also touches on the misleading nature of telemetry data due to various factors like operating systems and GPU configurations. The author highlights the peculiarity of IO errors, which seem to be more common with the problematic Intel CPUs, and notes the underrepresentation of AMD CPUs in the error logs. The analysis also includes looking at systems with high GPU memory, suggesting that the observed errors are not due to insufficient VRAM but could be related to the CPUs themselves.
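The reason the Intel/AMD split matters is the base rate: four AMD entries only mean something relative to how many AMD systems are in the population. As a rough sanity check, the sketch below asks how many decompression errors AMD systems would have logged if errors were spread in proportion to the roughly 70/30 split the transcript reports for the crash database. Treating that split as a stand-in for the whole player base is an assumption, and the speaker stresses that the data is not clean.

```python
# Figures quoted in the transcript.
total_errors = 1584   # decompression errors in the 90-day window
observed_amd = 4      # entries from any AMD CPU

# Assumption: the ~70/30 Intel/AMD split of the crash database
# approximates the split of the player population.
amd_share = 0.30

# Errors AMD systems would log if errors were evenly spread by population.
expected_amd = total_errors * amd_share

# Observed AMD errors as a fraction of that expectation.
ratio = observed_amd / expected_amd

print(f"expected on AMD if evenly spread: {expected_amd:.0f}")
print(f"observed on AMD: {observed_amd} ({ratio:.1%} of expectation)")
```

Under that assumption AMD systems show far fewer decompression errors than their share of the population would predict, so the low AMD count is unlikely to be explained by a small AMD population alone.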

10:04

📈 Unraveling Data Center CPU Issues

The script shifts focus to data centers, where the same Intel CPUs are used in server environments. It reveals that these CPUs, even when paired with more conservative motherboards built on the W680 chipset, still exhibit stability issues. The author shares insights from a negotiation process for new servers, where Intel systems are more expensive due to higher support costs associated with unresolved issues. The data center providers have experienced high support incidents, leading to a premium for Intel systems, and have even resorted to replacing CPUs or updating BIOS to mitigate the problems.

15:05

🛠️ Exploring Possible Solutions and Intel's Response

The author explores potential solutions, including BIOS updates and memory speed adjustments, finding that while these measures help, they do not completely resolve the issues. There is speculation about Intel's communication with large system integrators, suggesting that the problem rate might be higher than officially stated. The script also addresses the lack of clear messaging from Intel to its customers and the potential impact on gamers and enthusiasts who have invested in these CPUs.

20:06

🚫 The Ongoing Challenge of CPU Stability

In conclusion, the script emphasizes the ongoing challenge of CPU stability with Intel's 13900K and 14900K processors. Despite various attempts to address the issue, the root cause remains elusive, and the situation is not improving. The author calls for clearer communication from Intel and a commitment to making affected customers whole. The script ends with the author signing off with a note of uncertainty and the need for further investigation into the problem.

Keywords

💡Intel 13900K and 14900K

The Intel 13900K and 14900K are high-performance processors from Intel's 13th and 14th generation, respectively. They are central to the video's theme as the script discusses their reported instability issues, which have been causing crashes and performance problems for gamers and data centers alike. The script mentions these chips as having 'ridiculously high single thread clock speeds' and being the subject of the investigation into the crashes.
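The transcript notes that the analysis could not simply select "all 13th gen CPUs", because some 13th gen parts are rebadged 12th gen silicon, so the speaker stuck to the 13900K/KS/KF and 14900K/KS/KF. A filter of that kind might look like the sketch below. The regular expression and the sample CPU name strings are illustrative assumptions, not the telemetry tool's actual format.

```python
import re

# Matches only i9-13900 and i9-14900 parts with a K, KS or KF suffix.
TARGET = re.compile(r"\bi9-1[34]900(?:KS|KF|K)\b")


def is_target_cpu(name: str) -> bool:
    """Return True if the CPU name is one of the six models under study."""
    return TARGET.search(name) is not None


# Hypothetical CPU name strings as a telemetry record might report them.
samples = [
    "13th Gen Intel(R) Core(TM) i9-13900K",
    "Intel(R) Core(TM) i9-14900KF",
    "13th Gen Intel(R) Core(TM) i5-13400",
    "12th Gen Intel(R) Core(TM) i9-12900K",
    "AMD Ryzen 9 7950X 16-Core Processor",
]

selected = [name for name in samples if is_target_cpu(name)]
print(selected)
```

Matching on the exact model number rather than on the "13th Gen" prefix is what keeps rebadged or lower-tier parts out of the cohort.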

💡Instability

Instability in this context refers to the inconsistent and unpredictable behavior of the Intel CPUs, leading to system crashes and poor performance. The video's theme revolves around diagnosing the cause of this instability, with the script mentioning that it is 'not super consistent' and causing 'intermittent problems' which are the worst to troubleshoot.

💡Microcode Update

A microcode update is a firmware update that can fix issues in a CPU's operation by altering the low-level instructions that the CPU executes. The script raises the question of whether a microcode update can resolve the instability issues with the Intel CPUs, suggesting that the problem might be deeper than just a software fix.

💡Telemetry

Telemetry in the script refers to the data collection from games to track performance and issues such as crashes. The video discusses how the telemetry data from games was used to analyze and identify the pattern of crashes related to the Intel CPUs, highlighting the extensive use of this data in diagnosing the problem.
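To compare CPUs fairly, the speaker says crash counts had to be weighed against play time and grouped by CPU. A minimal sketch of that normalisation follows. The record layout and every number in it are hypothetical, invented for illustration; they are not values from the game studios' databases.

```python
from collections import defaultdict

# Hypothetical per-player records: (cpu_model, hours_played, crash_count).
records = [
    ("i9-13900K", 40.0, 20),
    ("i9-14900K", 60.0, 12),
    ("i9-12900K", 50.0, 1),
    ("Ryzen 9 7950X", 80.0, 2),
    ("Ryzen 9 7950X", 20.0, 0),
]


def crashes_per_hour(rows):
    """Sum hours and crashes per CPU model, then divide."""
    hours = defaultdict(float)
    crashes = defaultdict(int)
    for cpu, played, crashed in rows:
        hours[cpu] += played
        crashes[cpu] += crashed
    # Skip models with no recorded play time to avoid dividing by zero.
    return {cpu: crashes[cpu] / hours[cpu] for cpu in hours if hours[cpu] > 0}


rates = crashes_per_hour(records)
for cpu, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{cpu:15s} {rate:.2f} crashes per hour")
```

Dividing by hours played rather than counting raw crashes keeps a handful of heavy players from dominating the comparison.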

💡Decompression Failure

Decompression failure is a specific type of error mentioned in the script that occurs when the CPU is unable to properly decompress data, which is a common task in gaming. The video suggests that this is an 'Intel specific thing' and a potential indicator of the hardware issue with the 13th and 14th generation CPUs.

💡BIOS Changes

BIOS changes refer to modifications made to the Basic Input/Output System, which is firmware that initializes hardware during the booting process. The script mentions BIOS changes as one of the suggested solutions to mitigate the CPU instability issues, particularly around power and clock settings.

💡Data Center

A data center is a facility that houses a large number of servers and other computing resources. The video script reveals that the instability issues with Intel CPUs are not limited to gaming systems but also affect data centers, where these CPUs are used for gaming servers and other applications requiring high performance.

💡W680 Chipset

The W680 is a workstation-class Intel chipset that, according to the video, underpins most data-center deployments of these CPUs. Boards built on it are more conservative in terms of power and clock targets than the Z-series boards used for overclocking. The script discusses the W680 chipset in the context of data center motherboards, which are experiencing similar issues with the Intel CPUs as desktop systems.

💡E-cores

E-cores, or Efficiency cores, are a part of Intel's hybrid architecture, designed to handle less demanding tasks to improve power efficiency. The script mentions disabling E-cores as a potential mitigation strategy for the instability issues, suggesting that these cores might be implicated in the problems.

💡DDR5 Memory

DDR5 Memory is a type of random-access memory used in computers, which offers higher performance than its predecessor, DDR4. The video script discusses the conservative settings of DDR5 memory in data centers, which might be related to the stability issues with the Intel CPUs, as slower memory speeds were found to be more stable.

💡Game Telemetry

Game telemetry refers to the collection and analysis of data from games, which can include information about player behavior, game performance, and crash reports. The script uses game telemetry as a source of data to investigate the instability issues with the Intel CPUs, highlighting the prevalence of crashes and errors in games.

Highlights

Intel's 13900K and 14900K processors are experiencing instability issues, with concerns that the problem may be deeper than just a motherboard voltage or clock speed issue.

The speaker conducted an 'armchair diagnostic' to investigate the root cause of the crashes, suggesting it might not be fixable with a microcode update.

Access to game crash databases from two different game companies revealed a high number of decompression errors specifically with Intel's 13th and 14th generation CPUs.

Decompression failures were far more common on Intel's 13th and 14th generation CPUs than on any other CPU, accounting for 1,431 of the 1,584 instances logged in the past 90 days.

Game telemetry data suggests that the instability is not isolated to gaming systems, as similar issues were found in data center servers using the same CPUs.

Data center motherboards, which are designed for maximum stability, still experienced high crash rates with the 13900K and 14900K CPUs.

The speaker found that disabling E-cores and adjusting BIOS settings helped mitigate but did not fully resolve the stability issues.

There is a significant discrepancy in the error rates between Intel and AMD CPUs, with AMD showing far fewer instances of decompression failures.

The analysis of game server crash data indicates that the rate of errors may be increasing over time for the problematic Intel CPUs.

The speaker suggests that the issue might be related to power delivery or timing problems within the CPU, rather than just motherboard issues.

Intel CPUs deployed in data centers for gaming servers are reportedly more expensive and have higher support costs due to stability issues.

Some data center providers are steering customers towards AMD 7950X systems as an alternative due to unresolved issues with Intel CPUs.

The speaker highlights the importance of using game telemetry and crash data for identifying and understanding hardware issues.

There is a call for Intel to provide clear messaging and support for customers affected by the CPU instability issues.

The ambiguity and lack of clear resolution from Intel are causing frustration among gamers and data center operators alike.

The video concludes with the speaker emphasizing the need for further investigation and the potential for deeper hardware issues with the Intel CPUs.
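One of the highlights above is that the error rate appears to rise over time for the affected CPUs, though the speaker says the databases are not organised to make that easy to check. One simple way to look for such a trend is the least-squares slope of the crash rate across consecutive time buckets. The weekly values below are invented for illustration and the approach is a sketch, not the speaker's actual method.

```python
def slope(ys):
    """Least-squares slope of ys against their index 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical crashes per hour of play for one cohort, week by week.
weekly_rate = [0.10, 0.12, 0.15, 0.19, 0.22, 0.28]

trend = slope(weekly_rate)
print(f"change in crash rate per week: {trend:+.3f}")
```

A positive slope is consistent with degradation, but with a rolling 90-day window and survivorship bias it would not prove it.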

Transcripts

play00:00

we all know that Intel's got a problem

play00:01

with the 13900 K and the 14900 K being

play00:03

unstable it's been months do we really

play00:05

blame Gamers for being impatient at this

play00:08

point I'm not so sure that this is

play00:10

really just a boost motherboard voltage

play00:13

clock problem I'm starting to think

play00:15

there's something deeper wrong with the

play00:16

chips I mean it's Thursday all I want to

play00:18

do is just play Dwarf Fortress at

play00:20

ridiculously high single thread clock

play00:21

speeds I get intermittent problems are

play00:23

the worst to troubleshoot but it

play00:25

occurred to me we have another source of

play00:27

crashes the games themselves

play00:30

loads of games have Telemetry in them to

play00:33

track their crashes a data source from

play00:35

the crashes I mean so I started digging

play00:37

into that and what I found will shock

play00:38

you I'm much less sure that we can fix

play00:40

this with a BIOS or microcode update

play00:42

because I started digging into this and

play00:44

I went on a journey and now now now I'm

play00:47

troubled come with me as we attempt a

play00:49

Level1Techs armchair diagnostic it's

play00:52

a level one diagnostic but from the

play00:53

armchair right so take that with a bit

play00:55

of a grain of salt

play00:58

[Music]

play01:07

I needed data from thousands of systems

play01:10

how am I going to do that well did you

play01:12

know when you play a game that most of

play01:15

the time it logs usage data a lot of the

play01:17

times one set of analytics goes to the

play01:20

marketing team like how long you're

play01:21

going to play the game when do you open

play01:23

it what all is going on with it but

play01:24

there's another set of data that goes to

play01:26

the dev team around when the game

play01:28

crashes the game developers will get

play01:30

a crash report so I reached out to my

play01:32

contact list and I found two different

play01:34

people from two different games that

play01:35

were willing to give me access to their

play01:37

crash database so I could look around

play01:38

for interesting stuff okay full

play01:40

disclosure it took a little bit of

play01:41

finagling and convincing because I said

play01:43

hey I think this weird thing is going on

play01:45

and they said nah that's that seems weird

play01:47

I was like no the errors are probably

play01:49

not what you think they are and they

play01:50

said hm you might be right see the

play01:52

problem with the the way that the 14900

play01:54

K and the 13900 K and other CPUs in the

play01:56

13th 14th generation are crashing is

play01:58

that they're not super consistent I

play02:00

needed to know system configuration Play

play02:02

Time crash rate I need to know about the

play02:05

population of just people playing the

play02:07

game without errors in order to make

play02:08

suppositions and these databases are

play02:10

large and they're also rolling neither

play02:12

company really hangs on to untagged or

play02:14

untriaged events more than about 3

play02:16

months and some there's some exceptions

play02:18

to that for really outliers but mostly

play02:20

it's a three-month time window the

play02:22

instability is not particular as well

play02:25

it's not like the ancient problem you

play02:27

know the the Pentium F00F bug this is a

play02:29

Hardware problem which actually led to

play02:31

Intel creating their whole microcode

play02:33

system so they could patch CPU errata in

play02:35

software even when you had a hardware

play02:38

problem uh in this case the errors are

play02:40

all over the place you're even getting

play02:41

GPU errors GPU errors yeah you know the

play02:44

out of vram error that has now become

play02:46

infamous a lot of armchair experts and

play02:48

forums saying hey your game sucks you've

play02:50

got a vram leak it's got to be you and

play02:53

that was actually not the case it is

play02:55

actually a really common error with

play02:57

Intel CPUs that have this problem out of

play03:00

vram error when you're not actually out

play03:02

of vram and these crashes aren't so bad

play03:04

that you know the game totally crashes

play03:07

always though we know that users are

play03:09

experiencing worse crashes things like

play03:10

blue screens and that sort of thing but

play03:12

we can't get a crash report if the

play03:14

computer blue screens or at least most

play03:15

of these games are not set up to be able

play03:16

to do that so there is a little bit of

play03:18

survivorship bias in all the stuff that

play03:20

I'm talking about today so again grain

play03:21

of salt there's also one operation that

play03:23

stresses the CPU in a particular way

play03:26

that is decompression it is a common

play03:28

feature of game

play03:30

and check out this statement from Oodle

play03:33

regarding decompression failures when

play03:35

we're talking about their game tools RAD

play03:36

Game Tools yeah check this out this is

play03:39

an Intel specific thing now this

play03:40

decompression library is ubiquitous it's

play03:43

well-used it's probably nearly bug-free

play03:45

this is a hardware issue that's creating

play03:47

this and you can see that from this

play03:48

bulletin this article suggests some bios

play03:51

changes and some clock changes that'll

play03:53

help users mitigate and most of these

play03:55

changes are around power and clock

play03:57

settings okay fine maybe motherboard

play03:59

manufacturers pushed things too far okay

play04:02

maybe we'll come back to that but I've

play04:04

got some big databases here let's see if

play04:06

I can find this particular error the

play04:08

Oodle error how many decompression errors

play04:11

are there logged in their respective

play04:14

game databases for the past 90 days okay

play04:17

that the answer to that is

play04:18

1,584 how many of those are Intel 13th

play04:21

or 14th generation decompression errors

play04:24

1,431 what's the next most high CPU with

play04:26

an error it's an i7-9750H with just 11

play04:30

instances of decompression failure what

play04:32

about AMD CPUs maybe AMD I only saw four

play04:35

entries from any AMD CPU and you know

play04:38

that's pretty awesome don't assume that

play04:40

that means that AMD CPUs are better I

play04:41

mean I think they are in this particular

play04:43

case they're not experiencing these

play04:44

kinds of issues but there could be way

play04:47

less AMD CPUs in this population we

play04:49

don't we don't know I decided to check

play04:51

out those things and check out the

play04:53

distribution and then there's handhelds

play04:55

to worry about almost all handhelds are

play04:57

AMD CPUs not all of them but a lot of

play05:00

them okay so what's the breakdown cuz

play05:02

you know if we're only working with five

play05:04

players on AMD systems so like we we

play05:06

need to know right okay the breakdown

play05:07

between Intel CPUs and AMD CPUs in the

play05:09

crash database was about 70/30 in favor

play05:11

of Intel which suggests something about

play05:13

70% of players are using Intel 60% on

play05:16

Nvidia and the rest were among AMD and

play05:19

literally everything else this is also

play05:21

another instance where it's weird and

play05:22

misleading because you know this is game

play05:26

Telemetry data from Windows but

play05:27

sometimes this is actually game

play05:28

Telemetry data from Linux

play05:30

and this reporting tool reports uh gpus

play05:33

grouped together in the Linux scenario

play05:36

differently than it does on Windows and

play05:38

so you end up with some percentage of

play05:41

the 40% of users actually being Linux

play05:43

users that could be using an Nvidia or

play05:45

an AMD GPU so 60% really is the floor

play05:49

for NVIDIA gpus in this scenario which

play05:52

is weird I know it doesn't matter for

play05:54

this video I didn't I I chose not to

play05:56

clean up the data for GPU distribution I

play05:58

just wondered something about the

play05:59

population so we get it the data here is

play06:01

not super clean not ideal and this is

play06:03

also why I think this has flown under the

play06:05

radar a little bit fortunately the rate

play06:07

of Errors per unique player was also not

play06:09

super high there was a small number of

play06:12

people less than 200 that were suffering

play06:15

tremendously don't get me wrong but like

play06:18

the PC World article where they had

play06:20

actually swapped their CPUs I think those

play06:22

200 people actually would be better off

play06:23

just swapping their CPUs just getting an

play06:25

RMA no amount of micro code is going to

play06:27

help those users I think for the problem

play06:29

users I used those as a springboard to

play06:32

look at what other errors their systems

play06:34

had logged because they've logged a lot

play06:36

of them and I saw a lot of IO errors or

play06:39

so IO errors the game these are nvme

play06:41

errors but the game doesn't log that the

play06:43

Telemetry tool doesn't look at system

play06:46

errors as a thing that we should log it

play06:48

just says it can't retrieve a game asset

play06:50

which is not super uncommon like an nvme

play06:53

error is not really a super uncommon

play06:55

thing and once again the population of

play06:57

all systems that had any IO error grouped

play07:00

by CPU the percent of Intel 13900 K and

play07:03

14900 K were definitely over represented

play07:06

they had a much higher uh rate of errors

play07:09

in all errors for those CPUs versus any

play07:12

other CPU it's very odd the other reason

play07:15

I say that this is odd is because when

play07:16

we're talking about IO and pcie devices

play07:18

those are a different clock domain on

play07:19

the CPU like if every little overclock

play07:22

instability would result in file system

play07:24

corruption you'd be reinstalling Windows

play07:26

a lot and yeah sure a severe overclock

play07:29

can cause problems

play07:32

but most of the time you're not going to

play07:34

corrupt your SSD from a bad overclock if

play07:36

a gamer experienced 10 or more errors in

play07:38

the last 90 days you could bet at least

play07:40

one of them was an IO error of some type

play07:42

in the game where it couldn't retrieve

play07:43

an asset at least in the filtered cohort

play07:46

of users experiencing four or more

play07:47

errors in in 90 days uh there were no IO

play07:51

errors of this type from AMD systems at

play07:53

all in the last 90 days there was you

play07:57

know maybe there aren't enough AMD

play07:59

system systems I mean I really really

play08:01

dug through both game companies

play08:02

databases and I only found four errors

play08:04

that could possibly be attributed to an

play08:06

IO error or an IO error in that sort of

play08:09

a context that's not really enough data

play08:11

in my opinion to to make any sweeping

play08:12

conclusions but this IO error mostly

play08:16

does seem confined to people that are

play08:17

having real Earnest Hardware errors I

play08:20

decided to window the data down to

play08:22

people with at least 20 gigs of GPU

play08:24

memory next because it's just not

play08:26

possible for either one of these games

play08:27

to need more than that much VRAM

play08:30

Intel has also accepted that this error

play08:32

likely stems from instability specific

play08:34

to the problems we're talking about with

play08:35

13th and 14th generation Enthusiast CPUs

play08:38

so you don't have to take my word for it

play08:40

that this error is probably not actually

play08:41

related to out of vram in these specific

play08:43

games so selecting for everybody that

play08:45

had a 3090 4090 7900 XT 7900 XTX which is

play08:49

actually surprisingly difficult because

play08:50

the Telemetry tool doesn't always record

play08:52

the correct GPU when you have more than

play08:54

one GPU an iGPU plus everything else

play08:56

the oldest error that I found was one

play08:58

from uh more than six months ago which

play09:01

was logged and the dev team had spent a

play09:04

lot of time on so it wasn't cycled out

play09:06

of the system but one user had

play09:08

experienced uh about two crashes um or

play09:11

about one crash for every two hours of

play09:14

playtime that seems like a lot trying to

play09:17

find a cohort of players that play

play09:18

regularly and then grouping them by CPU

play09:21

and crash rate showed that at least for

play09:23

this game AMD players had fewer crashes

play09:26

than 13900 K and 14900 K players per

play09:28

unit of play time at least when we take

play09:31

into account the crash rate if the crash

play09:32

rate is consistent and also the 12900 K

play09:35

was shockingly better The 12900 K was

play09:37

about equivalent for AMD CPUs and about

play09:40

equivalent to everything that was in a

play09:41

13th or 14th gen CPU which is

play09:43

interesting it's also interesting that

play09:45

you can't just check 13th gen CPUs in

play09:48

general because some 13th gen CPUs are

play09:50

actually rebadged 12th gen CPUs so you

play09:52

can't just say all 13th gen CPUs you

play09:55

have to look at specific ones so for

play09:56

this video I tried to stick to just

play09:57

13900 K KS KF and 14900 K KS KF so in

play10:04

other words just because it says it's

play10:05

Intel 13th gen on the box doesn't

play10:06

actually mean it's Intel 13th gen I mean

play10:08

that's true for the K series CPUs it is

play10:09

what it says but OEM processors and

play10:12

variants

play10:13

like it just depends So based on this

play10:16

data and a lot of data that I'm not

play10:17

commenting on I would think around 20 to

play10:19

30% of players with one of those two

play10:20

CPUs have experienced at least one crash

play10:24

that can be attributed to the CPU or the

play10:26

motherboard during their lifetime of

play10:28

play

play10:30

I think that the more that these CPUs

play10:31

are used the rate of error is increasing

play10:34

over time at least I spent a lot of time

play10:36

trying to find a Smoking Gun here that

play10:38

would look like that and it seems like

play10:40

there's some data that fits that

play10:41

supposition but the analysis of that is

play10:44

very hard because the data is not

play10:46

organized in a way to really be

play10:48

conducive to that kind of searching for

play10:50

the players that play consistently the

play10:52

number of errors that they are

play10:53

encountering with our systems are

play10:54

definitely increasing over time for

play10:56

those two CPUs I can say that at least

play10:59

and the intermittent problems they're

play11:00

the worst to troubleshoot it's like

play11:02

there's intermittent contact in the

play11:03

input polarizers did somebody stick an

play11:05

iron filing in the CPU vat is it

play11:07

some sort of contaminant I don't know

play11:10

it's a leathery burnt bacon enigma

play11:12

wrapped in a

play11:14

mystery uh maybe the problem here is

play11:16

self-inflicted I mean that's certainly

play11:18

some of the messaging from Intel users

play11:19

are overclocking their system or

play11:21

motherboard makers are overclocking the

play11:23

motherboard that's got to be the source

play11:25

of the problem right well let's use our

play11:27

brain and think about counter examples

play11:29

of that oh look on that Quest I did

play11:32

actually find something interesting it

play11:34

turns out the 13900 K and the 14900 K do

play11:36

have a place in the data center they're

play11:38

they're not just for gaming CPUs they're

play11:40

also for gaming servers and guess what

play11:42

in the data center most systems are

play11:44

deployed with a motherboard that is

play11:45

based around the

play11:47

w680 chipset a w series chipset not a z-

play11:50

series chipset z- series chipsets for

play11:52

overclocking totally different

play11:53

motherboard totally different Power

play11:55

phase designs maybe Intel is just

play11:57

rebadging a Z series and they've been

play11:59

super lazy about it but I don't think so

play12:01

cuz it's designed for Xeon I mean these

play12:02

are these are alternative motherboards

play12:03

for LGA 1700 that is uh different than

play12:06

their desktop class counterparts and so

play12:10

one might expect that the CPUs would

play12:11

behave differently on these much more

play12:13

conservative motherboards because these

play12:14

motherboards are designed to operate

play12:16

well within the specifications of the

play12:18

CPU so I set about gathering crash data

play12:21

from thousands of systems already

play12:23

deployed in the data center around the W

play12:24

series chipsets and here's a screenshot

play12:27

of one of the systems that had crashed

play12:29

hard and rebooted notice a few things on

play12:33

this screenshot one the insanely

play12:35

conservative DDR5 memory speed this may

play12:37

be a result of automatic crashing and

play12:39

the ASUS BIOS backing off the memory speed

play12:41

two ASUS W680 chipset like I was saying

play12:46

I've seen similar crash screens from

play12:48

the Supermicro W680 boards which is the

play12:51

other game provider they use Supermicro

play12:53

one of them uses ASUS and the crash rate

play12:56

is pretty similar between these two I

play12:58

mean the reason these are used is because

play12:59

of the insanely High single thread clock

play13:01

speed and for Game servers it turns out

play13:02

it's actually useful W680 was created to

play13:05

go along with motherboards designed for

play13:06

maximum stability neither ASUS nor Supermicro

play13:09

motherboards really support giving

play13:11

tons of extra power to the CPU or doing

play13:13

insane overclocking for things on a

play13:15

desktop so I really don't think both

play13:17

ASUS and Supermicro have colored really

play13:19

far outside the lines on this

play13:20

motherboard and I really don't think

play13:22

Supermicro or ASUS have just lazily

play13:24

copy-pasted the voltage settings from

play13:26

their desktop motherboards to the server

play13:28

class motherboards fully 50% of

play13:30

the systems deployed for both companies

play13:32

with either one of these processors to

play13:34

within one percentage Point are

play13:36

experiencing the same stability issues

play13:37

even disabling E-cores has not fully

play13:39

resolved the issue for one of these

play13:41

companies the error rate also seems to

play13:43

be going up over time on the server side

play13:45

as well oh and get this it gets better

play13:48

one of the companies is negotiating for

play13:49

another $100,000 of servers and this

play13:52

time a line item popped up in the

play13:54

proposal from the provider the Intel

play13:56

systems are more than $1,000 more

play13:58

expensive than their equivalent AMD

play14:00

counterparts and they let me insert

play14:02

myself in the negotiation process I was

play14:05

able to get on the phone and talk to the

play14:06

data center provider oh boy they dropped

play14:08

a lot of interesting nuggets the way

play14:11

that the sales work here is you buy a

play14:13

system and it gets deployed in the data

play14:15

center but if there's an issue it's a

play14:17

support contract that the owner of the

play14:19

system never actually really touches it

play14:21

at least until it's time to be retired

play14:23

so I asked okay why is the support cost

play14:24

on these Intel systems so much higher

play14:26

than it was for roughly the same systems

play14:29

that were bought in 2023 I mean there's

play14:31

a couple minor Hardware changes but

play14:33

similar systems and they said and I

play14:36

quote support incidents have been

play14:38

unusually high for that configuration so

play14:41

recently we've had to update the BIOS

play14:43

disable E-cores or do CPU swaps to get

play14:45

the issues resolved and we're not sure

play14:47

that the issues are fully resolved so we

play14:50

are charging a support premium for those

play14:52

systems right

play14:54

now huh isn't that interesting $1,000

play14:58

extra that wasn't there six months ago

play15:00

so I asked is that is that normal have

play15:02

you had a lot of s like what's going on

play15:04

and they said we had really good luck

play15:05

with the 12900K-based systems that we

play15:07

had and we always had good luck with the

play15:09

Xeon something just isn't right with the

play15:11

13900K and 14900K we already replaced

play15:14

a lot of customer 13900K systems with

play15:16

14900K systems and the issues don't

play15:18

seem to be fully resolved we've been

play15:20

steering customers toward 7950X systems

play15:23

instead they're almost always faster

play15:27

anyway neat when talking to one of the game

play15:29

developers about this they said I think

play15:31

I'm going to lose about $100,000 in Lost

play15:34

players from their multiplayer server

play15:38

crashes yeah if you were a game player

play15:41

you'd be frustrated too makes sense

play15:44

this game is terrible it just crashes

play15:45

all the time and I get there there's a

play15:47

certain cohort of you watching this

play15:48

video out there that are going to say oh

play15:49

this is what you get for not going with

play15:50

Xeon but this wasn't a problem with 12th

play15:52

or 11th or 10th or 9th generation Intel

play15:54

Desktop CPUs for this use case

play15:56

relatively High single core performance

play15:59

and relatively low cost it makes sense

play16:01

for Game servers even if you trade minor

play16:03

stability issues for uh the cost and the

play16:07

performance but the stability tradeoff

play16:10

for 13th and 14th generation at least in

play16:12

these scenarios at least for the last

play16:13

six months to a year for them I guess

play16:15

has been terrible or at least 6 months

play16:17

or so I tried to prod them a little bit

play16:19

on what the data center was experiencing

play16:21

from you know messaging from Intel and

play16:24

support and they really didn't seem like

play16:26

they were getting much support Beyond

play16:28

just here have a tray of extra CPUs

play16:30

swap CPUs and hope for the best it

play16:32

really doesn't add up especially when

play16:33

you consider W680 and its more

play16:36

conservative power and clock targets

play16:38

that got me to thinking what is Intel

play16:39

telling large system integrators like

play16:41

Dell HP and Lenovo so I reached out to

play16:44

contacts that I have inside of those

play16:45

companies which required a little bit

play16:46

more Intrigue on my part and I'm not

play16:48

really sure that I got good Intel from

play16:50

those companies but the Intel that I did

play16:53

get said that well you can expect

play16:54

between 10 and 25% of CPUs have a

play16:56

problem or are marginal in some way and

play16:58

and we're not really sure what the root

play17:00

issue is do they say clearly if it is a

play17:03

motherboard problem or an Intel CPU

play17:05

problem the messaging seems to be that

play17:06

it's a little column A and a little

play17:08

column B even for OEM systems which also

play17:10

like W680 tend to be a little more

play17:12

conservative in power and clock

play17:15

performance configurations even when

play17:17

you're buying a K not necessarily when

play17:18

you're buying a non-K based on what I saw

play17:20

from game Telemetry and game server

play17:22

crash data I would say that 10 to 25% is

play17:24

much less than I would have guessed I

play17:26

would have guessed that about half of

play17:28

these CPUs have some type of issue with

play17:30

some clearly a lot worse than others is

play17:33

that attributable to power on time or

play17:36

how like how much they've been used or

play17:37

some overclock attempt I don't know I

play17:39

don't know in terms of specifics and

play17:41

crashing the two populations of systems

play17:43

were a little different the one provider

play17:45

uses dual-DIMM configurations and that

play17:48

seemed to suffer a lot the single-DIMM

play17:49

configurations seem to work a little

play17:51

better uh two 48 gig DIMMs versus uh four 32

play17:55

gig DIMMs opt for two 48 gig DIMMs every

play17:57

time the most stable configuration for

play18:00

testing y-cruncher 24 hours at a time

play18:03

on the Linux side was definitely

play18:05

configuring a Max multiplier of 53 and

play18:08

configuring the DDR5 speed to 4200 for

play18:12

the four-DIMM configuration 5200 was fine

play18:15

for single DIMM but 4200 uh you know

play18:18

that's it technically yes the W680 does

play18:20

support XMP but it's not recommended

play18:22

especially in a game server context so

play18:24

in order to find failures we would use a

play18:26

combination of uh decompression tests

play18:28

from the Phoronix Test Suite with

play18:29

automation and y-cruncher cuz y-cruncher

play18:31

is always pretty great and a lot of the

play18:34

time the failures were just random the

play18:35

core was random everything else is

play18:37

random there were a handful of machines

play18:39

that would have specific failures we had

play18:40

one machine that when doing SNT testing

play18:43

it would always fail at the SNT test

play18:45

almost immediately no matter what it's

play18:48

kind of wild but mostly the failures

play18:51

were random and sometimes y-cruncher

play18:53

would pass but compress-7zip in PTS

play18:55

would fail was very interesting oh and

play18:58

in case you're wondering one of the

play18:59

first things that I did in setting up

play19:01

both machines for both providers was to

play19:03

fully update the BIOS to whatever was

play19:05

current as of June 25th 2024 which were

play19:09

quite a lot of BIOS updates and that did

play19:11

help but did not fully resolve the

play19:13

issues in the end DDR5 4200 and

play19:16

disabling E-cores were the most drastic

play19:18

things that positively impacted

play19:20

stability but mostly disabling E-cores

play19:22

didn't have as much impact as making the

play19:24

memory run dog slow one last thing I'll

play19:27

leave you with on the Linux servers

play19:29

because the way that one of the

play19:31

particular games works it logs a number

play19:34

of Game World ticks per second and it

play19:37

turned out that I could use this to work

play19:38

backwards from Crash events sometimes

play19:40

when a system crashes the tick rate in

play19:43

the world would drop to about 50% of

play19:45

normal and this I couldn't attribute

play19:47

this to E-core problems I couldn't

play19:49

attribute this to power core uh Power

play19:51

problems see the power is actually

play19:52

logged at the socket by smart power

play19:54

strips in the data center which is very

play19:56

useful I couldn't figure out anything to

play19:58

correlate this to like thermal

play20:00

monitoring because we log that just

play20:02

there wasn't anything just every now and

play20:03

then the CPU gets miserably

play20:05

slow and for up to a minute before

play20:08

actual hard crash and I don't have any

play20:10

explanation for that or Theory as to why

play20:12

that might be for about half the systems

play20:15

that are working fine there's no such

play20:18

slowdown happening that is unexplainable
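The "work backwards from crash events" idea can be sketched roughly as follows. This is only an illustration: the log format (a list of timestamp and ticks-per-second pairs), the function name `slowdown_before_crash`, the 60-second window, and the 50% threshold are all assumptions chosen to match the description above, not the actual tooling used in the video.

```python
from statistics import median


def slowdown_before_crash(samples, crash_time, window=60.0, threshold=0.5):
    """Return True if the tick rate fell to `threshold` x baseline (or lower)
    at any point in the `window` seconds leading up to `crash_time`.

    samples: list of (timestamp_seconds, ticks_per_second) tuples.
             This log shape is hypothetical, chosen for illustration.
    """
    # Baseline: median tick rate from before the pre-crash window.
    baseline_vals = [tps for t, tps in samples if t < crash_time - window]
    if not baseline_vals:
        return False  # not enough history to judge
    baseline = median(baseline_vals)

    # Samples that fall inside the window leading up to the crash.
    recent = [tps for t, tps in samples
              if crash_time - window <= t <= crash_time]
    return any(tps <= threshold * baseline for tps in recent)


# Synthetic data: a steady 30 ticks/s, then 15 ticks/s shortly before a crash.
log = [(float(t), 30.0) for t in range(0, 300, 10)]
log += [(float(t), 15.0) for t in range(300, 350, 10)]

print(slowdown_before_crash(log, crash_time=340.0))
```

Running a check like this over every crash event would split the fleet into crashes preceded by a slowdown and crashes that were not, which is the comparison described above.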

play20:21

if Intel knows what the root cause is it

play20:23

could be that half of the CPUs that have

play20:25

the issue can be mitigated in software

play20:27

maybe that's it so maybe that's you know

play20:29

that leaves

play20:30

25% half of the half that are going to

play20:33

have to be physically replaced I don't

play20:35

know but you know the fact that we're

play20:37

still doing guess work months after the

play20:39

fact is is not ideal I mean the

play20:40

ambiguity here is the problem consider

play20:43

that when motherboard makers were caught

play20:45

juicing AMD CPUs to the point of

play20:48

catastrophic

play20:49

failure AMD was quick to make a

play20:51

statement that anyone affected by this

play20:53

issue would be made

play20:54

whole I really don't know that Intel has

play20:58

done the same in a clear and concise way

play21:00

like you want to take care of your

play21:01

enthusiasts first right I mean that's

play21:04

sort of the first and biggest killer of

play21:06

this situation it's the uncertainty

play21:07

Intel should step up with clear

play21:09

messaging that Gamers and enthusiasts

play21:12

with these CPUs with an affected CPU

play21:14

will be made whole with a replacement

play21:16

CPU if that's what it comes to if

play21:18

they're still experiencing issues after

play21:20

that the updates are in and the clock

play21:22

for getting you know updates in is almost

play21:25

out in my opinion customers that are

play21:28

buying systems and CPUs by the

play21:30

thousands um they don't seem to all have

play21:33

the same message I mean they think the

play21:35

problem rate is going to be like maybe

play21:37

10% if you're a smaller provider larger

play21:39

providers are being told a larger

play21:41

number from what I can tell W680

play21:43

experiencing problems with a similar

play21:44

error rate as desktop at least for the

play21:46

gaming stuff is interesting because the

play21:49

W680 power targets are so much more

play21:50

conservative eventually I think that the

play21:53

Enterprise customers and the corporate

play21:54

customers that are involved they're

play21:56

going to start comparing notes and

play21:58

they're going to start leaking things to

play22:00

randos on the internet that make videos

play22:03

to try to get some relief or to try to

play22:05

get you know more eyes on the problem

play22:07

especially if some percentage of CPUs

play22:09

are going to end up having to be swapped

play22:10

and it's not able to be fixed in

play22:12

software or microcode or anything else

play22:15

I

play22:16

merely I mean for this video really what

play22:18

I want is to point out that we do in

play22:20

fact have the data there's a lot of data

play22:22

from Steam games and there's a lot of

play22:23

data from from you know just games in

play22:25

general game Telemetry crash data the

play22:27

Windows event logs

play22:29

uh yeah it's a little prickly and Rough

play22:31

Around the Edges but you know I mean

play22:33

you're not going to be able to just do

play22:34

select star from where CPU you know from

play22:37

game problems where CPU is equal to

play22:38

13900K and then just look at the data uh
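One reason a bare query is not enough can be sketched with a toy database. Everything here is hypothetical: the table names `sessions` and `crashes`, the columns, the placeholder CPU labels, and the row counts are invented for illustration and are not taken from any real telemetry. The sketch shows only that raw crash counts and crashes per session can rank CPUs differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (cpu TEXT, n_sessions INTEGER);
    CREATE TABLE crashes  (cpu TEXT, error_type TEXT);
""")

# Synthetic rows, invented purely for illustration.
conn.executemany("INSERT INTO sessions VALUES (?, ?)",
                 [("CPU_A", 1000), ("CPU_B", 100)])
conn.executemany("INSERT INTO crashes VALUES (?, ?)",
                 [("CPU_A", "decompression")] * 20
                 + [("CPU_B", "decompression")] * 10)

# Naive view: raw crash counts. CPU_A looks worse only because it is
# far more common in this made-up population.
naive = dict(conn.execute(
    "SELECT cpu, COUNT(*) FROM crashes GROUP BY cpu"))

# Normalized view: crashes per session for each CPU model.
rate = dict(conn.execute("""
    SELECT s.cpu, COUNT(c.cpu) * 1.0 / s.n_sessions
    FROM sessions s LEFT JOIN crashes c ON c.cpu = s.cpu
    GROUP BY s.cpu, s.n_sessions
"""))

print(naive)
print(rate)
```

Real telemetry would also need filtering for confounders such as user overclocks, which a sketch like this does not attempt.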

play22:41

some of these crashes are because

play22:43

people are overclocking their system but

play22:45

the data center error rate is alarmingly

play22:47

high from these CPUs and there's got to

play22:50

be something to that I will not in the

play22:52

least be surprised if there is a deeper

play22:54

issue with the power voltage you know

play22:56

timing problem and maybe the CPUs have

play22:58

degraded over time some of them to the

play23:00

point that they can't be salvaged like

play23:02

that would not surprise me in the least

play23:04

at this point in time whereas you know a

play23:06

month ago I might have said ah it's

play23:08

probably just motherboards and microcode

play23:09

it's not I don't believe

play23:12

that that's the issue anymore and what I

play23:14

found was interesting and I think based

play23:15

on this video that uh smarter people are

play23:18

going to do their homework and go

play23:19

digging and try to find stuff and

play23:21

they'll probably find some really

play23:22

interesting stuff I'm Wendell this is Level

play23:24

One this has just been some

play23:26

rambling on what I found doing analysis

play23:29

of game failure databases and listening

play23:31

to some customers of Intel complain

play23:33

about things a lot of things oh and in

play23:36

case you're wondering that the game

play23:37

provider they're going with AMD 7950X

play23:39

systems for their new game

play23:42

servers they're being provisioned right

play23:44

now so that they have a little bit of

play23:45

breathing room to to move over their

play23:47

multiplayer servers so that they can

play23:49

troubleshoot and update BIOSes and do

play23:51

whatever else on the uh the Intel

play23:53

systems I mean they've already been

play23:54

doing the BIOS update square dance uh

play23:57

for a couple of months now just get

play23:58

tired of it I'm Wendell this is Level1 I'm

play24:00

signing out you can find me in the Level1

play24:01

forums

play24:05

[Music]

play24:14

[Music]

Related Tags
Intel CPUs, Gaming Issues, Server Stability, Hardware Bugs, Crash Analysis, Microcode Update, Performance Impact, Overclocking Concerns, Data Center Issues, AMD Alternative