NVidia is launching a NEW type of Accelerator... and it could end AMD and Intel

Coreteks
2 Jun 2024 · 20:39

Summary

TL;DR: Nvidia's recent developments in accelerator technology, as discussed in this video, point to a strategic move towards more efficient AI processing. The video digs into Nvidia's patent dump, revealing an upcoming accelerator designed to improve inference performance through techniques like vector-scaled quantization, pruning, and clipping. This could disrupt the market, offering significant speed improvements over current GPUs like Blackwell, and may be integrated into future products in various forms, from discrete cards to integrated accelerators in laptops and workstations.

Takeaways

  • Nvidia's Computex 2024 keynote provided updates on upcoming products, including Rubin, the successor to Blackwell, and Vera, a new CPU.
  • Nvidia first discussed this accelerator back in 2022; it is crucial to their strategic direction and is expected to be disruptive.
  • The video covers the technical side of number representation and its evolution in Nvidia GPUs, from FP32 to FP16, and the introduction of complex instructions like the Tensor Core's HMMA.
  • Nvidia's approach to balancing operation cost against data movement has been key to its success, allowing it to maintain a performance lead in AI workloads.
  • The accelerator is designed to improve inference performance, using techniques like vector-scaled quantization, pruning, and clipping to achieve high efficiency.
  • Nvidia's CUDA platform plays a significant role in enabling software to take advantage of hardware capabilities, including the new accelerator's features.
  • The accelerator is expected to be much faster at inference than current GPUs, potentially offering up to six times Blackwell's performance.
  • The new accelerator could be implemented in various ways: as a discrete PCIe card, as an integrated part of an SoC, or as part of a larger system like a superchip.
  • Nvidia has patented an API mechanism that allows seamless integration of the accelerator with existing systems, handling both GPU and accelerator work from a single call.
  • The implications of this technology extend beyond consumer devices to enterprise applications, potentially shaping the future of AI inference in both edge servers and client devices.
  • A follow-up video will explore the broader applications and the impact of this accelerator on the market and on existing players like Intel and AMD.

Q & A

  • What was the main topic of Nvidia's Computex 2024 keynote?

    -The main topic of Nvidia's Computex 2024 keynote was the introduction of Rubin, the successor to Blackwell, and Vera, a new CPU to succeed Grace. The keynote also discussed the strategy and future of Nvidia's accelerators.

  • What is the significance of the number representation changes in Nvidia's GPUs over the years?

    -The number representation changes in Nvidia's GPUs, such as the shift from 32-bit to 16-bit floating point and the introduction of 8-bit and 4-bit integer data types, have been significant for improving performance in AI workloads. Reduced precision means less data has to be stored and moved, which lowers bandwidth pressure and improves energy efficiency.

  • What is the purpose of the Tensor Core (HMMA) instruction in Nvidia's GPUs?

    -The HMMA instruction, which stands for Half-precision Matrix Multiply and Accumulate and is what Nvidia markets as the Tensor Core, is a complex instruction that performs many operations at once. It reduces the need for frequent data fetching from memory, improving energy efficiency and performance in AI workloads.

  • How does the introduction of the IMMA instruction benefit Nvidia's GPUs?

    -The IMMA instruction, or Integer Matrix Multiply and Accumulate, allows 8-bit and 4-bit integer data types to be used as inputs to matrix operations. This further reduces the precision and energy cost of operations, making Nvidia's GPUs more efficient for AI inference tasks.

  • What is the role of the new accelerator discussed in the script?

    -The new accelerator discussed in the script is designed to perform inference more efficiently than traditional GPUs. It uses techniques like vector-scaled quantization, pruning, and clipping to achieve high performance at reduced precision, making it suitable for AI services and edge devices.

  • How does the accelerator improve inference speed and efficiency?

    -The accelerator improves inference speed and efficiency by performing operations in a single cycle that would take multiple cycles on a traditional GPU. It also optimizes memory usage for specific data structures and operations, leading to high bandwidth and low energy consumption.

  • What are the potential implementations for the new Nvidia accelerator?

    -The potential implementations for the new accelerator include a discrete PCIe card for inference acceleration, an integrated accelerator in a system-on-chip (SoC), and as part of a board or platform similar to the Grace Blackwell superchip but potentially scaled down for use in laptops or other devices.

  • How does the accelerator handle API calls in a heterogeneous system?

    -The accelerator handles API calls in a heterogeneous system by automatically splitting the call along the pipeline into the GPU and the accelerator. This allows both components to share the same memory pool and ensures that programmers don't have to code specifically for the accelerator.

  • What is the potential impact of the new accelerator on the client PC market?

    -The new accelerator could significantly impact the client PC market by enabling more efficient and faster AI inference on edge devices. This could lead to a shift in control of the market, as companies that can effectively implement inference acceleration may dominate both the edge server market and client devices.

  • What are some of the applications where the new accelerator could be used?

    -The new accelerator could be used in a wide range of applications, including AI services like chatbots and virtual assistants, gaming for AI-driven features, and in professional fields such as data analysis and scientific research, where fast and efficient AI inference is crucial.

Outlines

00:00

Nvidia's Upcoming Accelerator and CPU Reveals

The video script discusses Nvidia's recent keynote at Computex 2024, which included details about the successor to Blackwell, named Rubin, and a new CPU called Vera. The script also mentions an accelerator that was not revealed during the keynote but is believed to be a significant part of Nvidia's strategy. The video is sponsored by URcdkeys.com, offering discounts on Windows 11 and Office 2019 keys. The script then delves into the technical side of Nvidia's GPUs, highlighting the evolution of number representation and the introduction of complex instructions like HMMA and IMMA, which have improved energy efficiency and performance in AI workloads. The accelerator discussed is expected to be a disruptive technology, aligning with the themes of Jensen's presentation at Computex.

05:01

Deep Dive into Nvidia's AI Performance Enhancements

This section analyzes Nvidia's advancements in AI performance, focusing on the evolution of GPU architecture and the introduction of complex instructions that reduce data movement and increase energy efficiency. The discussion covers the shift from 32-bit to 16-bit precision with the Pascal microarchitecture, the introduction of HMMA (the Tensor Core) with Volta, and the further reduction to 8-bit and 4-bit integer data types with the Hopper architecture. These changes have led to significant performance improvements, with number representation being the single biggest factor in AI workloads. The section also introduces the concept of a dedicated inference accelerator, which could outperform current GPUs like Blackwell in both speed and energy efficiency.

10:04

๐Ÿ” Nvidia's New Accelerator: A Game Changer for Inference

The script outlines Nvidia's development of a new accelerator designed to optimize inference for AI models, particularly large language models. The accelerator incorporates techniques such as vector-scaled quantization, pruning, and clipping to achieve high performance at reduced precision, allowing faster and more efficient inference on resource-constrained devices. The potential impact of this technology is discussed, suggesting that it could redefine the client PC market and pose a significant challenge to competitors like Intel and AMD. The accelerator's performance is compared to existing GPUs, highlighting its potential to be several times faster and more energy-efficient for inference tasks.

15:05

Exploring the Potential Implementations of Nvidia's Accelerator

The script speculates on the possible ways Nvidia could bring the new accelerator to market, including as a discrete PCIe card, an integrated accelerator in a system-on-chip (SoC), or as part of a larger board or platform similar to the Grace Blackwell superchip. It discusses patents that detail how the accelerator could work within a heterogeneous system, sharing memory pools and being automatically integrated with API calls from CUDA or other programming environments. The potential applications of the accelerator in various devices, from laptops to data centers, are also considered, highlighting its versatility and the broad impact it could have on the industry.

20:06

The Broad Impact of Nvidia's Accelerator on Future Applications

The final paragraph of the script hints at the wide-ranging applications of Nvidia's new accelerator, suggesting that it could be used in various fields, including gaming. The script invites viewers to subscribe for the continuation of the discussion and mentions upcoming coverage of Computex. It also encourages support through Patreon for the in-depth analysis provided, including the examination of patents and technical presentations related to Nvidia's technology.

Keywords

Nvidia

Nvidia is a multinational technology company known for its graphics processing units (GPUs). In the video, Nvidia's developments and innovations in AI and GPU technology are a central theme, showcasing their advancements and the impact on computing performance.

Accelerator

An accelerator is a specialized hardware designed to enhance the performance of specific tasks, particularly in AI and machine learning. The video discusses Nvidia's new accelerator, which aims to improve efficiency and speed in AI workloads by reducing data movement and increasing operational complexity.

Blackwell

Blackwell is a code name for Nvidia's upcoming GPU architecture. The video mentions Blackwell as the successor to previous architectures, highlighting its significance in Nvidia's ongoing advancements in AI and GPU technology.

Precision

Precision refers to the accuracy of numerical representations in computing. The video explains how Nvidia has improved AI performance by reducing precision, from 32-bit to 16-bit, and even lower, which helps in optimizing data processing and energy efficiency in AI models.

Tensor Core

Tensor Core is the name Nvidia uses for the hardware and complex instructions (such as HMMA) that perform many multiply-accumulate operations simultaneously. The video highlights the importance of these operations in AI workloads, since they reduce the need for frequent data fetching from memory and so improve energy efficiency.

Inference

Inference in AI refers to the process of making predictions or decisions based on a trained model. The video discusses Nvidia's focus on developing accelerators that enhance inference performance, which is crucial for applications like AI assistants and real-time data processing.

Quantization

Quantization is a technique used to reduce the precision of data in neural networks, improving computational efficiency without significantly affecting accuracy. The video explains how Nvidia's accelerators utilize quantization to enhance performance in AI inference tasks.

CUDA

CUDA is Nvidia's parallel computing platform and programming model. It allows developers to leverage the power of Nvidia GPUs for general-purpose computing. The video mentions CUDA in the context of how Nvidia's software ecosystem supports its hardware innovations.

AI Workloads

AI workloads refer to the computational tasks involved in training and running AI models. The video explores how Nvidia's hardware advancements, particularly in GPUs and accelerators, are designed to handle these workloads more efficiently, driving significant performance improvements.

Energy Efficiency

Energy efficiency in computing refers to the effective use of power to perform computational tasks. The video discusses how Nvidia's innovations, such as reducing data movement and increasing instruction complexity, have led to significant improvements in the energy efficiency of their AI accelerators and GPUs.

Highlights

Nvidia's Computex 2024 keynote revealed information on Rubin, the successor to Blackwell, and Vera, a new CPU to succeed Grace.

The video is sponsored by URcdkeys.com, offering OEM keys for Windows 11 and Office 2019 at discounted prices.

Nvidia's recent patent dump relates to an accelerator first presented in 2022, which is key to the company's strategy going forward.

The accelerator is designed to be one of the most disruptive technologies ever launched by Nvidia.

Nvidia's GPUs have seen a gradual change in number representation, starting from native support for FP32 to FP16, and introducing complex instructions like Tensor Cores.

The introduction of complex instructions like HMMA has significantly improved energy efficiency in Nvidia's GPUs.

Nvidia's strategy involves reducing the precision of operations in AI workloads to increase performance without sacrificing accuracy.

The new accelerator relies on techniques like vector-scaled quantization and optimal clipping to achieve greater inference performance.

The accelerator can perform operations in one cycle that would take tens or hundreds of cycles on Nvidia's current GPUs.

Nvidia's accelerator is designed to be a drop-in solution for various applications, potentially dominating the edge server and client device markets.

The accelerator could be implemented as a discrete PCIe card, an integrated accelerator in an SoC, or part of a board or platform.

Nvidia patented an API that automatically handles calls between the GPU and the new inference accelerator, sharing the same memory pool.

The implementation of the accelerator is kept open to interpretation, covering various bases without revealing too much detail.

The accelerator could redefine control of the client PC market in the next decade, posing a significant challenge to Intel and AMD.

Nvidia's approach to inference acceleration could make it easier to implement and potentially dominate both the edge server and client device markets.

The upcoming video will cover the broad scope of applications for the accelerator, including its potential impact on gaming.

The video also discusses the rest of Computex and Nvidia's position in the training hardware market.

Transcripts

00:00

Nvidia had the Computex 2024 keynote a few hours ago, so I delayed this video, which I had planned to release earlier, to add any relevant information Nvidia revealed. While Nvidia did not reveal the accelerator that I'm going to discuss today, they did reveal some information on Rubin, the successor to Blackwell, and Vera, a new CPU to succeed Grace, and that is where this accelerator will fit in. There's a lot to discuss, so I'll have to split this video into two parts. So let's dive in.

00:31

This video is sponsored by URcdkeys.com. If you pay full price for a Windows 11 key you are wasting your money; instead you can get a Windows 11 OEM key from URcdkeys.com for just over $29, or even lower at $21.86 if you use my coupon code c25 at checkout. All URcdkeys product keys are on sale right now, so don't miss the super spring sale. Follow the link below to the Windows 11 page at URcdkeys.com, click purchase, enter your c25 discount code, pay with a credit card or PayPal, and then just add the Windows key to your Windows activation settings and you're done. The code is sent to you within minutes. By the way, you can also use my c25 code on other products, so if you need Office 2019, make sure to get it from URcdkeys.com for a much lower price than retail. Check the exclusive links in the video description to get your cheap OEM Windows or Office keys today from URcdkeys.com.

01:33

Nvidia had a big patent dump this week, and as I looked through the drawings and descriptions I remembered an accelerator that Nvidia talked about back in 2022 at the VLSI Circuits conference. Before I explain what this accelerator does, we have to get a little bit technical for a couple of minutes, just so you understand the context for such a processor to make sense and why I believe Nvidia is launching it next year. This accelerator is key to their strategy going forward; in fact, underlying Jensen's presentation at Computex were all the elements that will make this accelerator one of the most disruptive technologies that Nvidia has ever launched.

02:12

If we look back at the GPUs Nvidia has launched over the last 12 years or so, from Kepler to Blackwell, which will launch later this year, one of the paradigm changes that was gradual but constant was the change in natively supported number representation. In Kepler, operations were scalar at FP32, so that's 32-bit, and this was around 2012, which was when Nvidia started porting models to their GPUs to see if the parallelism would benefit AI workloads. Obviously it did, and it has made Nvidia one of the biggest companies in the world in just over a decade. Now, Kepler was already out and the next gen was already taped out, so it took a while until we started to see number representation changes, and FP16 was only natively supported in 2016, when the Pascal microarchitecture launched in the form of the P100 GPU. So Nvidia reduced precision from 32-bit to 16-bit. Why does that matter? Well, in AI workloads one of the low-hanging fruits for achieving greater performance was reducing precision. I won't get into how that works, but at the algorithm and API level you can reduce the precision of the input and achieve the same results with less data, which means you don't get choked on bandwidth, but more importantly you don't waste as much energy moving data around. Reducing precision works wonderfully in AI workloads, where you can prune the network of unnecessary data and focus only on the things that are likely to produce the results you need.
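To make the data-movement argument concrete, here is a minimal sketch of mapping FP32 values to INT8 with a single scale factor (toy NumPy code, not anything from Nvidia; the matrix size and the symmetric-scale choice are illustrative assumptions):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map FP32 values to INT8 with one symmetric scale factor."""
    scale = np.abs(x).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(1024, 1024).astype(np.float32)   # toy weight matrix
q, scale = quantize_int8(weights)

print(weights.nbytes, "bytes as FP32")       # 4,194,304
print(q.nbytes, "bytes as INT8")             # 1,048,576 -> 4x less data to move
print("max abs error:", np.abs(weights - dequantize(q, scale)).max())
```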

03:42

So after Pascal, Nvidia launched the V100 in 2017 and introduced its now-famous Tensor Core. A Tensor Core is just a fancy marketing name for an instruction — a complex instruction called Half-precision Matrix Multiply and Accumulate, or HMMA. Before, you had adds and multiplies; now you have a complex instruction that does a bunch of stuff all at once. What that means is that you don't have to keep fetching data from memory. In other words, if the instruction is simple but you have to keep fetching data from memory, you end up using a ton of energy on data movement, whereas the operation itself uses very little energy. I've explained this in past videos, so it makes sense from Nvidia's perspective to have the instruction do more operations in one go and fetch the data fewer times, and that's what a Tensor Core does: it's an instruction that does a lot of things at once instead of just an add or a multiply. To give you an idea, back in the days of fused multiply-add operations, in the generations prior to Pascal, doing a fetch and decode — that is, the data movement — cost 30 picojoules. It was about 20 times more costly to do that data movement than to actually do a multiply-add operation once you had the data, which cost a mere 1.5 pJ. So that's a 2,000% overhead because of moving that data around: it's only 1.5 pJ to do the operation in logic, but it costs 30 pJ to get the data from memory to do that operation. But with the introduction of this complex instruction, HMMA, or what Nvidia's marketing calls the Tensor Core, the operation cost went up significantly, because the logic is doing a lot more operations in one go, while the cost of moving the data was reduced massively. Now, against the same 30 pJ fetch and decode, the operation costs 110 pJ, and since the data doesn't have to go back and forth for the rest of the operations to conclude, the overhead is only 22% instead of 2,000%. It's a massive improvement in energy efficiency.
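The mechanism behind that saving can be sketched in a few lines: one fused matrix-multiply-accumulate consumes whole tiles at once instead of issuing a separate fetch for every scalar multiply and add. This is plain NumPy standing in for what the hardware instruction does; the 4x4 tile shape is only illustrative:

```python
import numpy as np

# Scalar view: every multiply and add is issued separately, and each operand
# would need its own fetch if this were hardware.
def scalar_mma(A, B, C):
    D = C.copy()
    for i in range(4):
        for j in range(4):
            for k in range(4):
                D[i, j] += A[i, k] * B[k, j]   # 64 multiplies and 64 adds, one by one
    return D

# Fused view: one "instruction" consumes whole 4x4 tiles and produces D = A @ B + C,
# which is the shape of work a Tensor Core's HMMA performs on FP16 inputs.
def fused_mma(A, B, C):
    return A @ B + C

A = np.random.randn(4, 4).astype(np.float32)
B = np.random.randn(4, 4).astype(np.float32)
C = np.zeros((4, 4), dtype=np.float32)

assert np.allclose(scalar_mma(A, B, C), fused_mma(A, B, C))
print(fused_mma(A, B, C))
```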

05:42

So you have this balance between how complex you make the instruction — which results in the logic being active longer and using more energy per operation — and how much data you move around. Nvidia has so far figured that the cost of moving data around, in a 2.5D package at least, is much higher than the actual operations, and has therefore added more and more complex instructions to their GPUs over the years, always with a fancy marketing name. From Volta to Turing to Ampere and then to Ada, nothing much changed at the instruction level; what changed were the matrix sizes.

06:18

The next real big step was when another complex operation was introduced in the form of IMMA, which stands for Integer Matrix Multiply and Accumulate, when Hopper launched. With Hopper, Nvidia introduced 8-bit and 4-bit integer data types as inputs for those matrices. Nvidia's marketing calls this the Transformer Engine, but like I said, that name is just marketing for another complex instruction. Now notice what happens to the energy cost of this new complex instruction in Hopper: the same 30 pJ fetch and decode now represents only a 16% overhead, even though the operation itself is more costly to execute, at 160 pJ. And this is the very reason why dedicated accelerators from other companies haven't been able to catch up with Nvidia: because of this balancing act between operation cost and data movement, which Nvidia has gotten right. This approach also means that Nvidia can circumvent bandwidth limitations to some extent, like I said earlier, because you are making fewer calls to memory. So this reduction in precision is achieved by natively supporting these different types of number representation, now down to INT8 and INT4.
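As a rough sketch of what an integer matrix-multiply-accumulate computes: 8-bit inputs are multiplied and summed into a wider accumulator so the totals don't overflow. The 32-bit accumulator and the tile size below are common practice and illustrative assumptions, not a claim about the exact hardware:

```python
import numpy as np

def imma_tile(a_int8: np.ndarray, b_int8: np.ndarray) -> np.ndarray:
    """Multiply 8-bit integer tiles, accumulating into 32-bit integers."""
    return a_int8.astype(np.int32) @ b_int8.astype(np.int32)

a = np.random.randint(-128, 128, size=(16, 16), dtype=np.int8)
b = np.random.randint(-128, 128, size=(16, 16), dtype=np.int8)

acc = imma_tile(a, b)                      # each input moved as 1 byte instead of 4
print(acc.dtype, acc.min(), acc.max())     # int32, well within range for a 16-deep dot product
```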

play07:27

take a bir eye view of where per

play07:29

performance jumps have come from since

play07:31

kapler up until hoer in workloads where

play07:34

Precision can be reduced number

play07:35

representation accounted for the

play07:37

greatest leap at

play07:39

32x when you take into account

play07:41

parasitics then the introduction of

play07:43

complex instructions resulted in a

play07:45

performance jump of around 12.5 x the

play07:48

process node is hard to say but at least

play07:50

3x and sparity another 2 a so that's why

play07:54

you hear Jensen on stage saying they

play07:55

achieve the 1,000x performance uplift in

play07:59

AI in 8 years the cumulative effect of

play08:02

all these changes are what led to that

play08:05

and with Blackwell well it's two Hoppers

play08:07

fused together so you'll get another 2x

play08:09

so you see that number representation

play08:11

has been the loow hanging fruit of

play08:13

achieving greater performance in this

play08:15

last decade of course for all of this to

play08:17

work you need the software to take

play08:19

advantage of it and that's where

play08:21

nvidia's Cuda mode plays its part but

08:24

But now that we're down to four bits, and if Blackwell is pretty much just two Hoppers fused together, where is Nvidia headed next? At Computex, a few hours before this video goes live, Jensen hammered on the fact that acceleration will be key, both in future PCs and in data centers. But key to what? Key to inference.

08:48

Well, back to that accelerator from 2022. As far as number representation goes, the next step will be support for Log8, which is a log 4.3 format along with a plus or minus sign. I won't go into that here, but if you are on the Discord server I'll be happy to explain there how this format works and why it's more performant than INT4 in Blackwell, which comes out later this year. Anyway, more importantly, in addition to the work being done for training, Nvidia came up with this dedicated accelerator back in 2022, and it relies on two techniques to achieve greater performance in inference. These techniques are scaling and optimal clipping, to run large language models, for instance, without any loss of accuracy. As you can see in this graph from Nvidia, with quantization you can map high-precision formats like FP32 to lower-precision formats like INT8. The reason you would do this is to achieve greater inference speed on resource-constrained devices — you know, devices like laptops. The particular technique that Nvidia has shown in recent papers is VSQ, or vector-scaled quantization; so that is the first technique. The second one is clipping, like I said, which is another form of reducing precision without losing accuracy.
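Both techniques are easy to sketch. Per-vector scaling keeps a separate scale factor for each small vector instead of one for the whole tensor, and optimal clipping picks a maximum below the true maximum so quantization levels aren't wasted on rare outliers. The code below is a toy NumPy illustration under those assumptions (brute-force clip search; Nvidia's papers derive the threshold analytically, which this sketch does not attempt):

```python
import numpy as np

def quantize(x, scale, levels=7):
    """Symmetric INT4-style quantizer: levels=7 gives the integer range [-7, 7]."""
    return np.clip(np.round(x / scale), -levels, levels) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64)).astype(np.float32)        # toy weight tensor

# 1) One scale for the whole tensor (baseline).
per_tensor = quantize(x, np.abs(x).max() / 7)

# 2) Vector-scaled quantization: one scale per 64-element vector (a row here).
row_scale = np.abs(x).max(axis=1, keepdims=True) / 7
per_vector = quantize(x, row_scale)

# 3) Optimal clipping: search for a clip threshold below the true max that
#    minimizes quantization error, trading rare outliers for finer steps.
cands = np.linspace(0.5, np.abs(x).max(), 100)
clip = min(cands, key=lambda c: np.mean((x - quantize(x, c / 7)) ** 2))
clipped = quantize(x, clip / 7)

for name, y in [("per-tensor", per_tensor), ("per-vector", per_vector), ("clipped", clipped)]:
    print(f"{name:10s} MSE = {np.mean((x - y) ** 2):.5f}")
```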

10:01

So anyway, what does all this technical jargon mean in practical terms? The accelerator that Nvidia created does precisely these optimizations at the hardware level. An accelerator takes special data types and does special operations that are not natively supported on Nvidia's GPUs, which means it can do in one cycle what would take tens or hundreds of cycles on Blackwell. An accelerator can also do massive parallelism, but with locality, so it's massively performant compared to a GPU with its complex memory system. You also get memory optimized for those specific data structures and operations, with high bandwidth and low energy. And the key ingredient is that you can do software and hardware co-design, so the algorithm can be specifically tailored to the hardware. As you can imagine, this is Jensen's wet dream: yet another moat to force people into using nothing but Nvidia, this time for inference. Nvidia's GPUs are great for AI currently because they are programmable; they are flexible; they can adapt to new models and algorithms. But for micro-operations that are to some degree agnostic to developments in models, Nvidia can build accelerators to improve performance, particularly in inference, which is where the majority of computation will move to in the next five years, while training decelerates.

11:20

So this is the layout of the test accelerator that Nvidia taped out — on spare area of one of their network switches, on 5 nm — and it served as a prototype for what Nvidia will be bringing to market, and which I believe could truly put Intel and AMD in a tough spot. This accelerator does vector-scaled quantization, pruning, and clipping in order to support INT4 operations, so it can run large language models without losing any accuracy at five times the speed of Blackwell: almost 100 TOPS per watt versus 20 TOPS per watt on the B100 GPU that's coming out this year. Now note that this was on 5 nm, so when Rubin comes out — that will be the R100, either late next year or early 2026 — presumably this accelerator will be on 3 nm, so possibly six times as fast as Blackwell in inference workloads. It's also possible that Nvidia will include it in the B200 next year instead.

12:19

So when you use an AI service, like a text prompt to do search, you enter your prompt; that text is then tokenized, so the whole text, or some of the words, or parts of words, are split into smaller units called tokens, which are the unit of data the AI model can understand. These tokens are converted into a numerical format — FP32, INT8, INT16, INT4 and so on — and these are the number representations that GPUs work on to perform computation in the context of inference.
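As a toy illustration of that pipeline — text to token ids to low-precision numbers — here is a sketch. The whitespace tokenizer and tiny embedding table are stand-ins; real systems use learned subword tokenizers and far larger models:

```python
import numpy as np

vocab = {"what": 0, "is": 1, "an": 2, "accelerator": 3, "<unk>": 4}

def tokenize(text: str) -> list:
    """Split text on whitespace and map each word to an integer token id."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

rng = np.random.default_rng(0)
embeddings_fp32 = rng.standard_normal((len(vocab), 8)).astype(np.float32)

# Quantize the embedding table to INT8 so the device moves 4x less data.
scale = np.abs(embeddings_fp32).max() / 127.0
embeddings_int8 = np.round(embeddings_fp32 / scale).astype(np.int8)

ids = tokenize("What is an accelerator")
print(ids)                     # [0, 1, 2, 3]
print(embeddings_int8[ids])    # the low-precision numbers the hardware actually works on
```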

12:50

So when you are doing queries locally, on your computer or phone, you don't have the compute resources to do complex operations, so they have to be simplified and accelerated by fixed-function units like this accelerator that Nvidia has prototyped. Now, you might be saying: don't queries just get sent to the cloud and processed there? Why do I need them done locally? For one, there are sensitive security issues with personal data or company data that you don't want sent off premises. Secondly, you can think of service providers as AI clients themselves, and these accelerators will also be deployed there, as a sort of intermediary level between yourself and the data center, in order to speed things up massively. And thirdly, it's going to be much faster to do most of the computation locally and get an almost immediate response than to send all the tokens to a data center, have the processing done there, and then fetch the results. You want AI agents — you know, ChatGPT and virtual assistants and whatnot — to respond immediately to your requests, so you need the operations to be accelerated locally. Without the resources to do full precision, you need this accelerator to use the techniques I mentioned to reduce the number representation down to something like INT4, so that even a phone or a laptop can perform operations that would normally require a bunch of powerful GPUs. Like I said, one of these accelerators, locally on your PC, would be around six times faster at inference than a Blackwell GPU doing the same operation.

14:16

It's funny that in a way we're coming full circle. If you were around in the days of CPUs doing the 3D rendering in games, and then saw the massive jump in performance that GPU accelerators brought to that rendering, you can think of GPUs now having the role CPUs had back then, and of this new accelerator being as disruptive as GPUs were back then — except instead of accelerating 3D rendering in games, the workload is now inference. And you can bet that whoever gets inference right, and makes it easy to implement, will dominate both the edge server market and client devices. Remember how over the past year I've been saying that Nvidia will come to the PC market as a disruptor and leave the incumbents in a tough spot? Well, that's how important this accelerator is. It could define who controls the client PC market over the next 10 years. Intel and AMD are not prepared for this; even if they have the hardware to compete, Nvidia has the verticals — that is, the software — that will make this accelerator a drop-in solution for a bunch of applications.

15:20

To wrap part one up: how exactly is Nvidia bringing this accelerator to market? There are three options. First, as a discrete PCIe card, similar to a discrete GPU that you slot into your system to accelerate graphics, except in this case it would accelerate inference. Secondly, as an integrated accelerator in an SoC, so a heterogeneous chip similar to an APU, or to the Apple M chips, which also feature a bunch of accelerators. And finally, as part of a board or platform similar to the Grace Blackwell superchip, but scaled down, possibly down to a laptop platform.

15:55

Looking at the recently published patents on this accelerator, the first one, titled "Application programming interface to transfer information between accelerator memory," shows the first embodiment of this accelerator, which is in a heterogeneous processor. In this 110-page patent, Nvidia is basically describing how different accelerators can access the same memory pool in a heterogeneous system. In essence, an API call from CUDA, ROCm, or oneAPI is automatically split along the pipeline between the GPU and the accelerator, so that programmers don't have to code specifically for this accelerator. In other words, Nvidia patented a way for an API call to be automatically handled by the GPU, with specific code generated for the accelerator, and with both sharing the same memory. Nvidia goes on to detail several possible variations of this and states that it can be implemented both in a single device — that could be, for example, a laptop — and in a distributed computer system. The second relevant patent, published on the same date, is complementary and handles API call errors; it essentially defines where in memory any errors will be stored in such a system. The third and final associated patent describes how processing occurs in such a system. Like I was saying, the API call is put into a stream that has all the instructions, and then some parts of the stream are allocated to the GPU while other parts go to the accelerator. Note that the patent states that the instructions in the stream that are meant for the accelerator can be distributed across multiple accelerators, so a possible implementation is having more than one of these accelerators to speed up inference even more. We could see this being useful in edge servers, for instance, but possibly also in client devices.
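None of that plumbing exists as public code, but the dispatch idea the patents describe can be mocked up: one call produces a stream of operations, a runtime routes each piece to the GPU or to an inference accelerator, and both see the same memory pool. Everything below — the class names, the routing rule, the op names — is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class SharedPool:
    """Stand-in for a memory pool visible to both the GPU and the accelerator."""
    buffers: dict = field(default_factory=dict)

@dataclass
class Op:
    name: str       # e.g. "matmul_fp16" or "vsq_int4_matmul" (made-up op names)
    args: tuple

def submit(stream, pool):
    """Split one API-level call's stream of ops between GPU and accelerator."""
    for op in stream:
        # Hypothetical routing rule: low-precision inference ops go to the accelerator.
        target = "accelerator" if op.name.startswith("vsq_") else "gpu"
        print(f"{target}: {op.name}{op.args} using shared pool {id(pool):#x}")

pool = SharedPool()
submit([Op("matmul_fp16", ("a", "b")), Op("vsq_int4_matmul", ("tokens", "w0"))], pool)
```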

17:45

Now, obviously I'm simplifying all of the possible hardware implementations here, out of the almost 400 pages of patents that I went through, but you can see that Nvidia has kept the accelerator implementation wide enough — or rather, open to interpretation — so as not to reveal too much while still ensuring they have all their bases covered. Personally, I think it's unlikely that Nvidia would release a PCIe-type card just for inference, even though that would be awesome. While not outside the realm of possibility, I think it's more likely that future generations of gaming GPUs will include this accelerator in some form, at least the high-end GPUs. So just like today you can use CUDA to accelerate, say, video rendering in Adobe Premiere with your RTX 4070 or 4090, in the future you could use this inference accelerator to speed up some of the verticals that we will look at in the second part of this video. I do hope Nvidia releases a PCIe-type card for the consumer market — how awesome would that be? Another piece of hardware we can lust over, spend an insulting amount of money on, and then never use, PC-master-race style.

18:47

So the other two implementations are, I think, the most likely approaches. That's having a board-wide system similar to the Blackwell superchip, where Nvidia could have an Arm CPU — either developed internally or licensed — a GPU chip, and then the inference accelerator, all sharing memory; this could be used for laptops or workstations, for instance. And finally, the most traditional and, I guess, most likely application would be an SoC in the vein of a large APU, with Arm cores, Nvidia GPU cores, and this inference accelerator to do the heavy lifting of AI inference locally. Depending on how large and power-hungry it is, there could be a variant that goes into smaller devices and not just laptops — perhaps even a phone, or a new Shield tablet, or a virtual reality headset similar to the one Apple made and that everyone has already forgotten about, or a mini PC, especially with Dell hinting this week that they will be selling AI workstations and AI thin clients or PCs next year with Nvidia hardware included.

19:48

So in part two we will look at why Nvidia is "wasting" time with this when they are already banking massively on training hardware, selling millions of GPUs at massive margins, and also at what applications we will see this used in. I don't think people realize how broad a scope such an accelerator will have and the sorts of applications it can make viable — and that includes games. So subscribe to the channel right now so you don't miss part two, and this coming week I'll also be covering the rest of Computex, so stay tuned for that. Consider joining my Patreon to support all this work of reading 400 pages' worth of patents and going through Nvidia's obscure presentations at circuits conferences; for just $2 a month you will also get access to the Coreteks Discord server. Thanks for watching, and until the next one.


Related Tags
Nvidia AI, Accelerator Tech, Inference Efficiency, CPU Advancements, Quantization, Clipping Technique, AI Workloads, Hardware Innovation, Software Adaptation, Computex 2024, Tech Analysis