Networking for GenAI Training and Inference Clusters | Jongsoo Park & Petr Lapukhov

@Scale
7 Sept 2023 · 23:00

Summary

TLDR In this talk, Jongsoo and Petr discuss the impact of large language models (LLMs) on system design, particularly the design of the network subsystem. Compared with earlier recommendation models, LLMs require far more compute for both training and inference. Finishing training in a reasonable amount of time requires tens of thousands of accelerators, which creates challenges for the network subsystem. They cover parallelization techniques such as data parallelism and model parallelism, and explain how these techniques produce diverse communication patterns that place new demands on network design. They also discuss the low-latency requirements of inference and how distributed inference is needed to meet them. Finally, they emphasize the future need for much larger GPU clusters and for more than 30 exaflops of compute to enable faster training of more complex models.

Takeaways

  • 🚀 Generative AI is one of the hottest topics today; it is about creating new, realistic content, unlike AI models that only understand existing information.
  • 📈 Generative AI opens up huge new opportunities in applications such as image, video, and text generation.
  • 📚 Since 2015, the massive growth in available compute and advances in network technology have been key enablers of progress in generative AI.
  • 🌱 Meta has contributed significantly to generative AI, for example generating convincing images from text prompts.
  • 💻 Large language models (LLMs) pose new challenges for system design, especially the network subsystem.
  • ⚙️ LLM training and inference require far more compute, which creates new demands on the network subsystem.
  • 📊 Compared with recommendation models, LLMs require multiple orders of magnitude more compute.
  • 🔍 LLM training needs tens of thousands of GPUs to finish in a reasonable amount of time.
  • 🔗 Meta's latest LLMs, such as the 70-billion-parameter Llama 2, are trained on 2 trillion tokens, roughly 1.7 million GPU hours.
  • 🔄 To train efficiently, different parallelization schemes are needed: data parallelism, model parallelism, and pipeline parallelism.
  • 🔍 Inference has also become an interesting system-design problem, because it requires low latency and high memory throughput.
  • 🌐 As models and data volumes grow, the network subsystem faces more challenges and needs more capable network hardware and architectures.

Q & A

  • How does generative AI differ from traditional AI models?

    - Generative AI focuses on creating and generating new, realistic content, whereas traditional AI models are typically used to understand existing information, such as image classification and segmentation. The main difference is that generative AI produces new content rather than interpreting existing content.

  • How far back does generative AI go?

    - Generative AI goes back to 2015, when Geoff Hinton's lab at the University of Toronto showed a generated image of a bowl of bananas on a table.

  • Which technical advances had a major impact on image and text generation?

    - DALL-E and Stable Diffusion had a major impact on image generation, while GPT had a major impact on text generation.

  • Why do large language models (LLMs) need so much compute for training and inference?

    - LLM training and inference process enormous amounts of data with very large models, so a huge number of accelerators (such as GPUs) is needed to finish training in a reasonable amount of time, and inference also needs substantial compute to deliver a good user experience (a rough calculation is sketched right after this Q&A).

  • What network-subsystem challenges did Meta face in LLM training?

    - The challenges include needing a very large number of accelerators for training, and needing distributed inference at serving time, which requires the network subsystem to handle large data transfers at low latency.

  • Why are different model-parallelization techniques needed when training LLMs?

    - Data parallelism alone is no longer sufficient for LLM training, so additional schemes such as model parallelism or pipeline parallelism are needed, and combining them produces diverse communication patterns across multiple dimensions.

  • Which network technologies does Meta use for LLM training?

    - Meta uses both RoCE and InfiniBand fabrics for LLM training; the RoCE fabric in a production cluster achieved speed and scalability similar to InfiniBand.

  • Why has LLM inference also become a networking problem?

    - Models have grown so large that they no longer fit in the memory of a single GPU or even a single host, so inference has to be spread across multiple systems, which turns it into a networking problem.

  • What are the characteristics and challenges of data parallelism and model parallelism in LLM training?

    - Data parallelism maps onto the large scale-out domain; its challenge is that message sizes shrink as the scale grows, so latency becomes more visible. Model parallelism demands much higher bandwidth and is harder to overlap with compute, so it is more sensitive to both latency and bandwidth.

  • Why do failures and reliability matter for LLM training?

    - LLM training involves a huge number of hardware and software components, and as the system scales up, failures occur more often. Fault isolation and debugging also take longer in a large system, which hurts training efficiency and reliability.

  • What is Meta's vision for future LLM training?

    - Meta's vision is to reach more than 30 exaflops of compute, which would shrink LLM training time from about a month to less than a day, accelerating innovation and enabling more complex models trained on more data.
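
The figures in these answers can be sanity-checked with the widely used 6 × parameters × tokens approximation for training FLOPs. The sketch below is illustrative only: the per-GPU peak, utilization, and sustained cluster throughput are assumptions, not numbers from the talk.

    # Back-of-envelope training-cost check using the 6 * params * tokens rule of thumb.
    # Peak throughput and utilization values below are assumptions for illustration.

    PARAMS = 70e9                                    # Llama-2-70B-scale model
    TOKENS = 2e12                                    # 2 trillion training tokens
    TOTAL_FLOPS = 6 * PARAMS * TOKENS                # ~8.4e23 FLOPs

    # Per-GPU view: A100-class peak ~312 TFLOP/s (BF16), assumed 40% utilization.
    sustained_per_gpu = 312e12 * 0.4
    gpu_hours = TOTAL_FLOPS / sustained_per_gpu / 3600
    print(f"~{gpu_hours/1e6:.1f}M GPU-hours, ~{gpu_hours/2000/24:.0f} days on 2,000 GPUs")

    # Cluster view: wall-clock time at two assumed sustained throughputs.
    for name, eflops in [("~2,000 GPUs today", 0.25), ("30-exaflop vision", 30.0)]:
        hours = TOTAL_FLOPS / (eflops * 1e18) / 3600
        print(f"{name}: ~{hours:.0f} hours")

With these assumptions the numbers land close to the talk's figures: roughly 1.9 million GPU-hours, more than a month on about 2,000 GPUs, and well under a day at 30 sustained exaflops.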

Outlines

00:00

🚀 Introducing generative AI and its use at Meta

Jongsoo from Meta introduces the concept of generative AI, which, unlike AI models that understand existing information, focuses on creating new content. Applications include image, video, and text generation. He recalls the low-resolution banana image generated in 2015 by Geoff Hinton's lab at the University of Toronto, and the breakthroughs in image and text generation that followed, such as DALL-E, Stable Diffusion, and the GPT models. He emphasizes the role of compute capability and network technology in driving generative AI forward, and Meta's contributions, such as generating convincing images from prompts. He then turns to the implications of large language models (LLMs) for system design, especially the network subsystem.

05:01

📈 Training and inference demands of large language models

Jongsoo continues with the training and inference demands of LLMs. Meta's latest large language model, the 70-billion-parameter Llama 2, was trained on 2 trillion tokens, amounting to roughly 800 zettaFLOPs and about 1.7 million GPU hours. He also highlights that a recent model was trained in a production cluster on a RoCEv2 network fabric, showing that LLM training can be democratized with more commodity network hardware. He stresses that the amount of data fed into these models keeps growing exponentially, and that meeting future compute demand points toward clusters of around 32,000 GPUs and beyond. He then discusses the parallelization challenges of training large models, including data parallelism, model parallelism, and pipeline parallelism, and notes that combining these schemes produces diverse communication patterns, an interesting challenge for network design.

10:04

🤖 The deep impact of large language models on system design

Petr, speaking as a network engineer, digs into the effect of large language models on network topologies and other fabric parameters. He highlights the dramatic increase in compute requirements going from ranking models to LLMs, which means much larger clusters must be built to support training. He describes the two main domains of a large cluster, scale-out and scale-up, and explains how they map to data-parallel and model-parallel traffic respectively. He covers the challenges created by slicing the network into pieces for model parallelism and passing activations between those pieces. He also discusses the latency that shows up in large clusters and how models must be optimized to effectively overlap network communication with compute.

15:05

🔍 Challenges and troubleshooting in large clusters

Petr continues with the challenges of troubleshooting large clusters. As clusters grow, the number of components increases and failures happen more often, which makes fault isolation harder and more time-consuming and cuts into training time. He then turns to inference: as models grow, they no longer fit in the memory of a single GPU or host, so distributed inference is required. He predicts that future inference clusters will contain on the order of 16 to 64 GPUs, and that inference becomes a networking problem that has to run across multiple systems.

20:07

🌟 The outlook for large language models

Finally, Petr sums up the enormous impact of large language models on compute demand, which drives the need for larger clusters, larger inference fabrics, and optimized network topologies. He stresses that scale-up connectivity now needs to extend beyond a rack or a node, the biggest change in network topologies in the past few years, driven by the bandwidth required for model parallelism. He also notes that inference has become a problem that must run across multiple systems, similar to training but executing only the forward pass. Petr and Jongsoo close by taking questions from the audience.

Keywords

💡Generative AI

Generative AI is a class of AI that creates novel, realistic content. It differs from earlier AI models that were mainly used to understand existing information, such as image classification and segmentation. Generative AI opens up new applications in video, image, and text generation. For example, in 2015 Geoff Hinton's lab at the University of Toronto showed a generated image of a bowl of bananas on a table; the resolution was low, but it marked an important step for generative AI.

💡Large Language Models (LLMs)

Large language models are AI models with a huge number of parameters that can process and generate text. They pose new challenges for system design, especially the network subsystem: they push the limits of infrastructure and require enormous compute for both training and inference. For example, Meta's latest Llama 2 model has 70 billion parameters and was trained on 2 trillion tokens.

💡Compute Capability

Compute capability refers to the capacity to execute computational work, and it is critical for training and running large language models. As model sizes and data volumes grow, so does the compute required; tens of thousands of GPUs may be needed to finish LLM training in a reasonable amount of time.

💡Network Technologies

Network technologies play a key role in connecting many accelerators to deliver the required compute. RoCE is one example: it was used in a production cluster to train a 34-billion-parameter model and showed speed and scalability similar to InfiniBand.

💡Model Parallelism

Model parallelism splits a large model into pieces that run in parallel on different hardware. It is essential for LLM training because it lets model size exceed the memory limits of a single device, but it introduces communication and coordination challenges because the pieces must exchange data.

💡Data Parallelism

Data parallelism runs copies of the model in parallel over different inputs. It maps naturally onto the scale-out topology when scaling to large numbers of GPUs, and it is usually combined with model parallelism to handle the scale of large language models.
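
A minimal sketch of how the two parallelism styles described in the entries above show up as collectives on the fabric, assuming PyTorch's torch.distributed launched with torchrun; the backend choice and tensor sizes are illustrative, not Meta's production setup.

    # Run with: torchrun --nproc_per_node=2 parallelism_sketch.py
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="gloo")   # "nccl" on GPU clusters
        rank, world = dist.get_rank(), dist.get_world_size()

        # Data parallelism: every rank holds the full model and averages gradients,
        # which appears on the network as an all-reduce per gradient bucket.
        grad_bucket = torch.ones(1024)
        dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM)
        grad_bucket /= world

        # Tensor (model) parallelism: each rank holds a shard of a layer and must
        # exchange activations, e.g. all-gather the partial outputs at every layer.
        local_activation = torch.full((256,), float(rank))
        gathered = [torch.empty(256) for _ in range(world)]
        dist.all_gather(gathered, local_activation)

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()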

💡Latency

In AI training and inference, latency is the system's response time, in particular the time to the first token and the time per incremental token. Low latency is essential for a good user experience; for example, users should not wait more than about a second before seeing the first response.

💡Distributed Inference

Distributed inference runs a single inference request across multiple systems in parallel, which becomes necessary when the model is too large to fit on a single GPU or host. It requires slicing the model and moving data between systems, which places higher demands on network bandwidth and latency.

💡System Design

System design covers building and optimizing the hardware and software architecture needed for AI model training and inference. With large language models, system design must account for compute capability, network topology, and memory throughput to support efficient training and fast inference.

💡RDMA (Remote Direct Memory Access)

RDMA is a memory-to-memory data transfer technology that bypasses the CPU, reducing latency and improving transfer efficiency. RDMA technologies such as InfiniBand and RoCE play an important role in building large AI training clusters.

💡GPU (Graphics Processing Unit)

A GPU is a hardware accelerator designed for graphics and highly parallel computation. In AI, GPUs provide the parallel compute used to train and run large language models; as noted in the talk, tens of thousands of GPUs may be needed to meet the compute demand.

Highlights

Generative AI is one of the hottest topics today; it is about creating and generating new, realistic content.

Generative AI differs from AI models that understand existing information; it focuses on generating new content.

Generative AI opens up huge new opportunities in applications such as image, video, and text generation.

Since 2015, big increases in compute capability and advances in network technology have been major drivers of progress in generative AI.

Meta has contributed significantly to generative AI, for example generating a convincing image from a prompt of a small cactus wearing sunglasses in the Sahara desert.

Large language models (LLMs) are pushing the limits of infrastructure and pose new challenges for system design, especially the network subsystem.

LLM training and inference require enormous compute, which creates interesting problems for the network subsystem.

LLM training needs tens of thousands of GPUs to finish in a reasonable amount of time, which significantly changes the character of AI workloads in the data center.

The decode and prefill stages of LLM inference place distinctly different demands on the system.

Meta's latest large language model, Llama 2 with 70 billion parameters, was trained on 2 trillion tokens, roughly 800 zettaFLOPs of compute.

The 34-billion-parameter Llama 2 model was trained in a production cluster on a RoCE network fabric, showing that LLM training is possible on more commodity network hardware.

The amount of data fed into these models is growing exponentially, which requires more GPUs and more compute.

To innovate faster and train more complex models, Meta's vision is to reach more than 30 exaflops of compute.

Simple data parallelism is no longer enough when training large models; other schemes such as model parallelism or pipeline parallelism are needed.

LLM inference has also become a networking problem: it must run distributed across multiple systems to meet latency targets.

Building large clusters requires more attention to latency and reliability, and to troubleshooting at both the software and hardware level.

As models and clusters grow, the challenges of model parallelism become more pronounced, requiring higher bandwidth and lower latency to implement effectively.

Transcripts

00:05

Hey, I'm Jongsoo from Meta, and Petr and I are going to talk about networking for GenAI training and inference clusters. Generative AI is one of the hottest topics these days, and it's about creating, generating new and realistic content. Before generative models became popular, AI models were often used to understand existing information, like image classification and segmentation. So generative AI is about generating new content versus understanding existing content; that's the main difference. And generative AI opens up huge new opportunities and new applications, for example image and video generation, and also text generation.

00:59

Generative AI goes back to 2015, when Geoff Hinton's lab at the University of Toronto showed a generated image of a bowl of bananas on a table, and you can notice how low-resolution it is. In the next few years we've seen a lot of breakthroughs, for example DALL-E and Stable Diffusion for image generation and GPT for text generation. One of the important enabling technologies from 2015 until now is the huge amount of compute capability available, and the network technologies that connect many accelerators have played a very important role.

01:44

Meta has contributed to this field significantly. For example, this is work from this year: by giving a prompt of a small cactus wearing sunglasses in the Sahara desert, we get a very convincing, photographic image compared to the images shown on the previous slide. And of course there are large language models from Meta like Llama, and we can build a chatbot from these models to do large-language-model-based knowledge discovery. LLMs are actually the ones usually pushing the limits of infrastructure, so in this talk we are going to focus more on the LLMs.

02:34

GenAI, and specifically large language models, have big implications for system design, especially for the network subsystems. Recommendation models have been the primary AI workload in Meta's data centers, but large language models have very different characteristics compared to the recommendation models. First, large language model training and inference require much more compute, and because of this, especially for large language model training, we need a huge number of accelerators to finish the training in a reasonable amount of time. This creates very interesting problems for the network subsystem. Also, interestingly, even within LLM inference there are very diverse characteristics: LLM inference consists of two stages called decode and prefill, and decoding has a very low latency requirement.

03:39

Let's talk about the details, starting with the compute demand. This table compares how much compute we need for LLMs versus the recommendation models. LLMs require multiple orders of magnitude more compute than the recommendation models. For example, for LLM training, for each sentence we need about a petaFLOP of compute, and then we need to train with hundreds of billions of sentences. The size of the models and the amount of data we are feeding to those models have been increasing, and this is why we need tens of thousands of GPUs for large language model training.

04:18

LLM inference also requires a huge amount of compute to provide a reasonable user experience: within our low-latency targets we need a few petaFLOPs of compute. You can notice that this huge amount of compute cannot be satisfied by just eight GPUs per host, and this is why we need distributed inference. So a cluster of GPUs is not only needed for training anymore; we also need it for inference, and this is another interesting problem for the network subsystem.
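
To make the "more than eight GPUs per host" point concrete, here is an illustrative sizing of prefill compute against a single 8-GPU host. Every constant below (model size, per-accelerator peak, delivered efficiency, time-to-first-token target) is an assumption for illustration, not a figure from the talk.

    # Illustrative prefill sizing against one 8-GPU host (all constants are assumptions).
    PARAMS = 70e9                 # assumed 70B-parameter model
    TTFT_TARGET_S = 1.0           # sub-second time to first token
    HOST_GPUS = 8
    PEAK_FLOPS_PER_GPU = 1e15     # ~1 PFLOP/s class accelerator
    EFFICIENCY = 0.4              # assumed delivered fraction of peak

    def prefill_flops(prompt_tokens):
        # forward pass only: ~2 FLOPs per parameter per prompt token
        return 2 * PARAMS * prompt_tokens

    host_sustained = HOST_GPUS * PEAK_FLOPS_PER_GPU * EFFICIENCY   # ~3.2 PFLOP/s per host

    for prompt in (2_000, 8_000, 32_000):
        need = prefill_flops(prompt) / TTFT_TARGET_S               # FLOP/s to hit the TTFT target
        print(f"{prompt:>6}-token prompt: ~{need/1e15:4.1f} PFLOP/s needed "
              f"(~{need/host_sustained:.1f}x one 8-GPU host)")

Under these assumptions, short prompts fit comfortably within one host, but long prompts on large models already exceed what a single 8-GPU host can deliver within the latency budget, which is the motivation for distributed inference.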

05:01

Here are more concrete examples. These are the recent large language models trained at Meta, and the latest, Llama 2 with 70 billion parameters, was trained with 2 trillion tokens. That needs roughly 800 zettaFLOPs to finish the training, which translates into about 1.7 million GPU hours assuming NVIDIA A100 GPUs, or more than one month even if we are using 2,000 A100 GPUs. This is a huge amount of compute, and these foundational LLM trainings have been done in the Research SuperCluster. But I'd like to highlight that one of the latest models, the 34-billion-parameter Llama 2, has been trained in a production cluster using a RoCEv2 network fabric, and we were able to achieve similar speed and scalability compared to InfiniBand. To the best of our knowledge, this is probably one of the largest production use cases of RoCEv2, and we hope this can help democratize LLM training using more commodity network hardware. Adi at Meta will present more details about this, so if you are interested you can watch his talk at the same event.

06:34

On the complexity side, the amount of data we are feeding into these models has been increasing exponentially, and we don't expect that trend to stop anytime soon. This is the reason why we need a lot of GPUs. We are using about 2,000 GPUs these days, but we don't think that's going to be enough going forward, so we are thinking about 32,000 GPUs and even beyond. Our vision is achieving more than 30 exaFLOPs, which corresponds to about one third of the theoretical peak compute capability provided by 32,000 GPUs. This will enable training the Llama model in less than one day instead of more than one month, which will enable much faster innovation and also much more complex models trained with more data.

07:40

One of the challenges in training these large models using a huge number of accelerators is that a simple parallelization scheme is running out of steam. The current most common way of parallelizing these models is called data parallelism, which parallelizes across the inputs, but that by itself is not enough anymore. We need to use other parallelization schemes like model parallelism or pipeline parallelism; basically we need to slice the model along multiple dimensions. By combining multiple ways of parallelization, we generate diverse patterns of communication, and that is also a very interesting problem for the network.
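
To make "slicing along multiple dimensions" concrete, here is a minimal, Megatron-style sketch of mapping a flat GPU rank onto data-, pipeline-, and tensor-parallel coordinates. The group sizes are made up for illustration and are not Meta's actual configuration.

    # Map a flat rank id onto (data, pipeline, tensor) parallel coordinates.
    def parallel_coords(rank, tp_size, pp_size, dp_size):
        """Tensor parallelism varies fastest (kept inside the scale-up domain),
        data parallelism slowest (spread across the scale-out fabric)."""
        assert rank < tp_size * pp_size * dp_size
        tp = rank % tp_size
        pp = (rank // tp_size) % pp_size
        dp = rank // (tp_size * pp_size)
        return dp, pp, tp

    TP, PP, DP = 8, 4, 16          # 8 * 4 * 16 = 512 GPUs total (illustrative)
    for r in (0, 7, 8, 31, 32, 511):
        dp, pp, tp = parallel_coords(r, TP, PP, DP)
        print(f"rank {r:3d} -> data-parallel group {dp:2d}, pipeline stage {pp}, tensor shard {tp}")

Each coordinate corresponds to a different communication pattern: gradient all-reduce within a data-parallel group, point-to-point activation passing between pipeline stages, and per-layer activation collectives within a tensor-parallel group.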

08:35

LLM inference is a very interesting problem for system design. For a good user experience, we typically care about two latency metrics. The first one is called time to first token: basically, we don't want users to wait too long until they start seeing the first response, and typically we want that to be less than one second. The second latency metric is called time per incremental token: once you start generating tokens, we don't want them to be too slow, and we typically want that to be less than 50 milliseconds, so basically we are seeing a token every 50 milliseconds.

09:21

Let's look at more details. LLM inference consists of two stages, prefill and decode. Prefill determines the time to first token, and decoding determines the time per incremental token. What's interesting is that they have very distinctly different system demands. Prefill is about understanding the user prompt, and you can work on the multiple tokens of the user prompt in parallel, so it can be very compute-intensive. On the other hand, the decoding stage needs to read a huge amount of data while generating output tokens one by one, so it becomes very memory-intensive. One stage is compute-intensive and the other is memory-intensive, so the inference system needs to provide both very high compute throughput and very high memory throughput, and that's the reason why it's hard to contain LLM inference within one host, typically with eight GPUs. Going forward, we expect we will need distributed inference for LLM inference, so we need a small cluster for inference.

10:44

Lastly, to recap the first part of the talk: LLMs require orders of magnitude more compute compared to recommendation models, and training in particular requires tens of thousands of accelerators to finish in a reasonable amount of time. Because of that, we need to use different types of parallelization, and that generates diverse communication patterns, which is a very interesting problem for network design. Inference also requires a small cluster, so inference also becomes a network problem. Now Petr is going to go more in depth on the system design for LLM training and inference. Thank you.
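
A rough, roofline-style illustration of the compute-bound versus memory-bound contrast described above, assuming a 70-billion-parameter model kept in 16-bit weights and served at batch size 1, with a deliberately simplified weight-streaming model; all values are illustrative.

    # Why prefill is compute-heavy and decode is memory-bandwidth-heavy (simplified model).
    PARAMS = 70e9
    BYTES_PER_PARAM = 2            # fp16 / bf16 weights

    # Prefill: all prompt tokens are processed in parallel before the first output token.
    prompt_tokens = 4000
    prefill_flops = 2 * PARAMS * prompt_tokens      # ~2 FLOPs / parameter / token, forward only
    prefill_bytes = PARAMS * BYTES_PER_PARAM        # weights are read roughly once

    # Decode: tokens are produced one at a time, so the weights are re-read per token.
    decode_flops_per_token = 2 * PARAMS
    decode_bytes_per_token = PARAMS * BYTES_PER_PARAM

    print(f"prefill arithmetic intensity: ~{prefill_flops / prefill_bytes:.0f} FLOPs/byte")
    print(f"decode  arithmetic intensity: ~{decode_flops_per_token / decode_bytes_per_token:.0f} FLOPs/byte")
    print(f"50 ms/token at batch 1 needs ~{decode_bytes_per_token / 0.05 / 1e12:.1f} TB/s of memory bandwidth")

The intensity gap (thousands of FLOPs per byte in prefill versus roughly one in decode) is what pushes the serving system toward both high compute throughput and high aggregate memory bandwidth, hence more than one GPU and eventually more than one host.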

11:36

Jongsoo, thanks for the excellent presentation. My name is Petr, I'm a network engineer, and in my section we're going to dive deeply into the effect that large language models, and GenAI in general, have on networking topologies and other parameters of our fabrics.

11:53

As we covered briefly in the previous section, the biggest change going from ranking models to LLMs was the increase in compute capacity requirements. What this means is that we now need to build much larger clusters to support training of these models. A big cluster naturally separates into two large domains: one is scale-out and the other is scale-up. Let's review briefly. The scale-out domain is what connects the compute pods together; you can think of racks of servers, small pods. Scale-out is where we use technologies like InfiniBand or RoCE to implement connectivity for tens of thousands of nodes, so this is where scalability is most important, not so much the speeds. Still, you have connectivity at the rate of 50 gigabytes per second, and that's gigabytes, not gigabits. In contrast, the scale-up domain is usually contained in one server; this is your NVLink technology or xGMI, to give a few examples. Compared with scale-out it covers a short distance but very high bandwidth: in contemporary systems the delta between scale-out bandwidth and scale-up bandwidth is about 9x, which means from 50 gigabytes per second we move to 450 gigabytes per second.

13:17

As we mentioned previously, when you train a model in parallel fashion you generate two types of parallelism at a very high level: one is data parallel and the other is model parallel. The scale-out part of the topology naturally maps to the data-parallel traffic, and the scale-up domain encapsulates the model-parallel traffic. Now let's take a look at how this looks topologically.

13:41

Jongsoo spoke about the goal: we need to build topologies which currently contain up to 32K GPUs, or accelerators. Even though it's a large number, it's not the limit. Here we are looking at the fabric that instantiates such a topology. For network engineers this isn't too surprising; in fact, this is a well-known Clos topology with multiple tiers of connectivity. At the very bottom you have your racks; in our case each rack has 16 GPUs in two servers, so effectively every rack has two domains of scale-up bandwidth. Above those racks you have your scale-out fabric; this is where InfiniBand or RoCE operates. In our example we have a RoCE fabric. As mentioned before, we deploy both RoCE and InfiniBand fabrics, but RoCE is somewhat unique here: you see more public examples of InfiniBand than of RoCE, so this slide demonstrates the RoCE instance. Notice that in each layer above the racks we have 18 cluster switches. This is important because it gives you additional capacity to protect against failures; as you will see further on, failures and reliability are of utmost importance for these clusters and these designs.

14:56

There are a lot of details to implementing RoCE, which Adi will cover separately in his presentation on our RoCE implementation, but I want to stress that this is pushing RoCE, or InfiniBand for that matter, to its limits, deploying in very large clusters of thousands of GPUs.
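
A toy sketch of how a GPU's position in this layout decides which domain its traffic crosses. The 8-GPUs-per-server, 2-servers-per-rack arrangement follows the talk; the bandwidth classes are the approximate figures quoted earlier, and the mapping itself is illustrative.

    # Locate a GPU by global rank and decide which fabric its peer traffic uses.
    GPUS_PER_HOST = 8
    HOSTS_PER_RACK = 2
    GPUS_PER_RACK = GPUS_PER_HOST * HOSTS_PER_RACK   # 16, as in the talk

    def locate(rank):
        rack = rank // GPUS_PER_RACK
        host = (rank % GPUS_PER_RACK) // GPUS_PER_HOST
        gpu = rank % GPUS_PER_HOST
        return rack, host, gpu

    def link_between(a, b):
        ra, ha, _ = locate(a)
        rb, hb, _ = locate(b)
        if (ra, ha) == (rb, hb):
            return "scale-up (intra-server, ~450 GB/s class)"
        return "scale-out fabric (RoCE/InfiniBand, ~50 GB/s class)"

    for a, b in [(0, 5), (0, 9), (0, 16)]:
        print(f"GPU {a} <-> GPU {b}: {link_between(a, b)}")

Note that even two GPUs in the same rack but different servers already cross the scale-out fabric, which is why the talk later argues for growing the scale-up domain beyond a single server.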

15:15

This slide captures what happens inside these fabrics. Just to recap once again: when you train models, you generate two types of traffic patterns, one stemming from data parallelism and another from model parallelism. The most challenging part is model parallelism, but before we get there let's take a look at the data-parallel patterns. There you generate collectives like all-reduce, all-gather, and reduce-scatter. These are well known; they've been known for many years by practitioners training models. The message size here is usually substantial, but it grows smaller and smaller as you increase the size of the scale-out domain, and this is where some of the challenges become more evident. As we'll see later, latency becomes more visible; latency here means propagation latency. Notably, however, the data-parallel patterns typically can be overlapped efficiently with compute. It's not universal; in some cases you won't get this for free, and you have to work and optimize the model to achieve efficient overlap. But very often the scale-out part and data parallelism can be well overlapped with compute, which makes them less challenging, so to speak, for networking.

16:27

Not so much for model-parallel traffic. Model parallelism is a result of slicing the network into pieces and trying to pass activations between those components. There you have your familiar all-reduce or all-to-all patterns, which come from, say, tensor parallelism or pipeline parallelism, and here the bandwidth demand is much, much higher. This is where you really need the scale-up bandwidth to be efficient, because the messages are still pretty large and your demand for bandwidth is 10x, if not more, to realize this parallelism. Most importantly and critically, it's much harder to overlap model-parallel execution with the compute part, so this is where latency and bandwidth are much more important than for data parallelism.

17:18

This diagram demonstrates how all these collectives, so to speak, map to the network topology. Here you have the scale-out collectives, all-reduce, reduce-scatter, and all-gather, which map onto the cluster switches above the racks. This is where you see all these rings, often resulting from reduce-scatter, that span multiple switches and go across all the racks in the topology. For instance, if your training setup is 16K GPUs, you typically see ring sizes of about 1,000 GPUs spanning across all the switches. This is where latency starts to add up, but it is also where you still can overlap these collectives with the computation. At the bottom of this tree you see the model-parallel part, all-reduce and others, which map to the scale-up domains, for example the NVLink interconnects in NVIDIA's case. However, once you cross a single server, a single board, you have to run this traffic across the scale-out fabric, and this is where you see the impact of much lower bandwidth. As you will see, this bottleneck dictates the need to grow the scale-up domains beyond one server.
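
A simple ring all-reduce model of why propagation latency becomes visible as the data-parallel ring grows. The ~50 GB/s per-GPU scale-out figure follows the talk; the per-hop latency and gradient-bucket size are illustrative assumptions.

    # Time for one ring all-reduce as a function of ring size (simplified model).
    def ring_allreduce_time(bytes_total, ring_size, bw_bytes_per_s, per_hop_latency_s):
        steps = 2 * (ring_size - 1)            # reduce-scatter + all-gather phases
        chunk = bytes_total / ring_size        # message per step shrinks with ring size
        return steps * (chunk / bw_bytes_per_s + per_hop_latency_s)

    BW = 50e9          # ~50 GB/s scale-out bandwidth per GPU (from the talk)
    LAT = 10e-6        # ~10 us per hop across switches/fiber/transceivers (assumption)

    for ring in (8, 128, 1024):
        t = ring_allreduce_time(100e6, ring, BW, LAT)      # a 100 MB gradient bucket
        lat_share = 2 * (ring - 1) * LAT / t
        print(f"ring of {ring:4d}: {t*1e3:6.2f} ms total, {lat_share:5.1%} of it is exposed latency")

With a fixed bucket, the per-step message shrinks as the ring grows, so the fixed per-hop latency goes from a few percent of the total at small scale to the dominant term at thousand-GPU rings, matching the "latency becomes visible" observation above.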

18:34

So now let's recap and look at what changes with large clusters. You could say: okay, the scale is great, but it looks like the same traffic and the same problems all over again. Well, pretty much so; however, it's important to reiterate that latency starts to become important. What's interesting, what we observed in AI training, is that latency in the network wasn't as critical as it typically is in HPC applications, most of the time because you can overlap collectives with computation. However, with LLM training on very large clusters, you now have machines which span whole buildings, so you have latency from switches, from the fiber, even from the transceivers, which keeps adding up. As it adds up, it starts to be visible for smaller messages: as mentioned before, as you increase the parallel domain size your message size decreases, and this is where you start to see the exposed latency and you really have to pay attention and manage it much better.

19:34

Now for the second part, reliability. Naturally, as you should expect, as you grow the network size there are more components and more elements, and they fail more frequently. To be fair, most of the failures happen in the software land, not so much in hardware, but hardware at this scale also exhibits issues. Often, when you bring up systems for the first time, you have to go through a burn-in process, identify the bad components, eliminate them, replace them, and so on, and this takes time. The second problem is that in a large system it takes much longer to do fault isolation: you have to track the issue across many more components, which often takes much more time than it does in a smaller setup. All of that eats into training time: more time to debug, less time to run the actual computation.
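
A back-of-the-envelope model of why failures dominate at scale, assuming an exponential failure model and a purely illustrative per-accelerator MTBF; these are not measured numbers from the talk.

    # A cluster-wide job is only "clean" if every component stays up for the whole run.
    import math

    MTBF_HOURS_PER_GPU = 50_000     # assumed mean time between failures per accelerator

    def expected_failures(num_gpus, run_hours):
        return num_gpus * run_hours / MTBF_HOURS_PER_GPU

    def p_no_failure(num_gpus, run_hours):
        # exponential model: probability a run of this length sees zero failures
        return math.exp(-expected_failures(num_gpus, run_hours))

    for gpus in (2_000, 16_000, 32_000):
        print(f"{gpus:6d} GPUs: ~{expected_failures(gpus, 24*30):5.1f} failures/month, "
              f"P(clean 24h run) = {p_no_failure(gpus, 24):.2f}")

Even with an optimistic per-device failure rate, the chance of a full day passing without any failure collapses as the cluster grows, which is why burn-in, fast fault isolation, and checkpoint/restart machinery matter so much at this scale.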

20:24

Finally, the thing that Jongsoo mentioned: inference for LLMs is now becoming a networking problem. Previously we could contain inference in a single GPU; you often hear that when you run inference you typically use only one GPU, or even a single PCIe card. In the LLM case you have two challenges. First of all, these models grow so large that you can't contain them in a single GPU's memory, or even a single host's memory; you have to go across hosts just to keep the model weights and parameters together. Secondly, you need more compute to achieve the target latency goals: for example, during the prefill stage you need much more compute to achieve a latency of, let's say, one second to the first token for large models and long sequence lengths. If you want to go to sequence lengths of 32K, 64K, or even more, you have to go with distributed inference, and that means you now have a mini cluster that implements the forward pass in a distributed fashion. You have to run tensor slicing, model parallelism, across multiple systems, and if you are bottlenecked by the scale-out fabric, well, that's your problem to solve, because now you're much, much slower.
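
A rough memory-budget sketch of why long-context serving spills beyond one GPU and then beyond one host. The model dimensions (70B parameters, 80 layers, hidden size 8192), 16-bit precision, batch size, and 80 GB of HBM per GPU are all assumptions for illustration, and the full-attention KV-cache formula below ignores optimizations such as grouped-query attention.

    # Weights plus KV cache versus per-GPU memory capacity (all constants assumed).
    BYTES = 2                      # fp16 / bf16
    PARAMS = 70e9
    LAYERS, HIDDEN = 80, 8192
    GPU_HBM = 80e9

    weights = PARAMS * BYTES       # ~140 GB of weights alone

    def kv_cache_bytes(seq_len, batch):
        # K and V, one vector of size `hidden` per layer per token per sequence
        return 2 * LAYERS * HIDDEN * BYTES * seq_len * batch

    for seq in (4_096, 32_768, 65_536):
        total = weights + kv_cache_bytes(seq, batch=8)
        gpus = -(-total // GPU_HBM)            # ceiling division
        print(f"seq {seq:6d}, batch 8: ~{total/1e9:5.0f} GB -> at least {gpus:.0f} GPUs for capacity alone")

Under these assumptions, the weights alone already exceed a single 80 GB GPU, and long sequence lengths with modest batching push the footprint past a single 8-GPU host, which is exactly the regime where the forward pass has to be sliced across multiple systems.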

21:39

As a result of this trend, we foresee that these mini inference clusters are going to go to 16, 32, 64 GPUs even in the current generation, and you can extend that trajectory into the future, but I don't think it will go beyond 64 in the next couple of years.

21:55

Now, to recap what we have covered in this section: once again, the biggest shift with LLMs was the tremendous increase in computational demand. This dictates everything, as you have seen: higher computation requires larger clusters, it requires larger inference fabrics, and so on. Large clusters have their own issues: reliability problems, issues with latency, and various topological structures that require optimization. The biggest trend we're seeing is that scale-up connectivity is now supposed to go beyond a rack or beyond the node; this is probably the biggest change we've seen in topologies in the last three or four years, the explosion of bandwidth that you need to realize model parallelism. And once again, inference is now also a network problem: you have to run inference across multiple systems, and it becomes like a mini cluster, similar to training but only doing the forward pass, not the backward pass. That's it for my part. Thank you so much for listening, and Jongsoo and I can take your questions now.
