Networking for GenAI Training and Inference Clusters | Jongsoo Park & Petr Lapukhov
Summary
TLDR: In this talk, Jongsoo and Petr discuss the impact of large language models (LLMs) on system design, especially the design of the network subsystem. They point out that, compared with earlier recommendation models, LLMs need far more compute for both training and inference. Finishing training in a reasonable amount of time requires tens of thousands of accelerators, which challenges the network subsystem. They cover parallelization techniques such as data parallelism and model parallelism, and explain how combining them produces diverse communication patterns that place new demands on network design. They also discuss the low-latency requirements of inference and how distributed inference can meet them. Finally, they stress the future need for even larger GPU clusters and for more than 30 exaflops of compute to enable faster training and more complex models.
Takeaways
- 🚀 Generative AI is one of the hottest topics today; it is about creating new, realistic content, unlike AI models that understand existing information.
- 📈 Generative AI opens up huge opportunities in new application areas such as image, video, and text generation.
- 📚 Since 2015, large gains in compute capability and advances in network technology have been key enablers of generative AI.
- 🌱 Meta has contributed significantly to generative AI, for example generating convincing images from text prompts.
- 💻 Large language models (LLMs) pose new challenges for system design, especially for the network subsystem.
- ⚙️ LLM training and inference need far more compute resources, creating different demands on the network subsystem.
- 📊 Compared with recommendation models, LLMs require multiple orders of magnitude more compute.
- 🔍 LLM training needs tens of thousands of GPUs to finish in a reasonable amount of time.
- 🔗 Meta's latest LLMs, such as the 70-billion-parameter Llama 2, are trained on 2 trillion tokens, about 1.7 million GPU-hours.
- 🔄 Efficient training requires combining parallelization schemes such as data parallelism, model parallelism, and pipeline parallelism.
- 🔍 Inference has also become an interesting system-design problem because it needs low latency and high memory throughput.
- 🌐 As models and data volumes grow, the network subsystem faces more challenges and needs more capable network hardware and architectures.
Q & A
How does generative AI differ from traditional AI models?
-Generative AI focuses on creating and generating new, realistic content, whereas traditional AI models are typically used to understand existing information, for example image classification and segmentation. The key difference is generating new content versus understanding existing content.
How far back does generative AI go?
-It goes back to 2015, when Geoff Hinton's lab at the University of Toronto showed a generated image of a bowl of bananas on a table.
Which technical advances had the biggest impact on image and text generation?
-DALL-E and Stable Diffusion had a major impact on image generation, and GPT on text generation.
Why do large language models (LLMs) need so much compute for training and inference?
-LLMs process enormous amounts of data with complex computations, so training needs a large number of accelerators (such as GPUs) to finish in a reasonable time, and inference also needs substantial compute to deliver a good user experience.
What network-subsystem challenges did Meta face in LLM training?
-The challenges include needing a huge number of accelerators to complete training, plus distributed inference at serving time, both of which require the network subsystem to handle large data transfers at low latency.
Why are additional model-parallelization techniques needed when training LLMs?
-Data parallelism alone is no longer sufficient for LLM training, so other schemes such as model parallelism and pipeline parallelism must be combined with it, producing diverse communication patterns across multiple dimensions.
Which network technologies does Meta use for LLM training?
-Meta uses both RoCE and InfiniBand fabrics; the RoCE fabric achieved speed and scalability similar to InfiniBand in a production cluster.
Why has LLM inference also become a network problem?
-Because models have grown so large that a single GPU's or host's memory can no longer hold them, inference must be distributed across multiple systems, which turns it into a network problem.
What are the characteristics and challenges of data parallelism and model parallelism in LLM training?
-Data parallelism maps to the larger scale-out domain; its challenge is that message sizes shrink as scale grows, making latency more visible. Model parallelism needs much higher bandwidth, is harder to overlap with compute, and therefore has stricter latency and bandwidth requirements.
Why do failures and reliability matter in LLM training?
-LLM training involves a huge number of hardware and software components, and failures become more frequent as the system scales. Fault isolation and debugging take longer in large systems, which hurts training efficiency and reliability.
What is Meta's vision for future LLM training?
-Meta's vision is to reach more than 30 exaflops of compute, which would cut LLM training time from about a month to less than a day, accelerating innovation and enabling more complex models trained on more data.
Outlines
🚀 Introducing generative AI and its applications at Meta
Jongsoo from Meta introduces the concept of generative AI, which, unlike AI models that understand existing information, focuses on creating new content. Applications include image, video, and text generation. Jongsoo mentions the low-resolution banana image generated in 2015 by Geoff Hinton's lab at the University of Toronto, and the breakthroughs in image and text generation that followed, such as DALL-E, Stable Diffusion, and the GPT models. He emphasizes the role of compute capability and network technology in driving generative AI forward, and Meta's contributions in this area, such as generating convincing images from prompts. He also discusses the impact of large language models (LLMs) on system design, particularly the network subsystem.
📈 Training and inference demands of large language models
Jongsoo then discusses the training and inference demands of LLMs. He mentions Meta's latest large language models, such as the 70-billion-parameter Llama 2, which was trained on 2 trillion tokens, requiring roughly 800 zettaFLOPs and 1.7 million GPU-hours. He also mentions a model trained on a production cluster using a RoCEv2 network fabric, showing that LLM training can be democratized with more commodity network hardware. Jongsoo highlights the exponential growth in the amount of data fed into the models, and the possibility of needing more than 32,000 GPUs to meet future compute demand. He also discusses the parallelization challenges of training large models, such as data parallelism, model parallelism, and pipeline parallelism, noting that these schemes produce diverse communication patterns, an interesting challenge for network design.
🤖 The far-reaching effects of large language models on system design
Petr, as a network engineer, dives into the impact of LLMs on network topology and other fabric parameters. He emphasizes the dramatic increase in compute requirements from ranking models to LLMs, which means building much larger clusters to support model training. Petr discusses the two main domains of a large cluster, the scale-out domain and the scale-up domain, and explains how they map to data-parallel and model-parallel traffic. He describes the challenges that arise from slicing the network into pieces for model parallelism and passing activations between them. He also covers the latency issues that appear in large clusters, and how careful model optimization is needed to effectively overlap network communication with computation.
🔍 Challenges and troubleshooting in large clusters
Petr continues with the challenges of troubleshooting large clusters. As clusters grow, the component count increases and failures happen more frequently, making troubleshooting harder and slower. He also notes the complexity of fault isolation in large systems and how it eats into training time. Petr then discusses inference: as models grow, a single GPU's or host's memory can no longer hold them, so distributed inference is required. He predicts that mini inference clusters will grow to as many as 64 GPUs, and that inference will become a network problem that must run across multiple systems.
🌟 The future outlook for large language models
Finally, Petr summarizes the enormous impact of LLMs on compute demand, which drives the need for larger clusters, larger inference fabrics, and optimized network topologies. He stresses that scale-up connectivity now needs to go beyond the rack or the node, the biggest change in network topology over the past few years. He also reiterates the need for model parallelism, and that inference has become a problem that must run across multiple systems, similar to training but executing only the forward pass. Petr and Jongsoo close the talk by inviting questions from the audience.
Keywords
💡Generative AI
💡Large Language Models (LLMs)
💡Compute Capability
💡Network Technologies
💡Model Parallelism
💡Data Parallelism
💡Latency
💡Distributed Inference
💡System Design
💡RDMA (Remote Direct Memory Access)
💡GPU (Graphics Processing Unit)
Highlights
Generative AI is one of the hottest topics today; it is about creating and generating new, realistic content.
Unlike AI models that understand existing information, generative AI focuses on generating new content.
Generative AI opens huge new opportunities in applications such as image, video, and text generation.
Since 2015, large gains in compute capability and advances in network technology have strongly driven generative AI forward.
Meta has contributed significantly to generative AI, for example generating a convincing image from a prompt of a small cactus wearing sunglasses in the Sahara desert.
Large language models (LLMs) are pushing the limits of infrastructure, posing new challenges for system design and especially the network subsystem.
LLM training and inference require large amounts of compute, which creates interesting problems for the network subsystem.
LLM training needs tens of thousands of GPUs to finish in a reasonable amount of time, significantly changing the characteristics of AI workloads in the data center.
Within LLM inference, the decode and prefill stages place distinctly different demands on the system.
Meta's latest LLM, Llama 2 with 70 billion parameters, was trained on 2 trillion tokens, requiring roughly 800 zettaFLOPs of compute.
The Llama 2 34B model, trained on a production cluster with a RoCEv2 network fabric, shows that LLM training is possible with more commodity network hardware.
The amount of data fed into the models is growing exponentially, requiring more GPUs and more compute.
To innovate faster and train more complex models, Meta's vision is to reach more than 30 exaflops of compute.
Simple data parallelism is no longer enough when training large models; other schemes such as model parallelism and pipeline parallelism are needed.
LLM inference has also become a network problem; it must run distributed across multiple systems to meet target latencies.
Building large clusters requires attention to latency and reliability, and to troubleshooting at both the software and hardware level.
As models and clusters grow, the challenges of model parallelism become more pronounced, requiring higher bandwidth and lower latency to realize efficiently.
Transcripts
Hey, I'm Jongsoo from Meta, and Petr and I are going to talk about networking for GenAI training and inference clusters.

Generative AI is one of the hottest topics these days. It is about creating and generating new and realistic content. Before generative models became popular, AI models were often used to understand existing information, like image classification and segmentation. Generative AI is about generating new content versus understanding existing content; that's the main difference. And generative AI opens up huge new opportunities and new applications, for example image and video generation, and also text generation.
Generative AI goes back to 2015, when Geoff Hinton's lab at the University of Toronto showed a generated image of a bowl of bananas on a table. You can notice how low resolution it is. In the next few years we've seen a lot of breakthroughs, for example DALL-E and Stable Diffusion for image generation and GPT for text generation. One of the important enabling technologies from 2015 until now is the huge amount of compute capability available, and the network technologies that connect many accelerators played a very important role.

Meta has contributed to this field significantly. For example, this is work from this year: by giving a prompt of a small cactus wearing sunglasses in the Sahara desert, we get a very convincing, photorealistic image compared to the images shown on the previous slide. And of course there are large language models from Meta, like Llama, and we can create a chatbot from these models to do large-language-model-based knowledge discovery. LLMs are actually the ones usually pushing the limits of infrastructure, so in this talk we are going to focus more on LLMs, and specifically on what large language models mean for system design, especially for the network subsystem.
Recommendation models have been the primary AI workload in Meta data centers, but large language models have very different characteristics compared to recommendation models. First, training and inference for large language models require much more compute. Because of this, especially for large language model training, we need a huge number of accelerators to finish the training in a reasonable amount of time, and this creates very interesting problems for the network subsystem. Also, interestingly, even within LLM inference there are very diverse characteristics: LLM inference consists of two stages called prefill and decode, and decode has a very low latency requirement. Let's talk about the details, starting with the compute demand.
This table compares how much compute we need for LLMs versus recommendation models. LLMs require multiple orders of magnitude more compute than recommendation models. For example, for LLM training we need roughly a petaflop of compute per sentence, and we need to train on hundreds of billions of sentences. The size of the models and the amount of data we are feeding to those models have been increasing, and this is why we need tens of thousands of GPUs for large language model training.

LLM inference also requires a huge amount of compute: to provide a reasonable user experience within a low latency budget, we need a few petaflops of compute. You can notice that this huge amount of compute cannot be satisfied by just eight GPUs in one host, and this is why we need distributed inference. So clusters of GPUs are not only needed for training anymore; we also need them for inference, and this is another interesting problem for the network subsystem.
For more concrete examples, these are the recent large language models trained at Meta. The latest, Llama 2 with 70 billion parameters, was trained on 2 trillion tokens. That needs roughly 800 zettaFLOPs to finish the training, which translates into about 1.7 million GPU-hours assuming NVIDIA A100 GPUs, or more than one month even if we are using 2,000 A100 GPUs. This is a huge amount of compute.
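As a rough sanity check on these figures, here is a minimal back-of-the-envelope sketch using the common ~6 FLOPs-per-parameter-per-token estimate for transformer training. The A100 peak throughput and the model FLOPs utilization (MFU) value are assumptions chosen for illustration, not numbers from the talk.

```python
# Back-of-the-envelope check of the Llama 2 70B training numbers quoted above.
# Assumptions (not from the talk): ~6 FLOPs per parameter per token for a
# transformer forward+backward pass, A100 BF16 peak of 312 TFLOP/s, and an
# illustrative model FLOPs utilization (MFU) of ~45%.

params = 70e9          # Llama 2 70B parameters
tokens = 2e12          # 2 trillion training tokens
flops = 6 * params * tokens
print(f"total training FLOPs ~ {flops:.2e}")          # ~8.4e23, i.e. ~840 zettaFLOPs

a100_peak = 312e12     # A100 BF16 peak FLOP/s
mfu = 0.45             # assumed utilization
gpu_hours = flops / (a100_peak * mfu) / 3600
print(f"GPU-hours ~ {gpu_hours:.2e}")                 # ~1.7e6 GPU-hours

gpus = 2000
days = gpu_hours / gpus / 24
print(f"wall-clock with {gpus} GPUs ~ {days:.0f} days")  # ~35 days, i.e. over a month
```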
These foundational LLM trainings have been done on the Research SuperCluster, but I'd like to highlight that one of the latest models, the Llama 2 34-billion-parameter model, has been trained in a production cluster using a RoCEv2 network fabric, and we were able to achieve similar speed and scalability compared to InfiniBand. To the best of our knowledge, this is probably one of the largest production use cases of RoCEv2, and we hope this can help democratize LLM training using more commodity network hardware. Adi will present more details about this, so if you are interested, you can watch his talk at the same event.
Regarding complexity: the amount of data we are feeding into these models has been increasing exponentially, and we don't expect that trend to stop anytime soon. This is the reason we need a lot of GPUs. We are using about 2,000 GPUs these days, but we don't think that's going to be enough going forward, so we are thinking about 32,000 GPUs and even beyond. Our vision is achieving more than 30 exaflops, which corresponds to about one third of the theoretical peak compute capability provided by 32,000 GPUs. This will enable training the Llama model in less than one day instead of roughly one month, which will enable much faster innovation and much more complex models trained with more data.
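Continuing the back-of-the-envelope arithmetic, the "less than one day" claim follows directly from the 30-exaflop target; the total-FLOP figure below is the same assumed estimate as in the previous sketch.

```python
# How the "less than one day" claim follows from the 30-exaflop vision,
# reusing the ~8.4e23 total-FLOP estimate from the previous sketch.
total_flops = 8.4e23       # ~ 6 * 70e9 params * 2e12 tokens (assumed estimate)
sustained = 30e18          # 30 exaFLOP/s target from the talk
hours = total_flops / sustained / 3600
print(f"training time ~ {hours:.1f} hours")  # ~7.8 hours, well under a day
```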
One of the challenges in training these large models with a huge number of accelerators is that simple parallelization schemes are running out of steam. The most common way of parallelizing these models is called data parallelism, which parallelizes across the inputs. But that by itself is not enough anymore, so we need to use other parallelization schemes like model parallelism or pipeline parallelism. Basically, we need to slice the model along multiple dimensions. Combining multiple ways of parallelization generates diverse patterns of communication, and that is also a very interesting problem for the network.
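To make "slicing along multiple dimensions" concrete, here is a minimal sketch of how a flat set of GPU ranks can be mapped onto a hypothetical 3D grid of data-, pipeline-, and tensor-parallel groups. The group sizes are made up for illustration, not Meta's actual configuration. (In PyTorch, this kind of layout is typically built with torch.distributed.device_mesh.init_device_mesh.)

```python
# Minimal sketch: mapping flat GPU ranks onto a 3D parallelism grid
# (data / pipeline / tensor). Group sizes below are hypothetical.
DP, PP, TP = 64, 8, 8          # 64*8*8 = 4096 GPUs in this example

def coords(rank: int) -> tuple[int, int, int]:
    """Return (data, pipeline, tensor) coordinates for a flat rank."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

# Each GPU participates in one communication group per dimension:
#  - its TP group (same dp, pp): all-reduce of activations, latency/bandwidth critical
#  - its PP group (same dp, tp): point-to-point activation sends between stages
#  - its DP group (same pp, tp): gradient all-reduce, usually overlappable with compute
rank = 1234
dp, pp, tp = coords(rank)
print(f"rank {rank}: dp={dp}, pp={pp}, tp={tp}")
```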
LLM inference is also a very interesting problem for system design. For a good user experience we typically care about two latency metrics. The first is called time to first token: we don't want users to wait too long until they start seeing the first response, typically less than one second. The second latency metric is called time per incremental token: once you start generating tokens, we don't want them to be too slow, typically less than 50 milliseconds, so basically users see a new token every 50 milliseconds.
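A minimal sketch of how these two metrics are typically measured around a streaming generator; `generate_stream` here is a hypothetical stand-in for whatever token-streaming API the serving stack exposes.

```python
import time
from typing import Iterable

def measure_latency(generate_stream: Iterable[str]):
    """Measure time-to-first-token and mean time-per-incremental-token."""
    start = time.perf_counter()
    ttft = None
    stamps = []
    for _ in generate_stream:            # each item is one decoded token
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start           # prefill cost shows up here
        stamps.append(now)
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    tpit = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpit                    # talk targets: <1 s and <50 ms

# Hypothetical usage with a fake generator that "decodes" 5 tokens:
fake = iter(["Hello", " world", "!", " How", " are"])
ttft, tpit = measure_latency(fake)
print(f"TTFT={ttft*1e3:.3f} ms, TPIT={tpit*1e3:.3f} ms")
```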
Let's look at more details. LLM inference consists of two stages, prefill and decode. Prefill determines the time to first token, and decode determines the time per incremental token. What's interesting is that they have very distinctly different system demands. Prefill is about understanding the user prompt, and you can work on the many tokens of the prompt in parallel, which is why it can be very compute intensive. In the decode stage, on the other hand, the model needs to read a huge amount of data while generating output tokens one by one, which is why it becomes very memory intensive. So one stage is compute intensive and the other stage is memory intensive.
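A rough sketch of why decode is memory-bound: every generated token has to stream the full set of weights from HBM, so per-token time is floored by memory bandwidth rather than FLOPs. The hardware numbers below are illustrative assumptions, not figures from the talk.

```python
# Why decode is memory-bound: each output token must read every weight once.
# Illustrative assumptions: 70B params in fp16 (2 bytes each), one GPU with
# ~2 TB/s HBM bandwidth and ~300 TFLOP/s of usable compute.
params = 70e9
bytes_per_param = 2
hbm_bw = 2e12                                 # bytes/s
flops_peak = 300e12

# Decode: ~2 FLOPs per parameter per token, but all weights must be streamed.
t_compute = 2 * params / flops_peak           # ~0.5 ms -> not the bottleneck
t_memory = params * bytes_per_param / hbm_bw  # ~70 ms  -> dominates
print(f"decode per token: compute {t_compute*1e3:.1f} ms, memory {t_memory*1e3:.1f} ms")

# ~70 ms/token on one GPU already misses a 50 ms target, motivating
# tensor-sliced inference across multiple GPUs. Prefill, by contrast,
# reuses the same weights across all prompt tokens, so its arithmetic
# intensity is orders of magnitude higher and it is compute-bound.
```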
So the inference system needs to provide very high compute throughput and also very high memory throughput, and that's the reason it's hard to contain inference within one host, typically with eight GPUs. Going forward, we expect we will need distributed inference for LLMs, so we need a small cluster for inference.
Lastly, let's recap the first part of the talk. LLMs require orders of magnitude more compute compared to recommendation models, and training in particular requires tens of thousands of accelerators to finish in a reasonable amount of time. Because of that, we need to use different types of parallelization, which generates diverse communication patterns, a very interesting problem for network design. Inference also requires a small cluster, so inference also becomes a network problem.
Now Petr is going to go in depth on the system design for LLM training and inference. Thank you, Jongsoo.

Thanks for the excellent presentation. My name is Petr, I'm a network engineer, and in my section we're going to dive deeply into the effect that large language models, and GenAI in general, have on networking topologies and other parameters of our fabrics.
As we covered briefly in the previous section, the biggest change in going from ranking models to LLMs was the increase in compute capacity requirements. What this means is that we now need to build much larger clusters to support training these models. A big cluster naturally separates into two large domains: one is scale-out, and the other is a collection of scale-up domains. Let's review briefly. The scale-out domain is what connects the compute pods together; think of racks of servers, small pods. Scale-out is where we use technologies like InfiniBand or RoCE to implement connectivity for tens of thousands of nodes, so this is where scalability is most important, not so much the speeds. Still, you have connectivity at the rate of 50 gigabytes per second, and that's gigabytes, not gigabits.
In contrast, the scale-up domain is usually contained in one server; this is your NVLink technology or xGMI, to give a few examples. Compared with scale-out, it is short distance but very high bandwidth: in a contemporary system the delta between scale-out bandwidth and scale-up bandwidth is about 9x, which means from 50 gigabytes per second we move to 450 gigabytes per second.
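To make the 9x delta concrete, here is a small sketch comparing how long a bandwidth-bound all-reduce would take over each domain, using the standard ring all-reduce cost model; the message size is a made-up but plausible figure.

```python
# Bandwidth-bound ring all-reduce cost: t ~ 2*(N-1)/N * M / B,
# where M is the message size and B the per-link bandwidth.
# Message size here is illustrative (e.g. a large activation tensor).
def allreduce_time(msg_bytes: float, n: int, bw: float) -> float:
    return 2 * (n - 1) / n * msg_bytes / bw

M = 256e6                          # 256 MB message (assumed)
N = 8                              # 8 GPUs in the group
scale_out = 50e9                   # 50 GB/s per the talk
scale_up = 450e9                   # 450 GB/s per the talk

print(f"scale-out: {allreduce_time(M, N, scale_out)*1e3:.2f} ms")  # ~8.96 ms
print(f"scale-up:  {allreduce_time(M, N, scale_up)*1e3:.2f} ms")   # ~1.00 ms
```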
As we mentioned previously, when you train a model in parallel fashion you generate two types of parallelism at a very high level: one is data parallel, and the other is model parallel. The scale-out part of the topology naturally maps to the data-parallel traffic, and the scale-up domain encapsulates the model-parallel traffic. Now let's take a look at how this looks topologically.
Jongsoo spoke about the goal: we need to build topologies that currently contain up to 32k GPUs or accelerators. Even though that is a large number, it is not the limit. Here we are looking at the fabric that instantiates such a topology. For network engineers this isn't too surprising; in fact, this is a well-known Clos topology with multiple tiers of connectivity. At the very bottom you have your racks; in our case each rack has 16 GPUs in two servers, so effectively every rack has two domains of scale-up bandwidth. Above those racks you have your scale-out fabric; this is where InfiniBand or RoCE comes in. In our example we have a RoCE fabric. As mentioned before, we deploy both RoCE and InfiniBand fabrics, but the RoCE one is more unique: in public you see more examples of InfiniBand than RoCE, and this slide demonstrates the RoCE instance. Notice that in each layer above the racks we have 18 cluster switches. This is important because it gives you additional capacity to protect against failures; as you will see further on, failures and reliability are of utmost importance for these clusters and these designs.
There are a lot of details to implementing RoCE, which Adi will cover separately in his presentation reviewing our RoCE implementation, but I want to stress that this is pushing RoCE close to its limits in very large clusters of thousands of GPUs.
This slide captures what happens inside these fabrics. Just to recap once again: when you train models you generate two types of traffic patterns, one stemming from data parallelism and another from model parallelism. The most challenging part is model parallelism, but before we get there, let's take a look at the data-parallel patterns. There you generate patterns like all-reduce, all-gather, and reduce-scatter. These are well known; they've been familiar for many years to practitioners who train models. The message size here is usually substantial, but it grows smaller and smaller as you increase the size of the scale-out domain, and this is where some of the challenges become more evident: as we'll see later, latency becomes more visible (latency here means propagation latency). Notably, however, data-parallel patterns typically can be overlapped efficiently with compute. It's not universal; in some cases you won't get this for free, and you have to work and optimize the model to achieve efficient overlap. But very often the scale-out part and data parallelism can be well overlapped with compute, which makes them less challenging, so to speak, for networking.
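The shrinking-message effect can be seen in a simple alpha-beta cost model of a ring all-reduce: the bandwidth term stays roughly flat with the global gradient size while the latency term grows linearly in group size. The latency and gradient-size constants below are assumptions for illustration.

```python
# Alpha-beta model of a ring all-reduce over the scale-out domain:
#   t(N) = 2*(N-1)*alpha + 2*(N-1)/N * M/B
# As the data-parallel group N grows (with the same global gradient size M),
# the bandwidth term stays ~flat but the latency term keeps climbing.
alpha = 10e-6        # assumed per-hop latency: switches, fiber, transceivers
B = 50e9             # 50 GB/s scale-out bandwidth (from the talk)
M = 2e9              # assumed 2 GB of gradients to reduce

for N in (64, 1024, 16384):
    t_lat = 2 * (N - 1) * alpha
    t_bw = 2 * (N - 1) / N * M / B
    print(f"N={N:>6}: latency term {t_lat*1e3:7.1f} ms, bandwidth term {t_bw*1e3:5.1f} ms")
# N=    64: latency term     1.3 ms, bandwidth term  78.8 ms
# N=  1024: latency term    20.5 ms, bandwidth term  79.9 ms
# N= 16384: latency term   327.7 ms, bandwidth term  80.0 ms  <- latency exposed
```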
Not so for model parallelism. Model parallelism is the result of slicing the network into pieces and trying to pass activations between those components. There you have your familiar all-reduce or all-to-all patterns, which come from, say, tensor parallelism or pipeline parallelism, and here the bandwidth demand is much, much higher. This is where you really need the scale-up bandwidth to be efficient, because the messages are still pretty large and your demand for bandwidth is 10x, if not more, to realize this parallelism. Most importantly and critically, it is much harder to overlap model-parallel execution with the compute part, so this is where latency and bandwidth are much more important than for data parallelism.
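A rough sketch of where that bandwidth demand comes from: with tensor parallelism, every transformer layer all-reduces activation tensors of size batch x sequence x hidden (roughly twice per layer in the usual Megatron-style split), and this traffic sits on the critical path. All shape numbers here are assumptions.

```python
# Tensor-parallel activation traffic per training step (Megatron-style split):
# each transformer layer performs ~2 all-reduces of the activation tensor.
# Shapes below are illustrative, not Meta's actual configuration.
batch, seq, hidden, layers = 4, 4096, 8192, 80
bytes_per_elem = 2                                  # bf16 activations

act_bytes = batch * seq * hidden * bytes_per_elem   # one activation tensor, ~268 MB
per_step = 2 * layers * act_bytes                   # ~2 all-reduces per layer
print(f"activation all-reduce volume per step ~ {per_step/1e9:.1f} GB")  # ~43 GB

# At scale-up bandwidth (450 GB/s) this is roughly 0.1 s of critical-path
# traffic; at scale-out bandwidth (50 GB/s) it would be ~9x slower, and
# unlike the data-parallel gradient all-reduce it cannot easily hide
# behind compute.
```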
This diagram demonstrates how all these collectives, so to speak, map to the network topology. You have the scale-out collectives (all-reduce, reduce-scatter, and all-gather) which map onto the cluster switches above the racks. This is where you see all these rings, often resulting from reduce-scatter, that span multiple switches and go across all the racks in the topology. For instance, if your training job size is 16k GPUs, you typically see ring sizes of something like 1,000 GPUs spanning across all the switches. This is where latency starts to add up, but this is also where you can still overlap these collectives with the computation.
At the bottom of this tree you see the model-parallel part: all-reduce and other collectives which map to the scale-up domains, for example NVLink interconnects in the case of NVIDIA. However, once you cross a single server, a single board, you have to run this traffic across the scale-out, and this is where you see the impact of the much lower bandwidth. As you will see, this bottleneck dictates the need to grow the scale-up domains beyond one server.
Now let's recap and look at what changes with large clusters. You could say: okay, scale is great, but it looks like the same traffic and the same problems all over again. Well, pretty much so; however, it's important to reiterate that latency starts to become important. What's funny is that in AI training we observed that network latency wasn't as critical as it typically is in HPC applications, mostly because you can overlap collectives with computation. However, with LLM training on very large clusters, you now have machines that span whole buildings, so you have latency from switches, from the fiber, even from the transceivers, and it keeps adding up. As it adds up, it starts to be visible for smaller messages. As mentioned before, as you increase the parallel domain size, your message size decreases, and this is where you start to see the exposed latency; you have to really pay attention and manage it much better.
Now for the second part: reliability. Naturally, as you should expect, as you grow the network size there are more components and more elements, and they fail more frequently. To be fair, most failures happen in the software land, not so much in hardware, but hardware at this scale also exhibits issues. So often, when you bring up systems for the first time, you have to go through a burn-in process: identify bad components, eliminate them, replace them, and so on, and this takes time. The second problem is that in a large system it takes much longer to do fault isolation: you have to track the issue across many more components, which often takes much more time than it does in a smaller setup. All of that adds to training time: more time to debug, less time to run the actual computation.
Finally, the thing Jongsoo mentioned: inference for LLMs is now becoming a networking problem. Previously we could contain inference in a single GPU; you often hear that when you run inference you use only one GPU, or even a single PCIe card. In the LLM case you have two challenges. First of all, these models grow so large that you can't contain them in single-GPU memory or even in single-host memory; you have to go across hosts just to keep the coefficients, optimizer states, and other parameters together. Secondly, you need more compute to achieve the target latency goals. For example, during the prefill stage you need much more compute to achieve, let's say, a latency of one second for the first token for large models and long sequence lengths. If you want to go to sequence lengths of 32k, 64k, or even beyond, you have to go with distributed inference, and that means you now have a mini cluster that implements the forward pass in distributed fashion. You have to run tensor slicing, model parallelism, across multiple systems, and if you are bottlenecked by the scale-out, well, it's your problem to solve, because now you're much, much slower.
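A rough sketch of the prefill-side arithmetic: the FLOPs needed to process a long prompt within a one-second time-to-first-token budget translate directly into a minimum GPU count. The per-GPU throughput and MFU values are assumptions for illustration.

```python
# GPUs needed to hit a 1-second time-to-first-token for a long prompt.
# Prefill cost ~ 2 FLOPs per parameter per prompt token (forward pass only).
# Per-GPU sustained throughput is an assumption (peak * assumed MFU).
params = 70e9
prompt_tokens = 32_768                 # a 32k-token context
ttft_budget = 1.0                      # seconds

prefill_flops = 2 * params * prompt_tokens      # ~4.6e15 FLOPs
per_gpu = 1e15 * 0.4                   # assumed 1 PFLOP/s peak at 40% MFU

gpus_needed = prefill_flops / (per_gpu * ttft_budget)
print(f"minimum GPUs ~ {gpus_needed:.0f}")      # ~11 GPUs: more than one host
```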
As a result of this trend, we foresee that these mini inference clusters are going to grow to 16, 32, or 64 GPUs even in the current generation, and you can extend that trajectory into the future, though I don't think it will go beyond 64 in the next couple of years.
Now, to recap what we covered in this section. Once again, the biggest shift with LLMs was the tremendous increase in computational demand. This dictates everything: as you have seen, higher computation requires larger clusters, larger inference fabrics, and so on. Large clusters bring reliability issues, visibility problems, latency issues, and a varied topological structure that requires optimization. The biggest trend we're seeing is that scale-up connectivity is now supposed to go beyond the rack or beyond the node; this is probably the biggest change we've seen in topologies in the last three or four years, the explosion of bandwidth needed to realize model parallelism. And once again, inference is now a networking problem: you have to run inference across multiple systems, and it becomes like a mini cluster, similar to training but only doing the forward pass, not the backward pass. That's it for my part. Thank you so much for listening; Jongsoo and I can take your questions now.