Overcoming the Complexities of Generative AI

Weights & Biases
17 May 2024 · 24:45

Summary

TLDR: This video introduces NVIDIA NIM, presented jointly by Nvidia and Weights & Biases. NIM is a platform that lets enterprises deploy and optimize AI models quickly, simplifying language-model deployment and supporting it on both the security and performance fronts. The talk traces how AI and enterprise needs have evolved since the launch of ChatGPT in 2022, describes the wave of community models that began in 2023 and the evaluation, optimization, and deployment challenges that came with it, and then shows in detail how the solutions from NVIDIA AI Enterprise and Weights & Biases overcome those challenges to enable effective model deployment.

Takeaways

  • 🚀 Nvidia uses Weights & Biases tools and focuses on optimizing and deploying AI models.
  • 📈 The launch of ChatGPT in 2022 made businesses recognize the potential of generative AI.
  • 🌐 In 2023, community models arrived and were tried against a wide range of use cases.
  • 🛠️ In 2024, the lessons from that experimentation are being carried into real business applications.
  • 🔧 Nvidia has a long history with AI model deployment and inference and works with hyperscalers on model optimization.
  • 💡 Enterprises have limited time and resources for deploying AI models, so Nvidia provides solutions to that problem.
  • 🔄 RAG (Retrieval-Augmented Generation) was introduced, enabling business data to be embedded and answers to be grounded in facts.
  • 🔒 The arrival of community models raised new concerns around security, bug fixes, and code vulnerabilities.
  • 📦 Nvidia NIM (Nvidia Inference Microservice) was developed as a platform that simplifies deploying AI applications in enterprise environments while maintaining control.
  • 🔄 LoRA adapters let the same base model serve different use cases, enabling smaller, more efficient deployments.
  • 🔧 The NeMo framework is provided for evaluation and customization, covering end-to-end data curation, model building, evaluation, and deployment.

Q & A

  • What has Nvidia achieved using Weights & Biases tools?

    -Nvidia has worked on AI deployment and inference for a long time and uses Weights & Biases tools to optimize AI models. As a result, enterprises can deploy models with minimal inference latency in far less time than hyperscalers, who dedicate hundreds of people to this work.

  • What was the major AI milestone of 2022?

    -The key event of 2022 was the launch of ChatGPT. It was the moment businesses recognized they could apply generative AI, and realized that AI is not just about images and video but also about language understanding.

  • How did generative AI experimentation and the arrival of community models affect businesses?

    -With generative AI experimentation and the arrival of community models, businesses realized they could adapt models to their own use cases, and began looking for ways to select, optimize, and deploy them.

  • What kind of platform is Nvidia NIM (NIMs)?

    -Nvidia NIM is a platform that provides the flexibility to deploy generative AI applications anywhere, close to your data, while maintaining control over them. It enables smooth operation in enterprise environments and promotes rapid integration and deployment.

  • How do NIMs simplify the deployment process for enterprise AI applications?

    -NIMs combine a wide range of capabilities, including model optimization, domain-specific code, support for customization, and industry-standard APIs, to simplify the deployment of enterprise AI applications.

  • What is a LoRA adapter and what advantages does it offer?

    -Instead of fine-tuning the full set of model weights, a LoRA adapter optimizes a small set of additional parameters representing a low-rank decomposition of the changes to the network's dense layers. Switching adapters on a single base model changes the model's use case quickly, enabling a smaller deployment that serves many use cases.

  • How does Nvidia ensure security and bug fixes when deploying AI models?

    -Nvidia manages AI model deployment through NIMs and handles security, bug fixes, and code vulnerabilities in response to requirements from IT and lines of business. This lets enterprises focus on their core business rather than spending time on non-core work.

  • How is Weights & Biases integrated with Nvidia AI Enterprise?

    -Weights & Biases integrates with Nvidia AI Enterprise to provide a platform that manages model training, evaluation, and deployment in one place. Developers can easily build, deploy, and manage customized NIMs.

  • What is Nvidia's Model Mondays?

    -Model Mondays is a program in which Nvidia releases new models every Monday. It lets developers try and evaluate the latest models and choose the ones best suited to their business.

  • What are the benefits of experiment tracking and team collaboration on the Weights & Biases platform?

    -The Weights & Biases platform is designed to make experiment tracking and collaboration across teams easy, providing a consistent tracking experience regardless of where an experiment runs. This lets teams work together effectively on a project.

Outlines

00:00

😀 Advances in AI and language understanding, and Nvidia's role

The speaker notes that Nvidia has been using Weights & Biases tools while contributing to the advance of AI, and recalls how the launch of ChatGPT in 2022 raised awareness of applying AI in business. In 2023, many community models appeared and the range of AI applications expanded; in 2024 that experimentation is moving into production. Nvidia has deep experience in AI model inference and deployment and offers solutions that help enterprises deploy the right AI models.

05:01

🛠️ The arrival of Nvidia NIM and deploying AI applications

Nvidia NIM is a platform designed to simplify deploying and controlling AI applications while keeping them close to your data. A NIM provides model optimization, domain-specific code, support for customization, and industry-standard APIs, promoting rapid integration and deployment in enterprise environments. NIMs also build in hardware-specific optimizations for each kind of model, offering both flexibility and security.

10:03

🔧 Customization and performance with LoRA adapters and parallelization strategies

LoRA adapters simplify model customization and make it possible to switch quickly between different use cases. Parallelization strategies matter for large models as well: technologies such as TensorRT split layers and parallelize work across multiple GPUs, balancing performance against total cost of ownership. Accuracy is also validated, so deployed models stay correct.
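The following is a small NumPy sketch of the LoRA idea described above: the base weight matrix W stays fixed while each use case contributes a low-rank update, and switching a use case only means switching which adapter is added at inference time. The shapes, rank, and adapter names are illustrative assumptions, not Nvidia's implementation.

```python
# LoRA in miniature: y = x @ W + (alpha / r) * (x @ A @ B), with swappable adapters per use case.
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, r, alpha = 512, 512, 8, 16        # illustrative sizes and LoRA rank

W = rng.normal(0, 0.02, size=(d_in, d_out))     # frozen base weights (shared by every use case)

def make_adapter():
    """A low-rank pair (A, B) representing one fine-tuned use case."""
    A = rng.normal(0, 0.02, size=(d_in, r))
    B = np.zeros((r, d_out))                    # B starts at zero so the adapter is a no-op before training
    return A, B

adapters = {
    "support_triage": make_adapter(),           # hypothetical use cases
    "doc_summarizer": make_adapter(),
}

def forward(x: np.ndarray, use_case: str) -> np.ndarray:
    A, B = adapters[use_case]
    return x @ W + (alpha / r) * (x @ A @ B)    # base path plus the low-rank correction

x = rng.normal(size=(1, d_in))
y = forward(x, "support_triage")                # switching use case = switching the adapter, W untouched
print(y.shape)
```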

15:03

🔧 The NeMo framework and microservices for customization

The NeMo framework, built in 2018, is an end-to-end framework for data curation, customization, evaluation, and deployment. The Data Curator, Customizer, and Evaluator are provided as microservices to create high-quality data sets and to customize and evaluate models for specific use cases. This lets enterprises adapt models appropriately even when they rely on proprietary languages or APIs.
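As a tiny, CPU-only illustration of the curation steps mentioned above (deduplication and heuristic filtering), the sketch below hashes normalized text for exact dedup and applies a simple length/character heuristic. NeMo's Data Curator does this at much larger scale with GPU acceleration; this is only meant to show the shape of the pipeline.

```python
# Minimal data-curation sketch: exact deduplication plus a heuristic quality filter.
import hashlib

raw_docs = [
    "How do I reset my API key?",
    "How do I reset my API key?",          # exact duplicate
    "asdf ???? ////",                      # low-quality junk
    "Training crashes with CUDA out-of-memory on multi-GPU runs.",
]

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def is_high_quality(text: str) -> bool:
    """Very rough heuristic: long enough and mostly alphabetic characters."""
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / max(len(text), 1)
    return len(text.split()) >= 4 and alpha_ratio > 0.8

seen, curated = set(), []
for doc in raw_docs:
    digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
    if digest in seen or not is_high_quality(doc):
        continue
    seen.add(digest)
    curated.append(doc)

print(curated)  # deduplicated, filtered documents ready for fine-tuning
```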

20:03

🚀 Deploying and automating LLMs with W&B and Nvidia AI Enterprise

Weights & Biases and Nvidia AI Enterprise together support deploying and automating LLMs. In the example shown, a Llama 2 7B model was fine-tuned to prioritize support requests, W&B was used to evaluate model performance, and the model was integrated with Nvidia NIM to produce an optimized container. This removes many of the hurdles developers face when bringing LLMs into enterprise applications and enables fast, efficient deployment.
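A hedged sketch of the W&B side of that workflow is below: log the fine-tuned checkpoint as an artifact and link it into a model registry entry with an alias that downstream automation (such as the NIM compilation job in the demo) can watch for. The project name, file path, registry path, and the "Nim" alias are assumptions based on the demo narration.

```python
# Logging a fine-tuned checkpoint to W&B and promoting it in the model registry (names are assumptions).
import wandb

run = wandb.init(project="support-triage", job_type="fine-tune")

artifact = wandb.Artifact("support-llama-2-7b", type="model")
artifact.add_file("checkpoints/adapter_model.bin")     # assumed path to the fine-tuned weights
run.log_artifact(artifact, aliases=["candidate"])

# Link the version into the team's registered model; adding the "Nim" alias is what
# the demo's automation watches for to kick off NIM compilation on the launch agent.
run.link_artifact(artifact, "model-registry/support llama 2 7b", aliases=["Nim"])

run.finish()
```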


Keywords

💡Nvidia

Nvidia is one of the world's leading makers of graphics processing units (GPUs) and a key company in artificial intelligence. The video describes Nvidia's long history and experience with AI model deployment and inference, and its efforts to make that technology available to other companies.

💡Weights and Biases

Weights & Biases is a toolset for tracking and evaluating machine learning model training. The video emphasizes that combining Weights & Biases with the solutions Nvidia offers enables more efficient development and operation of AI models.

💡Generative AI

Generative AI is the subfield of AI concerned with models that can create new data. The video notes that with the launch of ChatGPT in 2022, AI applications in business expanded through language understanding.

💡Chat GPT

ChatGPT is a language model developed by OpenAI that had a major impact on natural language processing. The video explains that its launch was the trigger for businesses to apply AI, with many companies recognizing the potential of AI that understands language.

💡RAG (Retrieval-Augmented Generation)

RAG is a technique that improves the factual accuracy of language models through embedding and retrieval of factual data. The video describes embedding business data into a vector database, retrieving it, and feeding it to the language model so that answers are grounded in facts.
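As a rough illustration of the RAG flow described above, the sketch below embeds a handful of business documents, retrieves the closest ones for a query, and stuffs them into a prompt. It is a minimal example assuming the sentence-transformers package; the call_llm function is a hypothetical stand-in for whatever LLM endpoint (for example a NIM) is actually used.

```python
# Minimal RAG sketch: embed documents, retrieve by cosine similarity,
# and ground the LLM prompt in the retrieved passages.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

documents = [
    "Our enterprise support SLA promises a first response within 4 hours.",
    "GPU quota increases are handled by the infrastructure team.",
    "Billing questions should be routed to the finance alias.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(documents, normalize_embeddings=True)  # acts as a tiny vector store

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity because vectors are normalized
    return [documents[i] for i in np.argsort(-scores)[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call your hosted LLM (e.g. a NIM endpoint).
    return f"[LLM answer would go here]\n{prompt}"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("Who handles GPU quota requests?"))
```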

💡Community Models

Community models are language models developed and released by the community. The video explains that these models appear frequently and that enterprises need to adapt and optimize them for their own use.

💡Nims (Nvidia Inference Microservice)

NIMs are inference microservices developed by Nvidia, designed so that enterprises can deploy AI applications quickly and efficiently. The video presents NIMs as simplifying the AI model deployment process and smoothing operation in enterprise environments.
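NIMs expose industry-standard APIs, as the talk describes; for LLM NIMs this is commonly an OpenAI-compatible chat-completions endpoint. The sketch below uses the openai Python client; the base URL, model identifier, and API-key environment variable are assumptions to replace with the values for your own deployment or the hosted endpoints on build.nvidia.com.

```python
# Calling an LLM NIM through its OpenAI-compatible API (endpoint details are assumptions).
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",       # assumed: local NIM container endpoint
    api_key=os.environ.get("NIM_API_KEY", "not-needed-for-local"),
)

response = client.chat.completions.create(
    model="meta/llama2-7b",                    # assumed model identifier
    messages=[
        {"role": "system", "content": "You triage support requests."},
        {"role": "user", "content": "Our GPU cluster jobs are failing since the last driver update."},
    ],
    temperature=0.2,
    max_tokens=256,
)
print(response.choices[0].message.content)
```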

💡Quantization

Quantization is a technique for shrinking a model and speeding up inference while minimizing loss of accuracy. The video explains that quantization plays an important role when weighing the trade-off between accuracy and latency in model deployment.
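To make the accuracy/latency trade-off concrete, the toy sketch below symmetrically quantizes a weight tensor to INT8 and measures the reconstruction error, which is the kind of drift the talk says NIM validates against before shipping an FP8/INT8 engine. It uses plain NumPy and is purely illustrative, not how TensorRT performs quantization internally.

```python
# Toy post-training quantization: symmetric INT8 round-trip and its error.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # stand-in for a dense layer

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization to INT8 with a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
mean_abs_err = np.abs(weights - restored).mean()
print(f"scale={scale:.6f}  mean |error|={mean_abs_err:.6e}")
# In practice you would also re-run task evaluations (as NIM does) to confirm
# that this numerical error does not translate into an accuracy drop.
```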

💡Customization

Customization is the process by which an enterprise adapts an AI model to its own needs. The video shows how the NeMo framework handles data curation, customization, evaluation, and deployment, and stresses how important this is for enterprises building and deploying their own language models, especially when they use proprietary languages or APIs.

💡Evaluation

Evaluation is the process of measuring an AI model's performance and confirming that it meets the expected bar. The video shows how the NeMo Evaluator is used to assess and optimize model performance.
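Below is a small, framework-agnostic sketch of the kind of evaluation gate the demo describes, asserting both accuracy and latency before a model is shared with the deployment team. The classify function is a stand-in for a call to the deployed model or NIM, and the thresholds are arbitrary examples.

```python
# Sketch of an accuracy + latency gate for a candidate model (thresholds are examples).
import time

eval_set = [
    ("The app crashes when I upload a file", "bug"),
    ("Please add dark mode", "feature_request"),
    ("Production is down for all users!", "urgent"),
]

def classify(text: str) -> str:
    # Placeholder for a request to the deployed model (e.g. a NIM endpoint).
    return "urgent" if "down" in text.lower() else "bug"

correct, latencies = 0, []
for text, expected in eval_set:
    start = time.perf_counter()
    prediction = classify(text)
    latencies.append(time.perf_counter() - start)
    correct += int(prediction == expected)

accuracy = correct / len(eval_set)
p95_latency = sorted(latencies)[int(0.95 * (len(latencies) - 1))]

assert accuracy >= 0.66, f"accuracy too low: {accuracy:.2f}"
assert p95_latency < 0.5, f"p95 latency too high: {p95_latency * 1000:.1f} ms"
print(f"accuracy={accuracy:.2f}  p95_latency={p95_latency * 1000:.2f} ms")
```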

Highlights

The integration of Nvidia and Weights & Biases tooling is exciting and shows the potential of AI in the enterprise.

The launch of ChatGPT in 2022 marked a breakthrough in AI language understanding.

The rise of community models in 2023 spurred experimentation with and exploration of generative AI.

Nvidia has deep experience in AI deployment and inference and helps enterprises optimize AI models.

The challenge enterprises face is how to optimize, evaluate, and deploy AI models.

Nvidia's NIM platform simplifies the deployment process for enterprise AI applications.

NIMs offer day-zero LLM support, working with model builders to support the latest models.

The NIM platform provides the flexibility to deploy applications anywhere while keeping control of your data.

Inside a NIM are model optimizations, hardware feature exploitation, and domain-specific code.

LoRA adapters allow switching between use cases on a single base model without affecting latency.

NIMs handle parallelization strategies, including pipeline parallelism, tensor parallelism, and expert parallelism.

The NIM platform improves performance through optimizations such as custom multi-head attention and in-flight batching.

NIMs validate model accuracy and performance optimization across different hardware.

Weights & Biases integrates with Nvidia AI Enterprise to simplify deploying and managing custom NIMs.

The NeMo framework is an end-to-end customization and deployment toolkit covering data curation, customization, and evaluation.

Microservices within the NeMo framework, such as the Data Curator and Customizer, support enterprise-grade model customization.

The NeMo Evaluator gives enterprises a comprehensive tool for evaluating and validating model performance.

The demo shows how to use Weights & Biases and Nvidia AI Enterprise to compile and evaluate a custom LLM.

The integration streamlines the entire workflow from model fine-tuning to deployment, improving the efficiency of enterprise AI development.

Transcripts

00:00

All right, thank you for having me. Good morning, everybody. Hopefully you're enjoying the session so far. I'd like to say that I'm really excited and proud to be standing here, because Nvidia has been using Weights & Biases tools for quite a while now, and it's really exciting to be up on stage and talk about the two solutions that we're integrating together. Hopefully you'll be as excited as I am.

00:25

So in 2022, I think we all remember where we were. It's that moment that hit when not just the launch of a new technology exploded onto the landscape, but the recognition of "wow, AI can work for me." We all know what we're talking about: it was the launch of ChatGPT. Even though some of us in the field had been working on generative AI for quite a long time, it was the moment where businesses started to think, "I can use this for my applications," and realized AI isn't just images and video, it's really about language understanding. Then what happened in 2023? Everyone started experimenting with generative AI, trying to figure out which use cases to apply it to. Another thing that happened was the launch of a lot of community models, so the ability to say, "maybe I can adapt this to my use case." And then in 2024, it's reaping the benefits of all of that experimentation and going into production. Now, at Nvidia we've been dealing with AI deployments and inference for a very long time. We work with hyperscalers who employ hundreds of people to optimize AI models and put them out into inference at these latencies. But Enterprises don't have that time and they don't have the resources to do that. So I'm going to walk you through a quick journey about how we got to where we are, why we're building what we're building, and announcing what we're announcing.

02:01

This is the most simple user workflow that you could possibly imagine, the one you started with in 2023: it's really just the user, a prompt, an API to an LLM, and the answer. Everyone thought it was so amazing, and it was, until it wasn't, because then you realize that it's hallucinating. It's blurting out things that aren't factual, it's not talking about my data, it's not even talking in my brand voice, and maybe it's too verbose or too terse. In general, it just wasn't based in facts. So then we added RAG, retrieval-augmented generation, to help with that factual data: embedding your business data, putting it into a vector database, being able to retrieve it and cite it, and then sending that to the LLM to generate the answer. Honestly, I could stand up here for an entire hour talking about embedding, vector search acceleration, and retrieval tactics like reranking, but I'm not going to, so you're welcome. So you added that, and that was great until it wasn't, because then you started to think, "I'm sending all of this data and answers and prompts away to an API, and I did hear about all these community models that came out." With the launch of community models, experimentation proliferated even further. These are just a few; we have a list of hundreds that we're working on. Community models were coming out at first every month, then every week, and now, I bet you're feeling it, models are coming out every day. So how are you evaluating those models, how are you optimizing them, and how are you deploying them?

03:43

The generative AI deployment options are seemingly endless. What are the trade-offs you're making for quantization, reduced precision, latency, compute footprint, the total cost of ownership? Are you trying to deploy on a single GPU with a smaller model but not getting the accuracy? Are you trying a multi-GPU deployment option, or, in some cases, and maybe you're not there yet, we have customers who are deploying multi-node, which means an eight-GPU node, so at least 16 GPUs to serve these models. You have to think about those trade-offs for total cost of ownership. And then once you go to deploy and you feel pretty good that you've figured it all out, your IT and line of business start asking you: is it secure, how do we get bugs fixed, what is this pile of code that you're handing me, are there vulnerabilities in it, how do I deal with the next model, what is the API that you've developed? So basically you start down this rat hole of everything that is not core to your business. You are now spending all this time, and maybe even building up a team, that is not core to your business, and it's enough to drive you crazy. Look at this poor guy, he literally ripped his hair out, and trust me, I think my hair has been falling out over the last few years too.

05:04

So the Nvidia NIM, we call them NIMs. Has anyone here heard of a NIM before? No? Okay, so it originally started as an Nvidia Inference Microservice, but it's just NIM for short, and now NIM is just NIM. It's not redundant, it's not "Nvidia NIM", they're just NIMs, and you're going to get used to it. Everyone's going to start talking about NIMs at this conference, I can feel it. Anyway, we developed this NIM platform to give you the flexibility to deploy your applications anywhere while maintaining control of your generative AI applications and keeping them close to your data. We basically simplified that deployment process for smooth operation in Enterprise environments, which means quicker integration and deployment for your teams. So let's take a look inside a NIM. Before I do, you probably noticed that all those little retrieval icons turned green as well, and I can't stress enough that this is the most simple diagram of a RAG that you can see, because honestly I didn't add guardrails, I didn't add prompt rewriting, query decomposition, query routing. There are many, many LLMs running for every single query, which is why you need the latency for every single LLM model that you have out there.

06:26

Okay, so let's look inside a NIM. It all starts with the model at the heart. We are proud to say that we offer what we call day-zero LLM support, not just for community models. We curate the most important community models that we think matter for Enterprise generative AI applications, but we also work with the world's most important model builders, the people who are putting out new models seemingly daily: the AI21s of the world, Cohere, Google, Meta. We work with them before they even launch a model to make sure that the day they launch that model, we support it in a NIM. Now if I go down to the bottom, we have inference engines. Each model has a different optimization for every SKU, what we call a SKU, or a different set of hardware. Different hardware has different requirements, or different features and functionality that you can exploit. An example would be FP8 in our Hopper generation of GPUs, which wasn't available in our Ampere generation, so being able to understand which hardware you're running on and exploit those features and functionalities.

07:30

We also offer domain-specific code. Every model has pre- and post-processing that happens for its domain, and it's kind of like Amdahl's law: if you optimize the model, the LLM itself, you might get stuck in the bottleneck of the pre- and post-processing. For things like a generative ASR model you might have beam search decoding that you have to accelerate, or different tokenizers for LLMs. All of this domain-specific code gets wrapped into the NIM as well, and sometimes these are models themselves that need to be collocated with the compute hardware, which can add latency. Again, you don't have to worry about that, we handle it inside the NIM.

08:16

We also recognize the importance of customization, which I'll talk about in a bit, so we offer support for LoRA adapters within NIM to be able to serve multiple use cases per base model. We also offer industry-standard APIs, so if you saw that journey at the beginning with that red API you started with, you don't want to have to rewrite or refactor your code, you want to be able to use the same API. Honestly, in some use cases there are no industry-standard APIs; an example would be our generative model that we call Audio2Face, which basically takes ASR and animates a digital avatar's face. There's no industry standard for that one, but we're working in the open to make sure we're getting feedback on those APIs. And again, these are all built on our NVIDIA AI Enterprise blessed containers, so they're robust, they're tested, they're validated. We test them across various environments, and again, we're hybrid cloud, so Azure, AWS, Google Cloud, you can make sure they work across all these different environments. So basically, this is a NIM in a nutshell.

09:25

Let me talk a little bit about some of the core components that I get a lot of questions about. LoRA adapters: today's organizations have multiple use cases that they need to serve or customize for, but they deploy on a limited pool of hardware, and as you start to think about customization use cases, it can quickly explode into needing dozens or hundreds of models. So instead of optimizing all of the weights, the full weights, what we call full fine-tuning of a model, LoRA optimizes a smaller set of additional parameters that represent a low-rank decomposition of the changes in the dense layers of the network, AKA a LoRA adapter; it's so much easier to say LoRA adapter, just go with it. What does that mean for inference? It means you have a single base model and you're able to switch the adapter, and therefore the model's use case, on the fly. So you can have a smaller deployment but deploy many use cases, tens, hundreds, thousands, and it has no impact on the latency of the model at all. We do have customers right now with hundreds of LoRA adapters in production, and they're eagerly thinking about thousands.

10:43

Okay, parallel strategies. I know you think about this every day, you probably wake up and think about parallel strategies, but this can become tricky for larger models. Again, I'm just showing you the technology inside; you don't have to worry about it, the NIM handles it for you. We have what we call TensorRT inside the container, something we've been building since 2015-2016, when our TensorRT version one came out, so it's been around for a while. Things like pipeline parallelism, which is inter-layer: you split contiguous sets of layers across multiple GPUs. Or tensor parallelism, which is intra-layer: you're splitting the individual layers across multiple GPUs. And then what we call expert parallelism: if anyone here has heard about MoE models, or mixture of experts, you split the individual experts so they're stored on different GPUs depending on the size of the expert, and copy the router to each GPU. All of that technology is handled under the covers for you.

11:49

We're also always thinking about performance, things like custom multi-head attention, in-flight batching, quantized KV cache; anything to drive performance actually drives a total reduction in total cost of ownership, and again, all of this is being handled for you. I also talked a little bit about per-GPU SKU and the performance you can get out of it by applying each different performance optimization.

12:22

Now these are just examples of the things that you might think about. If you have a Gemma 7B and you want the latency, maybe you go down to FP8. Well, what NIM does for you is validate the accuracy, because if anyone here has dealt with post-training quantization, you can find that if you go from FP16 to quantized INT8 or FP8, you might have some changes in accuracy. So we validate all these community models to make sure there is no change in accuracy. The Mixtral 8x22B here is served multi-GPU, but I'll give an example of one that's not here: Llama 70B. You can serve that on two GPUs, but what we do is quantize it to FP8 and serve it on eight GPUs, because that adds to the resiliency and the latency of the model and the serving, to meet the latency requirements. These are the types of trade-offs you have to make if you're thinking, "I want a more accurate model, but I want the latency to go down." We have customers who really wanted to deploy on a single GPU, but the larger model with higher accuracy was worth it in the end, because you want that delightful experience and it's got to be accurate, so they moved to the larger GPU footprint. Again, if you make that decision, the NIM is going to be able to handle that for you.

play13:45

able to handle that for

play13:48

you okay so uh I know you got you all

play13:52

were told a URL that you have to

play13:53

remember I'm going to give you another

play13:54

one it's build. nvidia.com and so you

play13:58

can go out and experience all these

play14:00

Community models what that we put out

play14:02

there we have model Mondays so we're we

play14:04

are releasing new models every single

play14:06

Monday and in sometimes in in some cases

play14:09

during the week when that aligns with

play14:10

the model builder launch day and so you

play14:13

can go experience these models we even

play14:16

apis to play with but then you can build

play14:18

deploy and manage with weights and

play14:20

biases which we'll see in a second and

play14:22

then you can deploy

play14:23

anywhere um since we talk we're going to

play14:26

talk about customization I'm going to

play14:27

talk a little bit about um what we call

play14:31

the Nemo framework uh Nemo has nothing

play14:33

to do with the cute little fish it

play14:35

stands for neural modules uh we built

play14:37

this framework in 2018 it was all it is

play14:39

still all open source um and the Nemo

play14:42

team is Nemo platform is used internally

play14:45

for our own um supervised fine tuning

play14:47

reinforcement learning and our model

play14:49

building and we actually use weights and

play14:51

biases internally within the Nemo team

play14:53

so to track experiments and to enable

play14:55

team collaboration and so it doesn't

play14:57

matter where an experiment is being run

play14:58

at whether it's on a internal or

play15:01

collocated cluster at a desktop or in

play15:03

the cloud the same experiences for

play15:05

tracking experiments is um seamless

play15:08

across the team but uh the Nemo is a

play15:11

full end to-end framework for data

play15:12

curation customization and evaluation

play15:15

and then deploying so I'm just going to

play15:16

talk about three microservices really

play15:19

quickly so data curator is a multi-stage

play15:22

creation of high D high quality data

play15:24

sets for um both pre-training and

play15:26

fine-tuning workflows and so things like

play15:29

text reformatting uh D duplication of

play15:31

data we love to accelerate things so

play15:33

everything that we do is accelerated so

play15:34

being able to take that from CPUs of 40

play15:37

hours on a CPU of cleaning data down to

play15:39

three hours on a GPU we also do um

play15:42

document filtering classifier based

play15:44

filtering and multilingual heuristic

play15:47

based filtering lots of filtering and

play15:50

multilingual Downstream task de uh

play15:54

decontamination and so what you do is

play15:55

you take all these and then you blend it

play15:57

together to come up with your special

play15:58

data set plan for the the type of

play16:00

customization that you want to do and

play16:02

then you send that off to um the uh

play16:06

customizer and so instead of talking

play16:08

about features and functionality of the

play16:09

customizer I'm going to give you an

play16:10

example so we have a a model that we

play16:14

built called chip Nemo it's our

play16:16

electronic design automation tooling we

play16:18

did U domain adaptive pre-training and

play16:21

uh fine-tuning on that model and so what

play16:23

we did was we took some Community models

play16:25

and we asked it you know can you uh find

play16:27

the partitions in a design uh write some

play16:30

vivid code and Vivid is our own

play16:31

proprietary coding language now I'm sure

play16:34

a lot of you have dealt with Enterprises

play16:36

who have their own proprietary coding

play16:37

language or their own apis and it's just

play16:40

something that um rag can't handle yet

play16:42

it's fine for documentation but

play16:43

literally can't generate code and so

play16:45

when you ask it to generate this code

play16:48

Vivid is a very common term that's very

play16:50

specific to Nvidia and this is why the

play16:52

importance of customization and so you

play16:54

can see here that a typical La llama

play16:56

model gives you a theoretical paragraph

play16:58

and steps that you can take but it's

play16:59

just like blatantly wrong and you're

play17:01

like fine there's code there's code

play17:03

generation models out there so we did

play17:05

put it into a code generative model

play17:07

which uh is not here in this in this

play17:10

particular deck but what we did is it

play17:11

generated like maybe 20 lines of code

play17:13

that seemed really convincing uh but it

play17:17

was just wrong and so the real line of

play17:19

code was just a single line of code and

play17:21

so this is how you get the partitions in

play17:23

a design and this is again a proprietary

play17:25

language and uh the gold standard that

play17:28

you would normally

play17:29

judge um an llm again so the task is

play17:33

about right there at the level of right

play17:35

below 60% and you can see that using a

play17:38

larger model with domain adaptation gave

play17:41

uh an accuracy much above the gold

play17:44

standard and even better than the

play17:46

smaller model so larger models and

play17:49

domain adaptation are something that you

play17:51

will start thinking about in the future

play17:53

for sure and lastly evaluation how well

play17:57

is my model doing um Enterprises today

play18:00

as I showed have this option of all

play18:01

these different llms um we just saw a

play18:04

demo about evaluation and so it's this

play18:07

the Nemo evaluator is a One-Stop shop

play18:09

for either academic um benchmarking to

play18:12

make sure that you're not drifting from

play18:14

the the base models capabilities when

play18:16

you go and domain adapt it to even being

play18:19

able to use llm as a judge which we saw

play18:21

and so even a a platform for human

play18:23

evaluation and scoring and so when you

play18:27

put all these together the being able to

play18:29

use all these customization

play18:31

microservices and evaluation and create

play18:33

custom Nims and so it seems like a lot

play18:35

right you're like man I just want to be

play18:37

able to be an Enterprise developer and

play18:38

deploy my llm and not have to do

play18:40

anything well luckily uh weights and

play18:42

biases did integrate that into a

play18:44

platform for you to be able to do custom

play18:46

Nims and be able to deploy them so I am

play18:48

going to invite Chris up to Stage you

play18:50

can call him

play18:51

CVP as everyone sees him just go hi CVP

play18:55

and come on up here

19:01

Whoa. So Chris put together this amazing demo, but before we get to the demo: with some of the pain points that I talked about here, did it resonate with you? Do you see your users running into the same types of problems? What's maybe your favorite problem? "Yeah, for sure. I've been deploying web services for like 20 years now, and when you think about an LLM web service, at first it's like, okay, I'll find a GPU and run a Python server somewhere that just calls the model. But it gets very much more complicated than that very quickly. If I want to scale up, I have to batch up the requests; I can't just call the GPU one request at a time, I need to queue them up in memory and send them off. And then when you get into the world of multi-GPU or multi-node, there's distributed computing, a very, very hard computer science problem. So I am incredibly grateful that Nvidia has thought about those things and made it a lot simpler for your average engineer to just get a service deployed."

20:03

Before we roll this exciting demo: how many here have tried deploying their own large language models for Enterprise applications? Show of hands. Is any of this resonating with you, any of these problems? All of it, I just heard. All right, Chris, anything else you want to add? "Yeah, I occasionally get to moonlight as a machine learning engineer, and I recently took a problem that we have at Weights & Biases around our support requests and made a little demo. So I think we're going to go ahead and roll the demo. You can listen to me do an incredible voiceover; if I might say, it's great." Thank you everyone, thank you so much.

20:41

Hello, my name is Chris Van Pelt, and I love to fine-tune LLMs. I'm going to show you how we integrated Weights & Biases with Nvidia AI Enterprise to compile and evaluate a custom LLM for a use case that we've been working on here at W&B. We get a lot of support requests from our users, and we need to quickly triage and prioritize them so the most urgent requests get addressed as soon as possible. We created a data set of support requests and their associated tags, such as priority, whether it's a bug or a feature request, and what part of our application the request is relevant to. The first step in our journey is to fine-tune an existing language model for our specific task using this data set; we chose the Llama 2 7B foundation model. This dashboard is showing some of the experiments that we ran. From here we can quickly find the best-performing model to advance to the next step. Because we've instrumented all of our code with W&B, we're able to see the full lineage of data sets and experiments used to produce our fine-tuned model. Once we've chosen a model to deploy, we can promote it to our model registry. I created a registered model named "support llama 2 7b" for our use case; this is now our team's source of truth for all of the models that are showing the most promise.

22:06

At this point we might think the hard part is over. In reality, we've just gotten started. There are still so many unanswered questions: how do we make our model run fast, how can we be sure our service is using updated packages free from vulnerabilities, how do we rate-limit and monitor service usage? That's where NIM comes in. We've integrated our model registry with Nvidia NIM. A NIM is a secure and highly optimized container for deployment; you can think of it almost like an operating system. A NIM can securely execute gen AI apps that generate text, images, audio, or video. Custom NIM models must be compiled for the specific accelerator they're meant to run on. I've started a W&B launch agent on an A100 with 80 GB of RAM, which will act as our compiler. A launch agent is the equivalent of a CI/CD runner: it's constantly phoning home asking if there's work to be done, and then executes that work when it's asked to. We can manually trigger model compilation from the registry UI, or we can integrate with our existing CI/CD system to automatically run compilation when an alias has been added. We've configured our automation to perform the compilation when the "Nim" alias is added to a specific model version. I can add the alias via the UI or do the same programmatically. Once I do, the agent picks up the task and optimizes our model for the exact accelerator we want to deploy to. After the model has been converted, it's saved as an artifact in W&B as the final output of our pipeline.

23:42

Now that we have a NIM-optimized model, it's important to validate performance before we ship it to users. Our launch agent automatically runs the NIM service so we can test it. We've set up an evaluation that asserts both the accuracy and latency metrics we care about for our specific use case. We can verify that our model's performance is in line with our expectations and then share this newly created NIM artifact with the deployment team. All of these steps have been captured in a central system of record, and many of the steps have been completely automated. This gives teams peace of mind and the ability to audit, debug, and optimize every component of their LLM Ops pipeline. Because the final asset is wrapped in a NIM container, we can rest assured it will be vulnerability-free and performance-tuned for the exact accelerator we're deploying to. That's W&B and Nvidia AI Enterprise. If you're interested in joining the Early Access program, sign up at wb.Nim.
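For the "add the alias programmatically" step Chris mentions, a hedged sketch using the public wandb API is shown below; the entity, project, artifact name and version, and the exact "Nim" alias string are assumptions taken from the demo narration rather than documented values.

```python
# Programmatically adding the alias that triggers NIM compilation (names are assumptions).
import wandb

api = wandb.Api()

# Fetch the registered model version we want to compile.
artifact = api.artifact("my-team/support-triage/support-llama-2-7b:v3", type="model")

# Appending the alias is what the configured W&B automation watches for;
# the launch agent then picks up the compilation job for the target accelerator.
if "Nim" not in artifact.aliases:
    artifact.aliases.append("Nim")
    artifact.save()
```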


Related Tags
Nvidia, Weights & Biases, AI deployment, language understanding, ChatGPT, generative AI, enterprise, model optimization, hybrid cloud, developer tools