Overcoming the Complexities of Generative AI
Summary
TLDR This video presents NVIDIA NIM (NVIDIA Inference Microservices) and its integration with Weights & Biases. NIM is a platform that lets enterprises deploy and optimize AI models quickly, simplifying LLM deployment while covering both security and performance. The talk traces how AI has evolved since the launch of ChatGPT in 2022 and how enterprise needs changed with it, covers the wave of community models that began in 2023 and the evaluation, optimization, and deployment challenges they created, and explains in detail how NVIDIA AI Enterprise together with Weights & Biases overcomes those challenges to deliver effective model deployment.
Takeaways
- 🚀 NVIDIA has been using Weights & Biases tools and focuses on optimizing and deploying AI models.
- 📈 The launch of ChatGPT in 2022 made businesses recognize the potential of generative AI.
- 🌐 Community models appeared in 2023 and were tried against a wide range of use cases.
- 🛠️ In 2024, the lessons from that experimentation are moving into real business applications.
- 🔧 NVIDIA has a long history in AI model deployment and inference and works with hyperscalers on model optimization.
- 💡 Enterprises have limited time and resources for model deployment, so NVIDIA provides solutions to that problem.
- 🔄 RAG (Retrieval-Augmented Generation) was introduced to embed business data and ground answers in facts.
- 🔒 The arrival of community models created new concerns around security, bug fixes, and code vulnerabilities.
- 📦 NVIDIA NIM (NVIDIA Inference Microservice) was built to simplify deploying AI applications in enterprise environments while maintaining control.
- 🔄 LoRA adapters let the same base model serve different use cases, enabling smaller, more efficient deployments.
- 🔧 The NeMo framework supports evaluation and customization, providing end-to-end data curation, model building, evaluation, and deployment.
Q & A
What has NVIDIA achieved using Weights & Biases tools?
-NVIDIA has worked on AI deployment and inference for a long time and uses Weights & Biases tools to optimize AI models. This lets enterprises, which lack the hundreds of engineers that hyperscalers dedicate to optimization, deploy models quickly and keep inference latency to a minimum.
What was the important AI event of 2022?
-The launch of ChatGPT. It was the moment businesses recognized that generative AI could be applied to their own problems, and that AI is not only about images and video but about language understanding.
How did generative AI experimentation and the arrival of community models affect businesses?
-Businesses realized they could adapt models to their own use cases, and a broad effort began to figure out how to select, optimize, and deploy models.
What kind of platform is NVIDIA NIM (NIMs)?
-NVIDIA NIM is a platform that provides the flexibility to deploy generative AI applications anywhere while keeping control of them and keeping them close to your data. It enables smooth operation in enterprise environments and speeds up integration and deployment.
How does NIM simplify the enterprise deployment process for AI applications?
-NIM combines model optimization, domain-specific code, support for custom functionality, and industry-standard APIs to simplify deploying AI applications in the enterprise.
What is a LoRA adapter, and what are its benefits?
-Instead of optimizing the full set of model weights, a LoRA adapter optimizes a small set of additional parameters representing a low-rank decomposition of the changes to the dense layers. Swapping adapters on a single base model lets you change the model's use case quickly, so a smaller deployment can serve many use cases.
How does NVIDIA ensure security and bug fixes when deploying AI models?
-NVIDIA manages model deployment through NIM, handling security, bug fixes, and code vulnerabilities in response to IT and line-of-business requirements. This lets enterprises focus on their core business instead of spending time on non-core work.
How is Weights & Biases integrated with NVIDIA AI Enterprise?
-Weights & Biases integrates with NVIDIA AI Enterprise to provide a single platform for managing model training, evaluation, and deployment, so developers can easily deploy and manage customized NIMs.
What is NVIDIA AI Enterprise's Model Mondays?
-Model Mondays is NVIDIA's program of releasing new models every Monday, so developers can try and evaluate the latest models and pick the ones that fit their business.
What are the benefits of experiment tracking and team collaboration on the Weights & Biases platform?
-The platform is designed to make experiment tracking and cross-team collaboration easy, providing a consistent tracking experience regardless of where an experiment runs, so teams can work together effectively.
Outlines
😀 Advances in AI and language understanding, and NVIDIA's role
The speaker explains that NVIDIA uses Weights & Biases tools and contributes to AI development, recalling that the launch of ChatGPT in 2022 made businesses aware that AI could be applied to their work. In 2023 many community models appeared and AI's application areas expanded; in 2024 those experiments are moving into production. NVIDIA has deep experience in model inference and deployment and offers solutions that help enterprises deploy the best AI models.
🛠️ The arrival of NVIDIA NIM and deploying AI applications
NVIDIA NIM is a platform built to simplify deploying and controlling AI applications while keeping them close to your data. NIM provides model optimization, domain-specific code, custom functionality, and industry-standard APIs, speeding integration and deployment in enterprise environments. NIM also ships with hardware-specific optimizations for each model, providing flexibility and security.
🔧 Customization and performance with LoRA adapters and parallelization strategies
LoRA adapters simplify model customization and allow fast switching between different use cases. Parallelization strategies work even for large models, using technologies such as TensorRT to split layers and parallelize work across multiple GPUs, balancing performance against total cost of ownership. Accuracy validation is also performed so that deployed models remain correct.
🔧 The NeMo framework and microservices for customization
The NeMo framework, built in 2018, is an end-to-end framework for data curation, customization, evaluation, and deployment. The data curator, customizer, and evaluator are provided as microservices for building high-quality datasets and customizing and evaluating models for specific use cases, so even enterprises with their own proprietary languages or APIs can customize models appropriately.
🚀 Deploying and automating LLMs with W&B and NVIDIA AI Enterprise
Weights & Biases and NVIDIA AI Enterprise together support LLM deployment and automation. As an example, a Llama 2 7B model was fine-tuned to prioritize support requests, W&B was used to evaluate model performance, and the model was integrated with NVIDIA NIM to produce an optimized container. This solves many of the challenges developers face when deploying LLMs in enterprise applications and enables fast, efficient deployment.
Keywords
💡Nvidia
💡Weights and Biases
💡Generative AI
💡ChatGPT
💡RAG (Retrieval-Augmented Generation)
💡Community Models
💡Nims (Nvidia Inference Microservice)
💡Quantization
💡Customization
💡Evaluation
Highlights
The integration of NVIDIA and Weights & Biases tools is exciting and shows the potential of AI in the enterprise.
The launch of ChatGPT in 2022 marked a breakthrough for AI in language understanding.
The rise of community models in 2023 drove experimentation with and exploration of generative AI.
NVIDIA has deep experience in AI deployment and inference, helping enterprises optimize AI models.
The challenge enterprises face is how to optimize, evaluate, and deploy AI models.
NVIDIA's NIM platform simplifies the deployment process for enterprise AI applications.
NIM offers day-zero LLM support, working with model builders to support the newest models.
The NIM platform provides the flexibility to deploy applications anywhere while keeping control of your data.
Inside a NIM are model optimization, exploitation of hardware features, and domain-specific code.
LoRA adapters allow switching between use cases on a single base model without impacting latency.
NIM handles parallelism strategies, including pipeline, tensor, and expert parallelism.
The NIM platform improves performance through optimizations such as custom multi-head attention and in-flight batching.
NIM validates model accuracy and performance optimization across different hardware.
Weights & Biases integrates with NVIDIA AI Enterprise, simplifying the deployment and management of custom NIMs.
The NeMo framework is an end-to-end customization and deployment tool supporting data curation, customization, and evaluation.
Microservices inside the NeMo framework, such as the data curator and customizer, support enterprise-grade model customization.
The NeMo evaluator gives enterprises a comprehensive tool for evaluating and validating model performance.
The demo shows how to use Weights & Biases and NVIDIA AI Enterprise to compile and evaluate a custom LLM.
The integration streamlines the entire flow from fine-tuning to deployment, improving development efficiency for enterprise AI applications.
Transcripts
All right, thank you for having me. Good morning, everybody; hopefully you're enjoying the session so far. I'd like to say that I'm really excited and proud to be standing here, because NVIDIA has been using Weights & Biases tools for quite a while now, and it's really exciting to be up on stage and talk about the fact that we have two solutions that we're integrating together. So hopefully you'll be as excited as I am.

OK, so in 2022, I think we all remember where we were. It was that moment when not just the launch of a new technology exploded onto the landscape, but the recognition hit: wow, AI can work for me. We all know what we're talking about: it was the launch of ChatGPT. Even though some of us in the field had been working on generative AI for quite a long time, it was the moment where businesses started to think, "I can use this for my applications," and realized that AI isn't just images and video; it's really about language understanding.

Then what happened in 2023? Everyone started experimenting with generative AI, trying to figure out which use cases to apply it to. Another thing that happened was the launch of a lot of community models, so the ability to say, "maybe I can adapt this to my use case." And in 2024, it's about reaping the benefits of all of that experimentation and going into production.

Now, at NVIDIA we've been dealing with AI deployments and inference for a very long time. We work with hyperscalers who employ hundreds of people to optimize AI models and put them into inference at these latencies. But enterprises don't have that time, and they don't have the resources to do that. So I'm going to walk you through a quick journey of how we got to where we are, why we're building what we're building, and announcing what we're announcing.
OK, so this is the most simple user workflow that you could possibly imagine, the one you started with in 2023: just the user, a prompt, an API to an LLM, and the answer. Everyone thought it was amazing, and it was, until it wasn't, because then you realize that it's hallucinating. It's blurting out things that aren't factual, it's not talking about my data, it's not even talking in my brand voice, and maybe it's too verbose or too terse. In general, it just wasn't grounded in facts.

So then we added RAG, retrieval-augmented generation, to help with that factual data: embedding your business data, putting it into a vector database, being able to retrieve it and cite it, and then sending that to the LLM to generate the answer. Honestly, I could stand up here for an entire hour talking about embedding, vector search acceleration, and retrieval tactics like reranking, but I'm not going to, so you're welcome.
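As a rough illustration of the retrieve-then-generate loop described here, the sketch below shows the idea in miniature; the `embed`, `vector_db`, and `llm` objects are hypothetical placeholders, not a specific NVIDIA or W&B API.

```python
# Minimal RAG sketch (hypothetical helpers, not a specific vendor API).
def answer_with_rag(question, embed, vector_db, llm, top_k=4):
    # 1. Embed the user question into the same space as the business data.
    query_vector = embed(question)

    # 2. Retrieve the most relevant chunks from the vector database.
    chunks = vector_db.search(query_vector, top_k=top_k)

    # 3. Build a grounded prompt that cites the retrieved passages.
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 4. Let the LLM generate the final, citable answer.
    return llm.generate(prompt)
```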
But so you added that, and that was great until it wasn't, because then you started to think: I'm sending all of this data, these answers and prompts, away to an API, and I did hear about all these community models that came out. With the launch of community models, experimentation proliferated even further. These are just a few; we have a list of hundreds that we're working on. Community models were coming at first every month, then every week, and now, I bet you're feeling it, models are coming out every day. So how are you evaluating those models, how are you optimizing them, and how are you deploying them?
So the generative AI deployment options are seemingly endless. What are the trade-offs you're making for quantization, reduced precision, latency, compute footprint, and total cost of ownership? Are you trying to deploy on a single GPU with a smaller model but not getting the accuracy? Are you trying a multi-GPU deployment? In some cases (maybe you're not there yet) we have customers deploying multi-node, meaning eight GPUs in a node, so at least 16 GPUs to serve these models. You have to think about those trade-offs for total cost of ownership.

Then once you go to deploy, and you feel pretty good that you've figured it all out, your IT and line of business start asking: is it secure? How do we get bugs fixed? What is this pile of code that you're handing me, and are there vulnerabilities in it? How do I deal with the next model? What is the API that you've developed and are using? Basically, you start down this rat hole of everything that is not core to your business. You are now spending all this time, and maybe even building up a team, that is not core to your business, and it's enough to drive you crazy. Look at this poor guy, he literally ripped his hair out, and trust me, I think my hair has been falling out over the last few years too.
So, the NVIDIA NIM; we call them NIMs. Has anyone here heard of a NIM before? No? OK. It originally started as what we thought of as an NVIDIA Inference Microservice, but it's just NIM for short, and now NIM is just NIM, so it's not redundant; it's not "NVIDIA NIM," they're just NIMs. You're going to get used to it; everyone's going to start talking about NIMs at this conference, I can feel it. Anyway, we developed this NIM platform to give you the flexibility to deploy your applications anywhere while maintaining control of your generative AI applications and keeping them close to your data. We've basically simplified that deployment process for smooth operation in enterprise environments, which means quicker integration and deployment for your teams.

Before we look inside a NIM, you probably noticed that all those little retrieval icons turned green as well. And I can't stress enough that this is the most simple diagram of a RAG that you can see, because honestly I didn't add guardrails, prompt rewriting, query decomposition, or query routing. There are many, many LLMs running for every single query, which is why you need to have the latency for every single LLM model that you have out there.
So, inside a NIM: it all starts with the model at the heart. We're proud to say we offer what we call day-zero LLM support. Not just for community models, where we curate the ones we think are most important for enterprise generative AI applications, but we also work with the world's most important model builders, the people who are putting out new models seemingly daily: the AI21s of the world, Cohere, Google, Meta. We're working with them before they even launch a model, to make sure that the day they launch it, we support it in a NIM.

If I go down to the bottom: inference engines. Each model has a different optimization for every SKU, as we call it, or a different set of hardware. Different hardware has different requirements, or different features that you can exploit. An example would be FP8 in our Hopper generation of GPUs, which wasn't available in our Ampere generation; so it's about understanding which hardware you're running on and exploiting those features and functionality.

We also offer domain-specific code. Every model has pre- and post-processing that happens for its domain, and it's kind of like Amdahl's law: if you optimize the model, the LLM itself, you might be stuck in the bottleneck of the pre- and post-processing. For things like an ASR model, you might have beam search decoding that you have to accelerate, or different tokenizers for LLMs, so all of this domain-specific code gets wrapped into the NIM as well. Sometimes these are models themselves that need to be collocated with the compute hardware, which can add latency; again, you don't have to worry about that, we handle it inside the NIM.

We also recognize the importance of customization, which I'll talk about in a bit, so we offer support for LoRA adapters within NIM to serve multiple use cases per base model. We also offer industry-standard APIs: if you saw that journey at the beginning, with that red API you started with, you don't want to have to rewrite or refactor your code; you want to keep using the same API. Honestly, in some use cases there are no industry-standard APIs. An example would be our generative model we call Audio2Face, which basically takes ASR and animates a digital avatar's face; there's no industry standard for that one, but we're working in the open to make sure we're getting feedback on those APIs.
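For the LLM case, the industry-standard API is typically the OpenAI-style chat completions interface, so existing client code can usually be pointed at a deployed NIM by changing the base URL. A hedged sketch follows; the host, port, and model name are assumptions for illustration.

```python
# Sketch: calling a locally deployed LLM NIM through an OpenAI-compatible API.
# The base URL, port, and model identifier are placeholders for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    api_key="not-used-locally",
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",       # example model identifier
    messages=[{"role": "user", "content": "Summarize our support policy."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```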
Again, these are all built on our NVIDIA AI Enterprise blessed containers, so they're robust, tested, and validated. We test them across various environments; again, we're hybrid cloud, so Azure, AWS, Google Cloud, to make sure they work across all these different environments. And that, basically, is a NIM in a nutshell.

Let me talk a little bit about some of the core components that I get a lot of frequently asked questions about.
First, LoRA adapters. Today's organizations have multiple use cases that they need to serve or customize for, but they deploy on a limited pool of hardware, and as you start to think about customization, the use cases can quickly explode into needing dozens or hundreds of models. Instead of optimizing all of the weights, the full weights, what we call full fine-tuning of a model, LoRA optimizes a smaller subset of additional parameters that represents a low-rank decomposition of the changes in the dense layers of the network, aka a LoRA adapter (it's so much easier to say "LoRA adapter," just go with it). What does that mean for inference? It means you have a single base model and you're able to switch the adapter, and therefore the model's use case, on the fly. So you can have a smaller deployment but serve many use cases, tens, hundreds, thousands, and it has no impact on the latency of the model at all. We have customers right now with hundreds of LoRA adapters in production, and they're eagerly thinking about thousands.
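To make the "low-rank decomposition of the changes" concrete, here is a minimal sketch of the idea, not NVIDIA's implementation: the frozen dense weight W is augmented with a trainable product B·A whose rank r is much smaller than the layer dimensions, and swapping (A, B) pairs swaps the use case.

```python
# Minimal LoRA idea in NumPy: y = x @ (W + B @ A), with rank r << d.
# Illustrative only; real serving stacks fuse this far more efficiently.
import numpy as np

d_in, d_out, r = 4096, 4096, 16           # base layer dims and adapter rank
W = np.random.randn(d_in, d_out) * 0.02   # frozen base weights (not trained)

def make_adapter(rank=r):
    """One use case = one small (A, B) pair, ~2*d*r params instead of d*d."""
    A = np.random.randn(rank, d_out) * 0.01
    B = np.zeros((d_in, rank))             # B starts at zero so W is unchanged
    return A, B

def forward(x, adapter):
    A, B = adapter
    return x @ W + x @ B @ A               # base path + low-rank update

support_adapter = make_adapter()           # e.g. support-ticket triage
legal_adapter = make_adapter()             # e.g. contract summarization

x = np.random.randn(1, d_in)
y_support = forward(x, support_adapter)    # switch use case by switching adapter
y_legal = forward(x, legal_adapter)
```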
OK, parallel strategies. I know you think about this every day; you probably wake up thinking about parallel strategies. This can become tricky for larger models. Again, I'm just showing you the technology inside; you don't have to worry about it, the NIM handles it for you. We have what we call TensorRT inside the container; this is something we've been building since 2015-2016, when TensorRT version 1 came out, so it's been around for a while. So, things like pipeline parallelism, which is inter-layer: you split contiguous sets of layers across multiple GPUs. Tensor parallelism, which is intra-layer: you split the individual layers across multiple GPUs. And then what we call expert parallelism: if anyone here has heard about MoE models, mixture of experts, you split the individual experts so they're stored on different GPUs depending on the size of the expert, and copy the router to each GPU. All of that technology is handled under the covers for you.
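As a toy illustration of the intra-layer (tensor-parallel) split just described, the sketch below partitions a linear layer's weight matrix column-wise across two simulated "GPUs" and stitches the partial outputs back together; this is conceptual only, not how TensorRT-LLM implements it.

```python
# Toy tensor parallelism: split a linear layer column-wise across two "GPUs".
# Conceptual sketch only; a real runtime overlaps compute and communication.
import numpy as np

d_in, d_out = 1024, 2048
W = np.random.randn(d_in, d_out)
x = np.random.randn(8, d_in)               # a batch of activations

# Each "GPU" holds half of the output columns.
W_gpu0, W_gpu1 = np.split(W, 2, axis=1)

y_gpu0 = x @ W_gpu0                         # computed on device 0
y_gpu1 = x @ W_gpu1                         # computed on device 1

# An all-gather stitches the shards back into the full activation.
y_parallel = np.concatenate([y_gpu0, y_gpu1], axis=1)

assert np.allclose(y_parallel, x @ W)       # same result as the unsplit layer
```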
We're also always thinking about performance: things like custom multi-head attention, in-flight batching, and a quantized KV cache. Anything that drives performance also drives a reduction in total cost of ownership, and again, all of this is being handled for you. I talked a little bit about per-GPU SKU performance and what you can get out of each of those different optimizations.

Now, these are just examples of the things you might think about. If you have a Gemma 7B and you want the latency, maybe you go down to FP8. Well, what NIM does for you is validate the accuracy, because if anyone here has dealt with post-training quantization, you know that going from FP16 to quantized INT8 or FP8 can change accuracy. So we validate all these community models to make sure there is no change in accuracy. The Mixtral 8x22B here is served multi-GPU, but I'll give an example of one that's not here: Llama 70B. You can serve that on two GPUs, but what we actually do is quantize it to FP8 and serve it on eight GPUs, because that adds to the resiliency and improves the latency of the model and the serving, to meet the latency requirements. These are the types of trade-offs you have to make when you want a more accurate model but also want the latency to go down. We have customers who really wanted to deploy on a single GPU, but the larger model with higher accuracy was worth it in the end, because you want that delightful experience and it has to be accurate, so they moved to the larger GPU footprint. Again, if you make that decision, the NIM is going to be able to handle it for you.
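The accuracy-validation step described here amounts to comparing a quantized model against its full-precision reference on an evaluation set. Below is a minimal sketch of that idea; the model objects, eval data, and tolerance are assumptions, not NVIDIA's validation pipeline.

```python
# Sketch: check that a quantized model has not drifted from the FP16 reference.
# `fp16_model`, `fp8_model`, and `eval_set` are hypothetical placeholders.
def exact_match_accuracy(model, eval_set):
    correct = 0
    for example in eval_set:
        prediction = model.generate(example.prompt)
        correct += int(prediction.strip() == example.answer.strip())
    return correct / len(eval_set)

def validate_quantization(fp16_model, fp8_model, eval_set, max_drop=0.005):
    baseline = exact_match_accuracy(fp16_model, eval_set)
    quantized = exact_match_accuracy(fp8_model, eval_set)
    drop = baseline - quantized
    assert drop <= max_drop, f"accuracy dropped {drop:.3%} after quantization"
    return baseline, quantized
```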
OK, so I know you were all told a URL that you have to remember; I'm going to give you another one: build.nvidia.com. You can go out and experience all these community models that we put out there. We have Model Mondays: we release new models every single Monday, and sometimes during the week when that aligns with a model builder's launch day. You can go experience these models, we even have APIs to play with, then you can build, deploy, and manage with Weights & Biases, which we'll see in a second, and then you can deploy anywhere.
Since we're going to talk about customization, let me talk a little bit about what we call the NeMo framework. NeMo has nothing to do with the cute little fish; it stands for Neural Modules. We built this framework in 2018 and it is still all open source. The NeMo platform is used internally for our own supervised fine-tuning, reinforcement learning, and model building, and we actually use Weights & Biases internally within the NeMo team to track experiments and enable team collaboration. It doesn't matter where an experiment is being run, whether it's on an internal or collocated cluster, at a desktop, or in the cloud: the same experiment-tracking experience is seamless across the team.
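For readers who haven't used it, the experiment-tracking pattern being described is essentially the following; the project name, config values, and metric names are illustrative only.

```python
# Minimal Weights & Biases experiment-tracking sketch (illustrative names).
import wandb

run = wandb.init(
    project="nemo-finetuning",             # hypothetical project name
    config={"base_model": "llama-2-7b", "lr": 2e-5, "epochs": 3},
)

for step in range(100):
    train_loss = 1.0 / (step + 1)          # stand-in for a real training loop
    wandb.log({"train/loss": train_loss, "step": step})

run.finish()                               # the run is now visible to the whole team
```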
NeMo is a full end-to-end framework for data curation, customization, evaluation, and then deployment. I'm just going to talk about three microservices really quickly. The data curator does multi-stage creation of high-quality datasets for both pre-training and fine-tuning workflows: things like text reformatting and deduplication of data. We love to accelerate things, so everything we do is accelerated; think of taking data cleaning from 40 hours on a CPU down to three hours on a GPU. We also do document filtering, classifier-based filtering, multilingual heuristic-based filtering (lots of filtering), and multilingual downstream-task decontamination. You take all of these, blend them together to come up with your dataset plan for the type of customization you want to do, and then you send that off to the Customizer.
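The stages named above compose into a simple pipeline. Here is a deliberately tiny, CPU-only sketch of the dedup-filter-blend idea; the heuristics, thresholds, and data are made up for illustration and are not the NeMo data curator API.

```python
# Toy curation pipeline: exact dedup + simple heuristic filters, then blend.
# Illustrative only; the real NeMo data curator is GPU-accelerated and far richer.
def deduplicate(docs):
    seen, unique = set(), []
    for doc in docs:
        key = doc.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def heuristic_filter(docs, min_words=20, max_words=2000):
    return [d for d in docs if min_words <= len(d.split()) <= max_words]

def blend(*sources, weights=None):
    """Combine several curated sources into one fine-tuning dataset plan."""
    weights = weights or [1] * len(sources)
    blended = []
    for source, w in zip(sources, weights):
        blended.extend(source * w)          # naive up-weighting by repetition
    return blended

web_docs = heuristic_filter(deduplicate(["some raw web text about GPUs"] * 3), min_words=3)
internal_docs = heuristic_filter(deduplicate(["an internal design document"]), min_words=3)
dataset = blend(web_docs, internal_docs, weights=[1, 4])
```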
Instead of talking about the features and functionality of the Customizer, I'm going to give you an example. We built a model called ChipNeMo for our electronic design automation tooling; we did domain-adaptive pre-training and fine-tuning on that model. We took some community models and asked them: can you find the partitions in a design, can you write some Vivid code? Vivid is our own proprietary coding language. I'm sure a lot of you have dealt with enterprises that have their own proprietary coding language or their own APIs, and it's just something that RAG can't handle yet: it's fine for documentation, but it literally can't generate the code. When you ask it to generate this code (Vivid is a very common term that's very specific to NVIDIA), you see why customization matters. A typical Llama model gives you a theoretical paragraph and steps you can take, but it's just blatantly wrong. And you think, fine, there are code-generation models out there, so we did put it into a code-generation model (which isn't in this particular deck), and it generated maybe 20 lines of code that seemed really convincing, but it was just wrong. The real answer was a single line of code; that is how you get the partitions in a design, and this is, again, a proprietary language. Against the gold standard you would normally judge an LLM on, the task accuracy sits right below 60%, and you can see that using a larger model with domain adaptation gave an accuracy well above the gold standard, and even better than the smaller model. So larger models and domain adaptation are something you will start thinking about in the future, for sure.
And lastly, evaluation: how well is my model doing? Enterprises today, as I showed, have this menu of all these different LLMs, and we just saw a demo about evaluation. The NeMo Evaluator is a one-stop shop: academic benchmarking to make sure you're not drifting from the base model's capabilities when you domain-adapt it, using an LLM as a judge (which we saw), and even a platform for human evaluation and scoring.
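The LLM-as-a-judge pattern mentioned here is easy to sketch: a strong model scores a candidate answer against a reference using a rubric prompt. The client, judge model, and rubric below are illustrative assumptions, not the NeMo Evaluator API.

```python
# Sketch of LLM-as-a-judge scoring; client, model, and rubric are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def judge(question, reference, candidate, judge_model="meta/llama3-70b-instruct"):
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        max_tokens=4,
    )
    return int(response.choices[0].message.content.strip())
```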
When you put all these together, you can use all these customization and evaluation microservices to create custom NIMs. It seems like a lot, right? You're thinking: I just want to be an enterprise developer, deploy my LLM, and not have to do any of this. Well, luckily, Weights & Biases integrated all of that into a platform so you can build custom NIMs and deploy them. So I'm going to invite Chris up to the stage; you can call him CVP, as everyone does, so just say "hi CVP" and come on up here.
So, Chris put together this amazing demo, but before we get to it: with some of the pain points I talked about here, did it resonate with you? Do you see your users running into the same types of problems? What's maybe your favorite problem?

Yeah, for sure. I've been deploying web services for about 20 years now, and when you think about an LLM web service, at first it's like, OK, I'll find a GPU and run a Python server somewhere that just calls the model. But it gets much more complicated than that very quickly. If I want to scale up, I have to batch up the requests; I can't just call the GPU one request at a time, I need to queue them up in memory and send them off. And then when you get into the world of multi-GPU or multi-node, there's distributed computing, and it's a very, very hard computer science problem. So I am incredibly grateful that NVIDIA has thought about those things and made it a lot simpler for your average engineer to just get a service deployed.

Before we roll this exciting demo: how many here have tried deploying their own large language models for enterprise applications? Show of hands. Is any of this resonating with you, any of these problems? All of it, I just heard. All right, Chris, anything else you want to add?

No, although I occasionally get to moonlight as a machine learning engineer, and I recently took a problem we have at Weights & Biases around our support requests and made a little demo. So I think we're going to go ahead and roll the demo; you can listen to me do an incredible voiceover, which, if I might say, is great. Thank you, everyone, thank you so much.
Hello, my name is Chris Van Pelt, and I love to fine-tune LLMs. I'm going to show you how we integrated Weights & Biases with NVIDIA AI Enterprise to compile and evaluate a custom LLM for a use case we've been working on here at W&B. We get a lot of support requests from our users, and we need to quickly triage and prioritize them so the most urgent requests are addressed as soon as possible. We created a dataset of support requests and their associated tags, such as priority, whether it's a bug or a feature request, and what part of our application the request is actually relevant to.

The first step in our journey is to fine-tune an existing language model for our specific task using this dataset; we chose the Llama 2 7B foundation model. This dashboard shows some of the experiments we ran; from here we can quickly find the best-performing model to advance to the next step. Because we've instrumented all of our code with W&B, we're able to see the full lineage of datasets and experiments used to produce our fine-tuned model. Once we've chosen a model to deploy, we can promote it to our model registry. I created a registered model named "support llama 2 7b" for our use case; this is now our team's source of truth for all of the models that are showing the most promise.
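Promoting a run's model artifact to the registry is typically a single call in the W&B SDK. A hedged sketch follows; the artifact name, checkpoint path, and registered-model name are illustrative placeholders.

```python
# Sketch: log a fine-tuned checkpoint and promote it to the W&B model registry.
# Artifact and registered-model names are illustrative placeholders.
import wandb

run = wandb.init(project="support-triage")

model_artifact = wandb.Artifact("support-llama-2-7b", type="model")
model_artifact.add_dir("./checkpoints/best")           # assumed local checkpoint dir
run.log_artifact(model_artifact)

# Link the artifact into the registry so it becomes the team's source of truth.
run.link_artifact(model_artifact, "model-registry/support llama 2 7b")
run.finish()
```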
At this point, we might think the hard part is over. In reality, we've just gotten started. There are still so many unanswered questions: how do we make our model run fast? How can we be sure our service is using updated packages free from vulnerabilities? How do we rate-limit and monitor service usage? That's where NIM comes in. We've integrated our model registry with NVIDIA NIM. A NIM is a secure and highly optimized container for deployment; you can think of it almost like an operating system. A NIM can securely execute generative AI apps that generate text, images, audio, or video. Custom NIM models must be compiled for the specific accelerator they're meant to run on. I've started a W&B launch agent on an A100 with 80 GB of RAM, which will act as our compiler. A launch agent is the equivalent of a CI/CD runner: it's constantly phoning home asking if there's work to be done, and then executes that work when it's asked to. We can manually trigger model compilation from the registry UI, or we can integrate with our existing CI/CD system to automatically run compilation when an alias has been added. We've configured our automation to perform the compilation when the "Nim" alias is added to a specific model version. I can add the alias via the UI or do the same programmatically.
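Adding the alias programmatically might look like the following with the W&B public API; the entity, project, artifact name, and version are assumptions for illustration.

```python
# Sketch: add the "Nim" alias to a model version to trigger the compilation automation.
# Entity, project, and artifact names are placeholders.
import wandb

api = wandb.Api()
artifact = api.artifact("my-team/support-triage/support-llama-2-7b:v3", type="model")
artifact.aliases.append("Nim")   # the automation watches for this alias
artifact.save()
```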
Once I do, the agent picks up the task and optimizes our model for the exact accelerator we want to deploy to. After the model has been converted, it's saved as an artifact in W&B as the final output of our pipeline. Now that we have a NIM-optimized model, it's important to validate performance before we ship it to users. Our launch agent automatically runs the NIM service so we can test it. We've set up an evaluation that asserts both the accuracy and the latency metrics we care about for our specific use case.
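Such an evaluation can be as simple as hitting the locally running NIM service, scoring the responses, and asserting thresholds before the artifact is shared. A minimal sketch follows; the endpoint, model name, test cases, and thresholds are assumptions, not the actual evaluation used in the demo.

```python
# Sketch: assert accuracy and latency of the locally running NIM before release.
# Endpoint, model name, test cases, and thresholds are illustrative assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")

test_cases = [("GPU OOM when resuming a sweep", "bug"),
              ("Please add dark mode to reports", "feature_request")]

correct, latencies = 0, []
for ticket, expected_tag in test_cases:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="support-llama-2-7b",
        messages=[{"role": "user", "content": f"Tag this support request: {ticket}"}],
        max_tokens=8,
    )
    latencies.append(time.perf_counter() - start)
    correct += int(expected_tag in response.choices[0].message.content.lower())

accuracy = correct / len(test_cases)
worst_latency = max(latencies)
assert accuracy >= 0.9, f"accuracy too low: {accuracy:.2%}"
assert worst_latency <= 0.5, f"latency too high: {worst_latency:.2f}s"
```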
We can verify that our model's performance is in line with our expectations and then share this newly created NIM artifact with the deployment team. All of these steps have been captured in a central system of record, and many of them have been completely automated. This gives teams peace of mind and the ability to audit, debug, and optimize every component of their LLMOps pipeline. Because the final asset is wrapped in a NIM container, we can rest assured it'll be vulnerability-free and performance-tuned for the exact accelerator we're deploying to. That's W&B and NVIDIA AI Enterprise. If you're interested in joining the early access program, sign up at wb.Nim.