Which Nvidia GPU is BEST for Local Generative AI and LLMs in 2024?
Summary
TL;DR: The video discusses advancements in open-source AI, emphasizing the ease of running generative AI locally for images, video, and podcast transcription. It explores the cost-effectiveness of Nvidia GPUs for compute tasks compared to Apple and AMD, delves into the latest Nvidia RTX 40 Super Series GPUs and their AI capabilities, and weighs the potential of older models for AI development tasks. It also highlights the significance of Nvidia's TensorRT platform for deep-learning inference and showcases the impressive performance of modified enterprise-grade GPUs in a DIY setup.
Takeaways
- Open-source AI has seen significant advancements, enabling local generation of AI content for images, video, and even podcast transcriptions at a rapid pace.
- Nvidia GPUs currently lead in cost of compute for AI tasks, with Apple and AMD as close competitors.
- The decision between renting and buying GPUs often leans toward buying for those who want to experiment and develop with AI tools like MergeKit.
- Nvidia's messaging is confusing due to the variety of GPUs available, ranging from enterprise-specific to general consumer products.
- The release of Nvidia's RTX 40 Super Series in early January introduced GPUs with enhanced AI capabilities, starting at $600.
- The new GPUs boast improved performance metrics such as shader teraflops, RT teraflops, and AI TOPS, catering to gaming and AI-powered applications.
- Nvidia's DLSS (Deep Learning Super Sampling) technology uses AI-generated pixels to increase resolution in games, enhancing performance.
- The AI Tensor Cores in the new GPUs are highlighted for their role in high-performance deep-learning inference, beneficial for AI models and applications.
- Techniques like model quantization have made it possible to run large AI models on smaller GPUs, opening up more affordable options for AI development.
- Nvidia's TensorRT platform is an SDK that optimizes deep-learning inference, improving efficiency and performance for AI applications.
- The video also discusses the use of enterprise-grade GPUs in consumer settings, highlighting the potential for high-performance AI tasks outside of professional environments.
Q & A
What advancements in open source AI have been made in the last year according to the transcript?
-The transcript mentions massive advancements in open-source AI, including the ease of running local large language models (LLMs), generative AI like Stable Diffusion for both images and video, and the ability to transcribe entire podcasts in minutes.
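As a concrete example of how lightweight local transcription has become, here is a minimal sketch using the open-source openai-whisper package; the model size and file name are illustrative assumptions, not anything specified in the video:

```python
# Minimal local podcast transcription with OpenAI's open-source Whisper.
# pip install openai-whisper   (also requires ffmpeg on the system)
import whisper

# "base" is a small, fast model; larger ones ("medium", "large") are
# more accurate but need more VRAM. The file name is a placeholder.
model = whisper.load_model("base")
result = model.transcribe("podcast_episode.mp3")
print(result["text"])
```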
What is the current best option in terms of cost of compute for AI tasks?
-The transcript suggests that Nvidia GPUs are currently the best option in terms of cost of compute for AI tasks, with Apple and AMD being close competitors.
Should one rent or buy GPUs for AI tasks according to the transcript?
-The transcript recommends buying your own GPU rather than renting for those who want to experiment with tools like MergeKit, or for developers who want to do more in-depth work.
What is the latest series of GPUs released by Nvidia as of the transcript's recording?
-Nvidia has released the new RTX 40 Super Series, which is a performance improvement over the previous generation, aimed at gaming and creative applications with AI capabilities.
What is the starting price for the new RTX 40 Super Series GPUs mentioned in the transcript?
-The starting price for the new RTX 40 Super Series GPUs is $600, which is around the same price as a used RTX 3090 or 3090 Ti.
What is the significance of the AI tensor cores in the new Nvidia GPUs?
-The AI tensor cores in the new Nvidia GPUs deliver high performance for deep learning inference, which is crucial for AI tasks and applications, including low latency and high throughput for inference applications.
How does Nvidia's DLSS technology work, and what does it offer?
-DLSS, or Deep Learning Super Sampling, is a technology that infers pixels to increase resolution without the need for more ray tracing. It can accelerate full ray tracing by up to four times with better image quality.
What is the role of Nvidia's Tensor RT in AI and deep learning?
-Nvidia's TensorRT is an SDK for high-performance deep-learning inference that includes an optimizer and runtime delivering low latency and high throughput for inference applications, improving efficiency and performance.
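For a sense of what using TensorRT looks like in practice, here is a minimal sketch of building an FP16 engine from an ONNX model with the TensorRT Python API; the file names are placeholders, and the exact calls vary slightly between TensorRT versions:

```python
# Build a TensorRT engine from an ONNX model (TensorRT 8.x-style API).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:          # placeholder model file
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # enable FP16 optimization

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:        # serialized engine for reuse
    f.write(engine_bytes)
```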
What is the potential of quantization in making large AI models run on smaller GPUs?
-Quantization adjusts the numerical representation of a model's underlying weights, enabling large AI models that would normally require multiple GPUs to run on smaller ones, like the 3090 or even a 4060, with reasonable accuracy.
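To make the idea concrete, here is a minimal sketch of loading a large model in 4-bit with Hugging Face Transformers and bitsandbytes; the model ID is a placeholder, and this is just one of several quantization routes (GPTQ, AWQ, and EXL2 are others):

```python
# Load an LLM with 4-bit quantization so it fits on a single consumer GPU.
# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"   # placeholder; pick any causal LM

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16, store in 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto")

inputs = tokenizer("Quantization lets a 13B model fit in",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```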
What are some of the models that Nvidia has enhanced with Tensor RT as mentioned in the transcript?
-Nvidia has enhanced models like Code Llama 70B, Kosmos-2 from Microsoft Research, and a lesser-known Meta model called SeamlessM4T, a multimodal foundational model capable of translating speech and text.
What is the situation with the availability of Nvidia's A100 GPUs in the SXM4 format according to the transcript?
-The transcript mentions that since the discovery of how to run A100 GPUs outside of Nvidia's own hardware, it has become nearly impossible to find reasonably priced A100 40 GB and 80 GB GPUs in the SXM4 format on eBay.
Outlines
Advancements in Open Source AI and GPU Options
The script discusses the significant strides made in open source AI over the last year and into early 2024, enabling local deployment of generative AI models for images, video, and podcast transcription. It emphasizes the cost-of-compute advantage of Nvidia GPUs and weighs whether renting or buying GPUs is more cost-effective. The video aims to clarify the value of the latest Nvidia GPUs, comparing them with older models and enterprise hardware, and hints at a later discussion of high-end enterprise hardware options.
Nvidia's New GPU Releases and AI Capabilities
This paragraph delves into Nvidia's recent release of the RTX 40 Super Series GPUs, highlighting their AI capabilities and performance improvements. It mentions the use of AI for upscaling in gaming through DLSS technology and the new GPUs' ability to handle AI tasks more efficiently. The paragraph also touches on the potential of these GPUs for developers and the comparison of their performance with previous models, suggesting that while the new GPUs offer enhanced capabilities, their pricing may not always reflect better value for money.
Exploring AI Model Quantization and GPU Performance
The script explores AI model quantization, which reduces model sizes to fit on smaller GPUs without significant loss of accuracy. It discusses the progress made in this area, particularly with models like Llama 2 and Hugging Face Transformers, and how this advancement makes GPUs like the 4060 capable of running models that were previously too large. The paragraph also addresses the different memory and bandwidth requirements of inference versus training, and the potential of the new AQLM method to enable even more efficient model deployment.
Nvidia TensorRT Platform and DIY Enterprise GPU Setups
The final paragraph discusses the significance of Nvidia's TensorRT platform for deep-learning inference, its performance and efficiency benefits, and its integration with other Nvidia technologies, including the recent enablement of TensorRT for large models like Code Llama 70B and Kosmos-2. Additionally, the script recounts a Reddit user's experience building a custom system from Nvidia's enterprise-grade GPUs in a non-traditional configuration, demonstrating the potential of high-performance DIY setups and the challenges involved.
Keywords
Open Source AI
Local LLMs
Nvidia GPUs
Compute Cost
DLSS
Quantization
TensorRT
AI Flux
SXM Form Factor
Inference
EXL2
Highlights
Open-source AI has seen massive advancements, with local generative AI models like Stable Diffusion for images and video, and tools to transcribe podcasts in minutes (see the sketch after this list).
Nvidia GPUs are currently the best in terms of cost of compute for AI advancements, with Apple and AMD being close competitors.
The decision between renting or buying GPUs depends on the user's need for experimentation and in-depth development.
Nvidia's messaging is confusing due to the tagging of AI onto various products, with many releases in a short period.
Nvidia's 40 Series GPUs, specifically the RTX 40 Super Series, offer improved performance and generative AI capabilities starting at $600.
The RTX 40 Super Series features AI as a superpower, with significant performance improvements in gaming and creative applications.
Nvidia's DLSS (Deep Learning Super Sampling) technology can infer pixels to increase resolution without additional ray tracing.
AI tensor cores in the new GeForce RTX super GPUs deliver high performance for deep learning inference applications.
Nvidia's RTX 4070 Super offers 20% more cores than the RTX 4070 at the same price point, making it a compelling option.
The RTX 4070 Super is faster than a 3090 at a fraction of the power, but used 3090 prices are similar, raising the question of its value.
Advancements in LLM quantization allow for running large models on smaller GPUs, making high-parameter models accessible on consumer-grade hardware.
The Nvidia TensorRT platform is an SDK for high-performance deep-learning inference, improving efficiency and reducing latency.
Nvidia has enabled TensorRT with large models like Code Llama 70B, Kosmos-2, and SeamlessM4T, enhancing multilingual and multimodal capabilities.
The user innovation of running A100 GPUs outside of Nvidia's hardware chassis has led to high-performance, albeit complex, DIY setups.
The cost-effectiveness of older Nvidia GPUs like the 3090 makes them a hard option to beat, especially for inference tasks.
The video concludes with recommendations for the 4070 Super for inference tasks and a comparison with the 3090 for overall value.
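As a concrete illustration of the first highlight above, local image generation is now a few lines of code. This is a minimal sketch using Hugging Face Diffusers, assuming a CUDA GPU; the model checkpoint and prompt are illustrative placeholders, not anything named in the video:

```python
# Generate an image locally with Stable Diffusion via Hugging Face Diffusers.
# pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline

# Model ID is a placeholder; any Stable Diffusion checkpoint works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # fp16 keeps VRAM use in the single-digit-GB range

image = pipe("a photo of a GPU rack on a bamboo shelf").images[0]
image.save("output.png")
```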
Transcripts
Open source AI has made massive advancements in the last year, and even in the first month of 2024. It's never been easier to run local LLMs, generative AI like Stable Diffusion for both images and video, and even do things like transcribe entire podcasts in minutes. And the question is: how do you do that? The best tools for this in terms of the cost of compute, in terms of tokens per dollar, I believe are still Nvidia GPUs. Apple and AMD are getting really close, but if we're looking at what's going to give you the most options, Nvidia GPUs are basically it. Then the question comes down to: do you rent them or do you buy them? For a lot of people, especially people who want to experiment and mix and match with things like MergeKit, or developers that want to do some more in-depth work, buying your own GPU makes way more sense than renting on a service like RunPod, Vast.ai, or TensorDock.
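To put rough numbers on the rent-versus-buy question, a back-of-the-envelope breakeven calculation helps. This is a minimal sketch; both prices in it are placeholder assumptions, not figures from the video:

```python
# Back-of-the-envelope rent-vs-buy breakeven for a GPU.
# Both prices are illustrative assumptions -- check current listings.

GPU_PRICE_USD = 600.0       # e.g. a used RTX 3090 or a new RTX 4070 Super
RENTAL_RATE_USD_HR = 0.45   # assumed hourly rate for a comparable cloud GPU

breakeven_hours = GPU_PRICE_USD / RENTAL_RATE_USD_HR
print(f"Buying pays for itself after ~{breakeven_hours:,.0f} GPU-hours")
print(f"That's ~{breakeven_hours / 24:,.0f} days of continuous use")
```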
Then the question is: which GPU do you buy? Nvidia's messaging is all over the place right now. There are obviously enterprise GPUs that are meant to do very specific things but are maybe not as general or accessible to most consumers, and they just tag "AI" onto everything now. Given they've made a lot of releases in the last week, I wanted to condense a lot of this information, show you what's possible now, and share whether I think the latest Nvidia GPUs are really a great deal, whether or not older Nvidia GPUs are still a better value, and then, for those of you who just have way too much money sitting around, how far you can stretch into enterprise hardware. I think you'll be surprised what I found, but stay tuned for that just a bit later. So that's what I want to go over in this video: some new models that Nvidia has released, features that are enabled in the newest GPUs, and whether or not the 5090 is actually coming out in 2024. So welcome to AI Flux, let's get into it.

I should first off start by saying that we're still not really sure when the 5090 is coming. We know that TSMC is building a brand new facility in Japan, unfortunately not in the USA, likely to try to stem how much China is now affecting TSMC's future plans in Taiwan. What's also interesting is we know that the whole game of making more AI compute is getting much more competitive, and Nvidia clearly cares about AI way more than consumer GPUs, so the 5090 is probably not coming anytime soon; at the soonest, probably the end of 2024. I hope I have to eat my words and we see the 5090 before that.
But let's get into the latest 40 Series GPUs that were released in early January. What's good is we now know all these features are relatively real and that the GPUs are actually being produced. Nvidia put out a press release in January announcing the new RTX 40 Super Series, and for those of you who don't know, the Super series is pretty much when Nvidia has piecemeal GPU performance improvements they want to make to stretch a generation of GPUs out a bit more. So, for instance, sometimes we'll have GPUs once every cycle, but usually we see the Ti come out a year or two after the initial release of the new generation. They debut these as "new heroes in the gaming and creative universe with AI as their superpower," and of course Nvidia is pretty fast and loose with the actual specs that come along with this. They say here that gaming GPUs are amplified with more performance and generative AI capabilities starting at $600, which is a curious price point, because that's right around where you start looking at what used 3090s or 3090 Tis can be had for.

So the question here is: what did they release? There's the 4080 Super and the 4070 Super, which they say "supercharge the latest games and form the core of AI-powered PCs." They say this latest iteration of Nvidia Ada Lovelace architecture-based GPUs delivers up to 52 shader teraflops, 121 RT teraflops, and 836 AI TOPS; these are all just units of compute, basically. The 4070 Super, being the cheapest of these, starts at $600. When they say "AI-powered," most of what they're referring to is DLSS, one of their Nvidia Deep Learning Super Sampling technologies, basically meaning they can infer pixels to increase resolution without actually having to do more ray tracing. In their words, with DLSS, 7 out of 8 pixels can be AI-generated, accelerating full ray tracing by up to four times with better image quality. This is actually getting pretty good; I don't really play a lot of video games these days, but it is cool to see this getting more common, and basically becoming capable in games that don't even natively support it.
They also call this "an AI-powered leap in PC computing": they say the new GeForce RTX Super GPUs are the ultimate way to experience AI on PCs. The AI Tensor Cores deliver up to the same specs they showed before, and they are really big on Nvidia TensorRT, which is actually pretty cool; Nvidia has released a lot of fine-tunes of models that enable this technology. What they call it is software for high-performance deep-learning inference, which includes deep-learning inference optimizations and a runtime that delivers low latency and high throughput for inference applications. This is mostly focused on LLMs, and a lot of the work Nvidia has done here is just Windows tooling to make a lot of this stuff work, since most of the development that goes on with a lot of this AI stuff is really on Linux. Again, they mention DLSS and a few other clever video-manipulation things for streamers, like removing background noise and doing chroma key in real time, which is crazy, that that's actually just a common thing all these cards do now.

Now, when they get into what these cards actually are, it's pretty interesting. They say there's the "4K monster," the 4080 Super: you're getting blistering performance compared to the RTX 4080, and the RTX 4080 is twice as fast as the RTX 3080 Ti. But the 3080 Ti is really a pretty old GPU at this point, so it's curious if that's the case, and you're getting kind of gouged here, because this is a card they're now selling for basically $1,000. Next is the 4070 Ti Super. They mostly aim this card at gaming; it still has 16 gigs of memory, but the bus is basically the same, and they actually compare the performance here to the 3070 Ti, so it's only about 1.6 times faster than a 3070 Ti, a curious comparison. Then they show the 4070 Super, which has 20% more cores than the RTX 4070. The biggest claim of all of these is that the RTX 4070 Super is faster than a 3090 at a fraction of the power, which is true, but the irony is the price is still the same, and RTX 3090s are only going to get cheaper. The 3090, if you have two of them with NVLink, I would say is still the best value in any GPU available.
The thing is, I think the places to look here are these: the 4080 Super is an interesting case, though it's kind of expensive, and the 4070 Super is curious, but the question is whether it's actually more performant than the 3090 at doing things like LLMs or other kinds of AI dev tasks. Now you might think, "oh, it only has 16 gigs of RAM, how useful could that really be, why would you even consider recommending a card of that caliber?" The reason I say that is we've made a lot of progress in what I would call more of an art than a science: LLM quantization. This is basically a set of methods, and there are a lot of different ways you can do this nowadays, where you adjust the representation of the underlying weights to take models that in their full form would require dozens of GPUs down to something you can actually fit on a smaller GPU. So, for instance, you can quantize a 70-billion-parameter model into something that's actually small enough to fit on something as small, at least here, as a 3090 or a 4060.

Initially, the challenge with doing this was that you would get significantly lower accuracy, and in certain cases much less capability. But what's kind of cool, especially with a lot of the work done around Llama 2 and Hugging Face Transformers, is we can now make these models small enough that they work on something as small as a 4060, which even just three to four months ago would have been thought of as something kind of crazy. To be fair, these are still relatively full implementations of the models; they haven't been reduced to the point that you can run them on an iPhone, or on MLX or GGUF. For instance, Miqu 70B, which we covered on this channel before, was actually one of the first we saw reduced, with a method called EXL2, to the point of running on a single 3090, which is pretty cool.
It is important to mention that fine-tuning, running inference, and training are all separate processes, and generally inference is the lightest in terms of memory use. Bandwidth-wise, however, you'll see a different utilization pattern, and that's why it's sometimes common, again to bring up risers, that GPUs will be just fine in training in a kind of distributed format, but then when you start to do inference you'll start to have issues, because inference is more bandwidth-dependent. What's really cool with this new AQLM method is that they make models small enough to actually run in 5 gigs of RAM. And this chart shows what EXL2 is capable of, and it's pretty crazy: it looks at Mixtral 8x7B and, in theory, what could be achieved in terms of bits per weight, which you can roughly equate to the requisite amount of RAM you'd need to run these. Although you wouldn't be able to run a 34-billion-parameter model like Code Llama on something as small as an 8-gig GPU, what's pretty cool is that with 2-bit quantization you could comfortably fit it into a card with just 12 gigs of VRAM, and the 4070 has 16, which is pretty cool. Now, obviously the 3090 has 24 gigs, and I still think it's kind of a better option, but if you can't find one, or you have to buy a new GPU, I would recommend the 4070, because now, with these methods, if you're just doing inference, and you're more using models as opposed to actually developing with them, this could be a really great choice.
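The bits-per-weight to VRAM relationship the chart illustrates is easy to approximate yourself. Here's a minimal sketch; the 20% overhead factor is an assumption to account for KV cache and activations, and real usage varies with context length and runtime:

```python
# Rough VRAM estimate for running an LLM at a given quantization level.
# The 20% overhead factor is an assumed allowance for KV cache and
# activations; actual usage depends on context length and runtime.

def estimated_vram_gb(params_billion: float, bits_per_weight: float,
                      overhead: float = 1.2) -> float:
    bytes_for_weights = params_billion * 1e9 * bits_per_weight / 8
    return bytes_for_weights * overhead / 1e9

for bits in (16, 8, 4, 2):
    print(f"34B model @ {bits}-bit: ~{estimated_vram_gb(34, bits):.1f} GB")
# At ~2 bits per weight, a 34B model lands near 10 GB -- inside a
# 12 GB card, consistent with the chart discussed above.
```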
So why is this Nvidia TensorRT platform such a big deal? At a high level, it's just an SDK for high-performance deep-learning inference, which includes a deep-learning inference optimizer and runtime that delivers low latency and high throughput for inference applications. By improving efficiency, sometimes that means it uses a bit less RAM, so it speeds up inference; you can optimize performance, and they also tie this in with other Nvidia technology like Nvidia Triton. So basically it's a bolt-on improvement for CUDA that improves, for certain applications, the ways you would go about deploying these models. What's pretty cool, and these are all benchmarks on the H100, obviously, is that with TensorRT-LLM you can eke out about an 8x performance boost compared to the A100; comparing just to the H100 itself, you get about a 2x boost, which is pretty cool, and in theory this also makes their GPUs a little more efficient.

But what have they actually shown we can do with this? What they've shown here are three big models they're really proud of, and this was actually just released a few days ago. The three they have enhanced with TensorRT are: Code Llama 70B, which is quite cool, and I like it more than GPT-4's coding assistant right now; and Kosmos-2, which according to Nvidia is the largest multimodal language model from Microsoft Research, also now enabled with TensorRT, which is pretty cool. The other nice thing is they've actually built some nice wrappers to run this on Windows, so if you don't want to run Linux, you can; I still recommend learning how to use Linux if you really want to develop on this stuff, because it's just easier and saves time, but yes, you can run this on Windows, and they have a nice interface for it, so for beginners this is actually pretty cool. It was also curious that they chose to enable and implement TensorRT with SeamlessM4T, a lesser-known Meta model, which is a multimodal foundational model capable of translating speech and text, with a real focus on ASR and communication obstacles, which is kind of interesting. They're pretty proud of the Kosmos interface; clearly this was meant to be demoed, and they do show what you can do here, which is pretty cool.

So Nvidia is clearly proud of TensorRT, and it is interesting to see them finally dripping this down into consumer cards; for the longest time, a lot of these improvements were heavily restricted to their workstation cards, like the A5000 and the A6000, with the A8000 now being the flagship of that class of AI developer GPUs.
Now, we did save the best for last, which is one of the coolest things I've seen. To be fair, I saw the first inklings of this way of deploying Nvidia GPUs about five months ago, but it was really cool to see that the secret had kind of gotten out, on Reddit at least, and someone really took it to its logical conclusion. By this I mean people finding clever ways to buy cheaper A100 40- and 80-gig GPUs that were actually intended to be used in Nvidia's own chassis. These use the SXM4 or SXM5 form factor, that weird chiplet-style module that lets you put something like nine of them in one server, and people have actually found ways to run up to about eight or ten of them. There was a really cool Reddit post I found that showed someone doing this, and the funny thing is they managed to pull it off, but it was so hard to put together, and such a hassle, that they're now selling all of it. They said it was cool, it worked, and it made about $5,000 to $6,000 a month when they rented it out on Vast.ai, but they gave up because it was just too much work. And I will say, what's funny is you can now actually find these Nvidia A100 baseboards on eBay; these were previously really hard to find. Unfortunately, since the secret is out on how to run A100s outside of Nvidia hardware like this, using some really clever engineering from China, it's now practically impossible to find reasonably priced A100 40-gig or 80-gig GPUs in the SXM4 format on eBay. It's kind of unfortunate, but that's the way it goes; granted, I never had an extra 20 grand to buy four of these anyway.

What's interesting is this user on Reddit, BreakIt-Boris (I'll link the post below), pretty much explains everything he did here. He said it took a while, but he finally got everything wired up, powered, and connected: he managed to get five A100 40-gig GPUs running at 450 watts each, with a dedicated four-port PCIe switch with extenders. Basically, he had this very complicated way of connecting them to a conventional motherboard, which is pretty cool. Each GPU has its own power supply, pulling about 200 watts at idle, which is kind of crazy. What's also pretty interesting is that GPUs like this are really meant to use NVSwitch, a faster version of NVLink that can connect any number of GPUs together. And what's kind of cool is that with P2P RDMA, if you have fast enough switching on the PCI Express bus, technically all these GPUs can talk to each other without needing any physical connection other than PCI Express, which is why a lot of advanced Nvidia GPUs now don't actually need NVLink.
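Whether peer-to-peer transfers are actually available on a given system is easy to check from PyTorch. A minimal sketch, assuming a machine with at least two CUDA GPUs:

```python
# Check whether each pair of GPUs can do P2P (peer-to-peer) transfers
# over PCIe/NVLink; requires a machine with 2+ CUDA GPUs.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: "
                  f"P2P {'available' if ok else 'unavailable'}")
```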
It looks like the biggest thing he ran on this was Goliath at 8-bit with GGUF, which he says weirdly outperforms the EXL2 6-bit model; he's not sure if this has to do with how these GPUs handle transfers, but he did manage to max out the whole system at 12 tokens a second, and to think you're actually getting 12 tokens a second out of something this nutty is kind of crazy. And then, again, as I've mentioned, Christian Payne's (C-Payne's) risers were used here, and a lot of his hardware; there are a lot of these extenders on eBay and AliExpress, but the original design came from this guy, and you should definitely buy directly from him if you can, since he's the one who designed the whole thing.

So this is what it looked like. The funny thing is these Nvidia heatsinks are actually surprisingly large, and you can see the PCIe switch here: these two connectors go to an actual motherboard, and then there are these daughterboards where the SXM4 GPUs are actually connected, with a PCI Express ribbon going from each of those into these, which is kind of interesting. Here's another view, where you can see the actual host adapter, which is pretty cool. And in another view, he had, you know, $40,000 of GPUs and another, I'd say, $6,000 of PCI Express switching hardware on what looks to be a bamboo shelf, which is very impressive if I do say so myself. If you're in the market for this much hardware, I don't know if he's sold it yet; it's been about two weeks, so you can definitely check if you're curious. The other funny thing is, here's an example of where he bought one of these GPUs: at one point this was a wildly cheap way to find a lot of A100 SXM4 40-gig GPUs, for just $1,750, which is pretty crazy. It looks like these were actually the GPUs themselves, not just the coolers, but he did have to fix some bent pins, which is a big risk when you're buying GPUs this expensive, especially from China. So we now know this is possible; if you want to try it, definitely get on Reddit and send BreakIt-Boris a DM.

If you're planning on buying the 4070 or the 4080 Super, please let me know; I think they're pretty good options. I mostly use 3090s and 4090s, but because of my work I have a different budget I can use for this, which is pretty cool, you know, it's for work, so I can do that. But yeah, the 3090 is really hard to beat, and it's so affordable on eBay now, so that's kind of my recommendation. The 4070 Super is also a really good option if you're just going to be doing inference and want to do just enough locally; the 16 gigs of RAM there is fast enough that it really enables a lot. So those are my thoughts; if you disagree with me, please let me know in the comments. As always, if you learned something or you liked this video, please like, subscribe, and share, it helps us out a lot, and we'll see you in the next video.