Which nVidia GPU is BEST for Local Generative AI and LLMs in 2024?

Ai Flux
9 Feb 2024 · 19:06

Summary

TLDR: The video discusses advancements in open-source AI, emphasizing the ease of running generative AI locally for images, video, and podcast transcription. It explores the cost-effectiveness of Nvidia GPUs for compute tasks, comparing them to Apple and AMD. The script delves into the latest Nvidia RTX 40 Super Series GPUs, their AI capabilities, and the potential of using older models for AI development tasks. It also highlights the significance of Nvidia's Tensor RT platform for deep learning inference and showcases the impressive performance of modified enterprise-grade GPUs in a DIY setup.

Takeaways

  • Open source AI has seen significant advancements, enabling local generation of AI content for images, video, and even podcast transcriptions at a rapid pace.
  • Nvidia GPUs are currently leading in terms of cost of compute for AI tasks, with Apple and AMD being close competitors.
  • The decision between renting or buying GPUs often leans towards buying for those who wish to experiment and develop with AI tools like mergekit.
  • Nvidia's messaging is confusing due to the variety of GPUs available, ranging from enterprise-specific to general consumer products.
  • The release of Nvidia's RTX 40 Super Series in early January introduced GPUs with enhanced AI capabilities, starting at $600.
  • The new GPUs boast improved performance metrics such as shader teraflops, RT teraflops, and AI TOPS, catering to gaming and AI-powered applications.
  • Nvidia's DLSS (Deep Learning Super Sampling) technology allows for AI-generated pixels to increase resolution in games, enhancing performance.
  • The AI Tensor Cores in the new GPUs are highlighted for their role in high-performance deep learning inference, beneficial for AI models and applications.
  • Techniques like model quantization have made it possible to run large AI models on smaller GPUs, opening up more affordable options for AI development.
  • Nvidia's Tensor RT platform is an SDK that optimizes deep learning inference, improving efficiency and performance for AI applications.
  • The video also discusses the use of enterprise-grade GPUs in consumer settings, highlighting the potential for high-performance AI tasks outside of professional environments.

Q & A

  • What advancements in open source AI have been made in the last year according to the transcript?

    -The transcript mentions that there have been massive advancements in open source AI, including the ease of running local large language models (LLMs) for generative AI like Stable Diffusion for images and video, and the capability to transcribe entire podcasts in minutes.

  • What is the current best option in terms of cost of compute for AI tasks?

    -The transcript suggests that Nvidia GPUs are currently the best option in terms of cost of compute for AI tasks, with Apple and AMD being close competitors.

  • Should one rent or buy GPUs for AI tasks according to the transcript?

    -The transcript recommends buying your own GPU instead of renting for those who want to experiment and mix and match with tools or for developers who want to do more in-depth work.
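
The rent-versus-buy trade-off comes down to a simple break-even: how many months of rental fees equal the sticker price of the card. The sketch below uses illustrative numbers of my own (a $600 card matching the video's price point, and an assumed ~$0.40/hour cloud rate), not figures quoted in the video:

```python
# Back-of-envelope break-even between renting a cloud GPU and buying one.
# Prices are illustrative assumptions, not quotes from the video.
def break_even_months(purchase_price, hourly_rate, hours_per_month=200):
    """Months of rental at which cumulative rent equals the purchase price."""
    monthly_rent = hourly_rate * hours_per_month
    return purchase_price / monthly_rent

# Hypothetical: a $600 RTX 4070 Super vs ~$0.40/hr rented, ~200 hours/month.
months = break_even_months(600, 0.40)
print(f"Break-even after ~{months:.1f} months")  # roughly 7.5 months
```

Anyone using the GPU heavily for more than a few months comes out ahead buying, which matches the video's recommendation for experimenters and developers.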

  • What is the latest series of GPUs released by Nvidia as of the transcript's recording?

    -Nvidia has released the new RTX 40 Super Series, which is a performance improvement over the previous generation, aimed at gaming and creative applications with AI capabilities.

  • What is the starting price for the new RTX 40 Super Series GPUs mentioned in the transcript?

    -The starting price for the new RTX 40 Super Series GPUs is $600, which is around the same price as used RTX 3090s or 3090 Ti.
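
One way to make that price comparison concrete is cost per gigabyte of VRAM, since VRAM capacity, not raw speed, usually decides which models fit locally. The $600 figures below come from the video; the VRAM capacities are the cards' published specs, and the $1,000 4080 Super price is an assumption based on its launch MSRP:

```python
# Price per GB of VRAM: illustrative comparison of the cards discussed.
# $600 figures are from the video; VRAM capacities are published specs.
cards = {
    "RTX 4070 Super (new)": (600, 12),
    "RTX 4080 Super (new)": (1000, 16),
    "RTX 3090 (used)": (600, 24),
}
for name, (price_usd, vram_gb) in cards.items():
    print(f"{name}: ${price_usd / vram_gb:.0f} per GB of VRAM")
```

On this metric the used 3090 costs half as much per gigabyte as the 4070 Super, which is why the video keeps circling back to it.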

  • What is the significance of the AI tensor cores in the new Nvidia GPUs?

    -The AI tensor cores in the new Nvidia GPUs deliver high performance for deep learning inference, which is crucial for AI tasks and applications, including low latency and high throughput for inference applications.

  • How does Nvidia's DLSS technology work, and what does it offer?

    -DLSS, or Deep Learning Super Sampling, is a technology that infers pixels to increase resolution without the need for more ray tracing. It can accelerate full ray tracing by up to four times with better image quality.
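
The "7 out of 8 pixels AI-generated" claim can be sanity-checked with simple pixel accounting. The sketch below assumes DLSS 3 with Super Resolution in a mode that renders one quarter of the output pixels plus Frame Generation synthesizing every second frame; those mode assumptions are mine, not stated in the video:

```python
# Pixel accounting behind "7 of 8 pixels are AI-generated".
# Assumes Super Resolution rendering 1/4 of output pixels per rendered frame
# and Frame Generation producing every second displayed frame entirely.
super_resolution_rendered = 1 / 4   # fraction of pixels rasterized per rendered frame
frame_generation_rendered = 1 / 2   # fraction of displayed frames actually rendered
rendered_fraction = super_resolution_rendered * frame_generation_rendered
ai_generated_fraction = 1 - rendered_fraction
print(f"AI-generated pixels: {ai_generated_fraction:.0%}")  # 88%, i.e. 7 of 8
```

Under those assumptions only 1 pixel in 8 is traditionally rendered, which is where the marketing figure comes from.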

  • What is the role of Nvidia's Tensor RT in AI and deep learning?

    -Nvidia's Tensor RT is an SDK for high-performance deep learning inference, which includes optimizations for runtime that deliver low latency and high throughput for inference applications, improving efficiency and performance.
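
"Low latency, high throughput" are measurable quantities: latency is time per request, throughput is requests served per unit time, and optimizations like Tensor RT aim to improve both. The harness below is a generic stand-in (a dummy workload in place of a real Tensor RT engine call, which this sketch does not use):

```python
# Minimal latency/throughput harness; the workload is a dummy stand-in,
# not an actual TensorRT engine invocation.
import statistics
import time

def measure(fn, n_requests=50):
    """Run fn n_requests times, reporting median latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "throughput_rps": n_requests / elapsed,
    }

# Dummy "inference" call standing in for a real optimized runtime.
stats = measure(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

The same two numbers are what inference-runtime benchmarks (like the H100 figures mentioned later in the video) are reporting.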

  • What is the potential of quantization in making large AI models run on smaller GPUs?

    -Quantization allows for the adjustment of the representation of underlying datasets, enabling large AI models that would normally require multiple GPUs to run on smaller ones, like the 3090 or even a 4060, with reasonable accuracy.
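
The arithmetic behind that claim is straightforward: weight memory scales linearly with parameter count and bits per weight. A minimal sketch (the helper name is mine; real runtimes add overhead for the KV cache and activations, so treat these numbers as floors):

```python
# Minimum VRAM needed just to hold a model's weights at a given precision.
# Actual usage is higher (KV cache, activations, runtime overhead).
def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

for bits in (16, 8, 4, 2):
    print(f"70B model @ {bits}-bit: ~{weight_gib(70, bits):.1f} GiB of weights")
```

Even at 4-bit, a 70B model's weights (~33 GiB) overflow a single 24 GiB 3090, which is why the aggressive 2-3 bit schemes discussed elsewhere in the video matter.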

  • What are some of the models that Nvidia has enhanced with Tensor RT as mentioned in the transcript?

    -Nvidia has enhanced models like Code Llama 70b, Cosmos 2 from Microsoft Research, and a lesser-known Meta model called Seamless M4T, which is a multimodal foundational model capable of translating speech and text.

  • What is the situation with the availability of Nvidia's A100 GPUs in the SXM4 format according to the transcript?

    -The transcript mentions that due to the discovery of how to run A100 GPUs outside of Nvidia's own hardware, it has become nearly impossible to find reasonably priced A100 40 and 80 GB GPUs in the SXM4 format on eBay.

Outlines

00:00

Advancements in Open Source AI and GPU Options

The script discusses the significant strides made in open source AI, particularly in 2024, enabling local deployment of generative AI models for images, video, and podcast transcription. It emphasizes the importance of Nvidia GPUs for compute efficiency and questions whether renting or buying GPUs is more cost-effective. The video aims to clarify the value of the latest Nvidia GPUs, comparing them with older models and enterprise hardware, and hints at a future discussion on the high-end of enterprise hardware options.

05:01

Nvidia's New GPU Releases and AI Capabilities

This paragraph delves into Nvidia's recent release of the RTX 40 Super Series GPUs, highlighting their AI capabilities and performance improvements. It mentions the use of AI for upscaling in gaming through DLSS technology and the new GPUs' ability to handle AI tasks more efficiently. The paragraph also touches on the potential of these GPUs for developers and the comparison of their performance with previous models, suggesting that while the new GPUs offer enhanced capabilities, their pricing may not always reflect better value for money.

10:01

๐Ÿ” Exploring AI Model Quantization and GPU Performance

The script explores the concept of AI model quantization, which allows for the reduction of model sizes to fit on smaller GPUs without significant loss of accuracy. It discusses the progress made in this area, particularly with models like LLaMA 2 and Hugging Face Transformers, and how this advancement makes GPUs like the 4060 capable of running models that were previously too large. The paragraph also addresses the different requirements for inference and training in terms of memory and bandwidth, and the potential of the new AQLM method to enable even more efficient model deployment.

15:02

๐ŸŒ Nvidia Tensor RT Platform and DIY Enterprise GPU Setups

The final paragraph discusses the significance of Nvidia's Tensor RT platform for deep learning inference, its benefits in terms of performance and efficiency, and its integration with other Nvidia technologies. It also covers the recent enablement of Tensor RT for large models like Code Llama and Cosmos 2. Additionally, the script narrates a Reddit user's experience with setting up a custom system using Nvidia's enterprise-grade GPUs in a non-traditional configuration, demonstrating the potential for high-performance DIY setups and the challenges involved in such endeavors.
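
The economics of that DIY build can be sanity-checked with the numbers given in the video: five A100 40GB cards at 450 W each under load, earning roughly $5,000 to $6,000 per month rented out on Vast. The electricity price below is my own assumption:

```python
# Rough monthly economics of the 5x A100 rig described in the video.
# GPU count and wattage are from the Reddit post; $0.15/kWh is an assumption.
gpus = 5
watts_per_gpu = 450          # under load, per the post
hours_per_month = 730
price_per_kwh = 0.15         # assumed electricity rate

kwh = gpus * watts_per_gpu * hours_per_month / 1000
power_cost = kwh * price_per_kwh
rental_income = 5000          # low end of the $5k-6k/month quoted

print(f"~{kwh:.0f} kWh/month, ~${power_cost:.0f} in power, "
      f"~${rental_income - power_cost:.0f} gross margin")
```

Even at full load around the clock, power is a few hundred dollars against thousands in rental income, which makes the builder's decision to sell anyway (the hassle, not the economics) the notable part of the story.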

Keywords

Open Source AI

Open Source AI refers to artificial intelligence software whose source code is available to the public, allowing anyone to view, use, modify, and distribute it without restrictions. In the video, the script mentions advancements in open source AI, indicating the rapid progress in this field and its accessibility to a broader audience. The script also mentions generative AI models like 'stable diffusion,' which are examples of open source AI technologies.

Local LLMs

Local LLMs, or Large Language Models, are AI systems designed to process and generate human-like text based on large datasets. They are referred to in the script as being easily runnable on personal devices, indicating a shift towards more accessible AI technology that doesn't require cloud-based computation.

Nvidia GPUs

Nvidia GPUs, or Graphics Processing Units, are specialized hardware accelerated for the computation of graphics and, in recent years, for general-purpose computing tasks, including AI. The script discusses Nvidia GPUs as a preferred choice for AI tasks due to their computational efficiency and the company's focus on AI capabilities.

Compute Cost

Compute cost in the context of the script refers to the financial expense associated with running computationally intensive tasks, such as AI model training or inference. The script compares different GPU options in terms of their cost-effectiveness for AI workloads.

DLSS

DLSS, or Deep Learning Super Sampling, is a technology developed by Nvidia that uses AI to upscale lower resolution images to higher resolutions with improved quality. The script mentions DLSS as an example of how AI is integrated into consumer GPUs to enhance gaming experiences.

Quantization

Quantization in AI refers to the process of reducing the precision of the numbers used in a neural network to enable more efficient computation. The script discusses the advancements in quantization that allow large models to run on smaller GPUs, which is crucial for making AI more accessible.

Tensor RT

Tensor RT is an SDK by Nvidia that is optimized for high-performance deep learning inference. The script highlights the importance of Tensor RT in improving the efficiency and performance of AI applications on Nvidia GPUs.

AI Flux

AI Flux appears to be the name of the video series or channel where the script is from. It likely focuses on topics related to artificial intelligence, its rapid developments, and the impact on technology and society.

SXM Form Factor

The SXM form factor refers to a specific design of GPU modules used in high-performance computing systems, allowing for high-speed interconnects between GPUs. The script discusses how enthusiasts have adapted these enterprise-grade GPUs for use in standard PC environments.

Inference

Inference in AI is the process of making predictions or decisions based on a trained model without the need for further learning. The script discusses inference in the context of running AI models on GPUs and the importance of memory and bandwidth for this task.
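
The bandwidth point can be quantified: during autoregressive decoding, each generated token streams roughly all of the model's weights through the memory bus once, so memory bandwidth divided by model size gives an upper bound on single-stream tokens per second. The numbers below are illustrative (the ~936 GB/s is the 3090's published memory bandwidth; the model choice is an assumption):

```python
# Upper bound on decode speed for a memory-bandwidth-bound LLM:
# each token reads ~all weights once, so tok/s <= bandwidth / model_bytes.
def max_tokens_per_sec(bandwidth_gbps, params_billion, bits_per_weight):
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / model_bytes

# RTX 3090 (~936 GB/s) running a 7B model at 4-bit (~3.5 GB of weights).
print(f"~{max_tokens_per_sec(936, 7, 4):.0f} tokens/s upper bound")  # ~267
```

This is why quantization speeds up inference as well as shrinking it: halving bits per weight halves the bytes moved per token, and why inference-heavy setups care about bandwidth more than raw compute.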

EXL 2

EXL 2 (EXL2) is the quantization format used by the ExLlamaV2 project, which compresses model weights to a configurable number of bits per weight so AI models can run with reduced memory and compute requirements. It's an example of how advancements in AI technology make it possible to run sophisticated models on less powerful hardware.

Highlights

Open-source AI has seen massive advancements, with local generative AI models like Stable Diffusion for images and video, and tools to transcribe podcasts in minutes.

Nvidia GPUs are currently the best in terms of cost of compute for AI advancements, with Apple and AMD being close competitors.

The decision between renting or buying GPUs depends on the user's need for experimentation and in-depth development.

Nvidia's messaging is confusing due to the tagging of AI onto various products, with many releases in a short period.

Nvidia's 40 Series GPUs, specifically the RTX 40 Super Series, offer improved performance and generative AI capabilities starting at $600.

The RTX 40 Super Series features AI as a superpower, with significant performance improvements in gaming and creative applications.

Nvidia's DLSS (Deep Learning Super Sampling) technology can infer pixels to increase resolution without additional ray tracing.

AI tensor cores in the new GeForce RTX super GPUs deliver high performance for deep learning inference applications.

Nvidia's RTX 4070 Super offers 20% more cores than the RTX 4070 at the same price point, making it a compelling option.

The RTX 4070 Super is faster than a 3090 at a fraction of the power, but the price is comparable, raising questions about its value against used 3090s.

Advancements in LLM quantization allow for running large models on smaller GPUs, making high-parameter models accessible on consumer-grade hardware.

Nvidia Tensor RT platform is an SDK for high-performance deep learning inference, improving efficiency and reducing latency.

Nvidia has enabled Tensor RT with large models like Code Llama 70b, Cosmos 2, and Seamless M4T, enhancing multi-language and multimodal capabilities.

The user innovation of running A100 GPUs outside of Nvidia's hardware chassis has led to high-performance, albeit complex, DIY setups.

The cost-effectiveness of older Nvidia GPUs like the 3090 makes them a hard option to beat, especially for inference tasks.

The video concludes with recommendations for the 4070 Super for inference tasks and a comparison with the 3090 for overall value.

Transcripts

play00:00

open source AI has made massive

play00:02

advancements in the last year and even

play00:04

in the first month of 2024 it's never

play00:06

been easier to run local llms generative

play00:09

AI like stable diffusion for both images

play00:12

and video and even do things like

play00:14

transcribe entire podcasts in minutes

play00:17

and the question is how do you do that

play00:19

and the best tools for this in terms of

play00:21

the cost of compute in terms of tokens

play00:24

per dollar I believe is still Nvidia

play00:26

gpus uh apple and AMD are getting really

play00:29

close but if we're looking at uh what's

play00:31

going to give you the most options

play00:33

Nvidia gpus are basically it and the

play00:36

question comes down well do you rent

play00:37

them or do you buy them and for a lot of

play00:39

people especially people who want to

play00:40

experiment and mix and match with things

play00:42

like merge kits or developers that want

play00:44

to do some more in-depth work buying

play00:45

your own GPU makes way more sense than

play00:48

renting on a service like runp pod fast

play00:51

AI or tensor Dock and then the question

play00:53

is which GPU do you buy and Nvidia their

play00:56

messaging is all over the place right

play00:57

now there are obviously Enterprise CPUs

play01:00

that are meant to do very specific

play01:01

things but are maybe not as general or

play01:03

accessible to most consumers and they

play01:07

just tag AI onto everything now and

play01:10

given they've made a lot of releases in

play01:11

the last week I wanted to kind of

play01:13

condense a lot of this information and

play01:15

show you what's possible now and share

play01:17

if I think the latest Nvidia gpus are

play01:20

really a great deal whether or not older

play01:22

Nvidia gpus really are still a better

play01:24

value and then for those of you who just

play01:26

have way too much money sitting around

play01:28

what the uh farthest you can stretch

play01:30

into Enterprise Hardware is and I think

play01:33

you'll be surprised what I found but

play01:35

stay tuned for that just a bit later so

play01:37

that's what I want to go over in this

play01:38

video we're going to go over some new

play01:39

models that Nvidia has released features

play01:41

that are enabled in the newest gpus and

play01:44

whether or not the 5090 is actually

play01:45

coming out in 2024 so welcome to AI flux

play01:49

let's get into it so I should first off

play01:52

start by saying that we're still not

play01:54

really sure when the 5090 is coming we

play01:56

know that uh tsmc is building a brand

play01:59

new new facility in Japan unfortunately

play02:02

not in the USA likely to try to stem uh

play02:06

how much uh China is now affecting uh

play02:09

the future plans for tsmc in Taiwan and

play02:12

what's also interesting is we know that

play02:14

the whole game of making more AI compute

play02:16

is getting much more competitive and

play02:18

Nvidia clearly cares about AI way more

play02:20

than consumer gpus so the 590 is

play02:23

probably not coming anytime soon at the

play02:25

soonest probably the end of 2024 and I

play02:27

hope I have to eat my words and we see

play02:28

the 590 before that but let's get into

play02:31

the latest 40 series gpus that were

play02:33

released in early January and what's

play02:35

good is we now know all these features

play02:37

are relatively real and that the gpus

play02:38

are actually being produced Nvidia let

play02:40

out a press release in January saying

play02:43

that they released new RTX 40 Super

play02:45

Series and for those of you who don't

play02:47

know the the super series is pretty much

play02:48

when Nvidia has peace me GPU performance

play02:51

improvements they want to make to

play02:52

stretch a generation of gpus out a bit

play02:55

more so for instance sometimes we'll

play02:56

have gpus once every cycle but usually

play02:59

we see kind of the the ti come out um a

play03:01

year or two after the initial release of

play03:03

the new generation so they debut these

play03:06

as new heroes in the gaming and creative

play03:08

Universe with AI as their superpower and

play03:11

of course Nvidia is pretty fast and

play03:12

loose with the actual specs that come

play03:15

along with this so they say here that

play03:16

gaming GPU is Amplified with more

play03:18

performance and generative AI

play03:19

capabilities starting at $600 which is a

play03:22

curious price point because that's right

play03:24

around where you start looking at uh

play03:26

what used 309s or 3090 TI can be had for

play03:30

so the question here is what did they

play03:31

release so they there's the 4080 super

play03:34

and the 4070 super which they say

play03:37

supercharge the latest games and form

play03:39

the core of AI powered PCS so they say

play03:41

this latest iteration of Nvidia Ada love

play03:43

lace architecture based gpus delivers up

play03:45

to 52 Shader teraflops 121 RT teraflops

play03:49

and 836 AI tops uh these are all just

play03:52

units of compute basically and the 4070

play03:54

super being the cheapest of these starts

play03:56

at $600 and when they say AI powered

play03:59

most of what they're referring to is

play04:01

dlss or one of their new Nvidia deep

play04:04

learning super sampling Technologies

play04:06

basically saying they can infer pixels

play04:09

to increase resolution without actually

play04:10

having to do more Ray tracing and in

play04:12

their words they they're basically

play04:14

saying with dlss 7 out of 8 pixels can

play04:16

be AI generated accelerating full rate

play04:17

racing by up to four times with better

play04:19

image quality which uh this is actually

play04:21

getting pretty good I don't really play

play04:23

a lot of video games these days but it

play04:25

is cool to see this getting more common

play04:27

and it basically being capable in games

play04:29

that don't even natively support this

play04:30

they also say an AI powered leap in PC

play04:33

Computing so they say the new GeForce

play04:35

RTX super gpus are the ultimate way to

play04:36

experience AI on PCS the AI tensor cors

play04:39

deliver up to the same specs they showed

play04:42

before and they are really big on Nvidia

play04:45

tensor RT which is actually pretty cool

play04:48

and Nvidia has released a lot of fine

play04:50

tunes of models that enable this

play04:51

technology and what they call this is

play04:53

its software for high performance deep

play04:55

learning inference which includes deep

play04:57

learning inference optimizations for the

play04:59

Run time that delivers low latency high

play05:00

throughput for inference applications so

play05:03

this is mostly focused at llms a lot of

play05:05

the work that Nvidia has done here is

play05:07

just Windows tooling to make a lot of

play05:08

this stuff work since most of the

play05:10

development that goes on with a lot of

play05:11

this AI stuff is really on Linux and

play05:13

again they mentioned dlss and a few

play05:16

other kind of clever um video

play05:18

manipulation things for streamers um

play05:20

like removing background noise and doing

play05:22

chroma key in real time which it's crazy

play05:25

uh that that's actually just like a

play05:26

common thing that all these cards do now

play05:28

now when they get in into what these

play05:30

cards actually are it's pretty

play05:31

interesting so they say there's the 4K

play05:33

monster the 480 super uh they say here

play05:37

that that you're getting blistering

play05:39

performance compared to the RTX 480 and

play05:42

that the RTX 480 is twice as fast as the

play05:44

RTX 380 TI and the 380 TI is really a

play05:48

pretty old GPU at this point so it's

play05:49

curious if that's the case and you're

play05:51

getting kind of gouged here because this

play05:53

is a card that they're now selling for

play05:56

basically $11,000 the next is the 4070

play05:59

TI super they mostly aim this card at

play06:02

gaming this card still has 16 gigs of

play06:04

memory so but the bus is basically the

play06:07

same and they actually compare the

play06:09

performance here to the 3070 ti so it's

play06:11

only about 1.6 times faster than a 3070

play06:14

TI uh cous comparison and then they show

play06:17

the 4070 super where they're saying is

play06:20

20% Which has 20% more chorus than the

play06:22

RTX 470 uh the biggest claim of all

play06:25

these here is claiming that the RTX 470

play06:27

is faster than a 3090 at a fraction of

play06:29

the power which is true but the irony is

play06:32

the price is still the same and RTX 390s

play06:35

are only going to get cheaper and the

play06:37

390 if you have two of them within vlink

play06:39

I would say is still the best value in

play06:43

any GPU available the thing is is I

play06:45

think the places to look here the 480

play06:47

super is an interesting case it's kind

play06:50

of expensive though and the 4070 super

play06:52

is curious but the question is is it

play06:54

actually more performant than the 3090

play06:56

at doing things like llms or doing other

play06:59

kind of AI Dev tasks now you might think

play07:02

oh it only has 16 gigs of RAM you know

play07:03

how useful could that really be why

play07:06

would you even consider recommending the

play07:08

card of that caliber and the reason I

play07:11

say that is we've made a lot of progress

play07:14

in really what I would call more of an

play07:16

art than a science of llm quantization

play07:19

and this is basically a set of methods

play07:22

and there are a lot of different ways

play07:23

you can do this nowadays where you can

play07:25

adjust the representation of the

play07:28

underlying data sets to take models that

play07:30

in their full form would would require

play07:33

dozens of gpus down to something that

play07:35

you can actually fit on a smaller GPU so

play07:38

for instance you you can quantize a a 70

play07:41

billion parameter model into something

play07:43

that's actually small enough you can fit

play07:45

on something as small at least here as a

play07:47

3090 or a

play07:50

4060 and initially the challenge with

play07:53

doing this was that you would actually

play07:55

have significantly lower accuracy and in

play07:58

certain cases have much less capability

play08:01

but what's kind of cool um especially

play08:03

with a lot of work done with llama 2 and

play08:05

with hugging face Transformers is we can

play08:08

now make these models small enough that

play08:10

they can work on something as small as a

play08:12

4060 which um previously even just 3 to

play08:15

four months ago would have been thought

play08:17

as something kind of crazy and to be

play08:19

fair these are um still relatively full

play08:22

implementations of models these are not

play08:24

models that have been reduced to the

play08:25

point that you can run them on an iPhone

play08:27

or that you can run on M LX or gdf and

play08:31

for instance Miku 70b which we covered

play08:33

on this channel prior was actually one

play08:35

of the first that we saw um be reduced

play08:37

to the point with a method called EXL 2

play08:41

to run on a single 3090 which is pretty

play08:44

cool and it is important to mention that

play08:46

the process of fine-tuning and the

play08:48

process of actually just running

play08:50

inference or training are all separate

play08:52

and generally inference is the lightest

play08:55

in terms of memory use however bandwidth

play08:58

wise you'll see a different utilization

play09:00

there and that's why it's sometimes

play09:02

common again to bring up risers where

play09:04

you'll see gpus be just fine in training

play09:06

and kind of a distributed format uh but

play09:08

then when you start to do inference um

play09:10

you'll start to have issues because um

play09:11

it's more bandwidth dependent and what's

play09:13

really cool with this new uh aqm method

play09:17

is they make it small enough to actually

play09:19

just run with 5 gigs of RAM and this

play09:22

chart is showing kind of what EXL 2 is

play09:24

capable of and it's pretty crazy so this

play09:28

is actually looking at mixt 8X 7B and in

play09:32

theory what could be achieved in terms

play09:34

of bits per weight which you can roughly

play09:36

equate on this side to uh the requisite

play09:39

amount of ram you'd need to run these

play09:41

and although you wouldn't be able to run

play09:44

a 34 billion parameter bottle like code

play09:46

llama on something as small as an 8 gig

play09:49

uh GPU what's pretty cool is with 2bit

play09:52

quantization you could comfortably fit

play09:55

that into a card that has just 12 gigs

play09:57

of vram and the 4070 has 16 which is

play10:00

pretty cool now obviously the 39d has 24

play10:04

gigs and I still think it's kind of a

play10:05

better option but if you can't find one

play10:07

or you have to buy a new GPU I would

play10:09

recommend the 4070 because now with

play10:11

these methods if you're just doing

play10:12

inference and you're more kind of using

play10:14

models as opposed to actually developing

play10:16

with them this could be a really great

play10:17

choice so why is this Nvidia tensor RT

play10:21

platform such a big deal at a high level

play10:23

this is just an SDK for high performance

play10:25

deep learning inference which includes a

play10:27

deep learning inference optimiz and

play10:29

runtime that delivers low latency high

play10:31

throughput for inference applications

play10:33

and by improving efficiency sometimes

play10:35

that means it uses a bit less Ram so it

play10:38

speeds up inference uh you can optimize

play10:40

performance and uh they also tie this in

play10:43

with some other Nvidia technology like

play10:45

Nvidia Triton so basically it's a

play10:47

bolt-on Improvement for Cuda and

play10:49

improves uh for certain applications

play10:52

ways you would go about deploying these

play10:53

models and what's pretty cool is I mean

play10:56

these are all benchmarks on the h100

play10:58

obviously but you can see that with uh

play11:01

this tensor RT llm you can eek out about

play11:04

an 8X performance boost compared to the

play11:07

a100 uh when you're comparing just to

play11:09

the h100 you get about a 2X boost which

play11:12

is pretty cool and in theory this also

play11:15

makes their gpus a little bit more

play11:16

efficient but what have they actually

play11:18

shown we can do with this so what

play11:21

they've shown here is kind of three big

play11:23

models they're really proud of and this

play11:25

was actually just released a few days

play11:27

ago so the three that they have enhanced

play11:30

with tensor RT is code llama 70b which

play11:34

is quite cool I like this more than gp4s

play11:37

coding assistant right now Cosmos 2

play11:39

which is according to Nvidia is the

play11:41

largest multilanguage language model um

play11:44

from Microsoft research and they' have

play11:47

also uh enabled this with tensor RT

play11:50

which is pretty cool and the other thing

play11:51

that's nice is they've actually built

play11:53

some nice wrappers to actually run this

play11:54

in Windows so if you don't want to run

play11:56

with Linux um I still recommend like

play11:59

learning how to use Linux if you really

play12:00

want to develop on this stuff because

play12:02

it's just easier and saves time but yeah

play12:04

you can run this on Windows and they

play12:06

have a nice interface for it so for

play12:07

beginners this is actually pretty cool

play12:10

and um it was also curious that they

play12:12

chose to also uh enable and sort of

play12:15

implement tensor RT with seamless m4t

play12:18

which is a lesser known meta model which

play12:20

is also a multimodal foundational model

play12:23

capable of translating speech and text

play12:25

and um with really approach with really

play12:28

a focus on on um ASR and kind of

play12:30

communication obstacles which is kind of

play12:32

interesting so they're pretty proud of

play12:34

the cosmos interface clearly this this

play12:36

was meant to be um demoed and they do

play12:39

show what you can do here which is

play12:40

pretty cool so Nvidia is clearly proud

play12:43

of tensor RT and it is interesting to

play12:46

see that they're finally kind of

play12:47

Dripping this down into consumer cards

play12:49

for the longest time a lot of these

play12:51

improvements were heavily restricted to

play12:54

their workstation cards like the a5000

play12:57

and the a6000 obviously the a8000 now

play12:59

being the flagship of that class of kind

play13:02

of AI developer gpus now we did Save The

play13:05

Best For Last which is one of the

play13:07

coolest things I've seen now to be fair

play13:09

the first inklings of this model uh or

play13:12

this way of deploying Nvidia gpus I saw

play13:15

about 5 months ago but it was really

play13:17

cool to see that the secret really had

play13:19

kind of gotten out on Reddit at least

play13:21

and someone really took it to it to its

play13:23

logical conclusion or logical end and by

play13:25

this I mean people finding clever ways

play13:28

to find cheaper a100 40 and 80 gig gpus

play13:33

that were actually intended to be used

play13:34

in Nvidia's own chassis. These use the SXM4 or SXM5 form factor, that weird chiplet-style module that lets you put something like nine of them in one server, and people have actually found ways to run up to about eight or ten of them. There was a really cool Reddit post I found that showed someone doing this, and the funny thing is they pulled it off, but it was so hard to put together and such a hassle that they're now selling all of it. They said it was cool, it worked, and it made about $5,000 to $6,000 a month when they rented it out on vast, but they ultimately gave up because it was just too much work.
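Those rental numbers invite a quick back-of-the-envelope payback estimate. The hardware and revenue figures below come from this video; the 50% utilization guess and the $0.15/kWh electricity price are my own assumptions, not anything the builder reported:

```python
# Rough payback estimate for renting out a DIY A100 rig on vast.
# Hardware cost and revenue are figures from the video; the
# utilization and electricity price are assumed placeholders.
hardware_cost = 40_000 + 6_000             # GPUs + PCIe switching gear (USD)
monthly_revenue = 5_500                    # midpoint of the $5k-$6k/month quoted

idle_watts = 5 * 200                       # five supplies at ~200 W idle each
load_watts = 5 * 450                       # five A100s at 450 W under load
avg_watts = (idle_watts + load_watts) / 2  # crude 50% utilization guess
kwh_per_month = avg_watts / 1000 * 24 * 30
electricity = kwh_per_month * 0.15         # assumed $0.15/kWh

profit = monthly_revenue - electricity
months_to_break_even = hardware_cost / profit
print(f"~${electricity:.0f}/mo power, break-even in ~{months_to_break_even:.1f} months")
```

Even with generous assumptions the rig takes most of a year to pay for itself, which makes "too much work" an understandable verdict.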

What's funny is that you can now actually find these Nvidia A100 baseboards on eBay; these were previously really hard to find. Unfortunately, since the secret is out on how to run A100s outside of Nvidia hardware like this, using some really clever engineering from China, it's now actually impossible to find reasonably priced A100 40GB or 80GB GPUs in the SXM4 format on eBay. It's kind of unfortunate that that's the way it goes now, but granted, I never had an extra 20 grand to buy four of these. So what's

interesting is that this user on Reddit, Breakit Boris (I'll link the post below), pretty much explains everything he did here. He said it took a while, but he finally got everything wired up, powered, and connected. He managed to get five A100 40GB cards running at 450 watts each, with a dedicated four-port PCIe switch and extenders; basically he had this very complicated way of connecting them to a conventional motherboard, which is pretty cool. Each GPU has its own power supply, pulling about 200 watts at idle, which is kind of crazy.
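The one-power-supply-per-GPU setup makes sense once you run the numbers: five of these cards under load draw more than a single standard wall circuit can continuously deliver. The 15 A / 120 V figures below are an assumption about a typical US residential circuit, not something from the post:

```python
# Why each GPU gets its own power supply: total draw vs. one wall circuit.
gpus = 5
watts_per_gpu = 450                      # per-GPU load figure from the post
total_gpu_watts = gpus * watts_per_gpu   # 2250 W before CPU, fans, PSU losses

# Assumed typical US residential circuit: 15 A at 120 V,
# derated to 80% for continuous loads per common electrical practice.
circuit_watts = 120 * 15 * 0.8
print(f"{total_gpu_watts} W of GPUs vs {circuit_watts:.0f} W continuous per circuit")
```

So the rig needs its power spread across multiple supplies, and realistically multiple circuits, before you even count the host system.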

What's also pretty interesting is that GPUs like this are really meant to use NVSwitch, a faster version of NVLink that can connect any number of GPUs together. And what's kind of cool is that with P2P RDMA, if you have fast enough switching on the PCI Express bus, technically all these GPUs can talk to each other without needing any physical connection other than PCI Express, which is why a lot of advanced Nvidia GPUs now don't actually need NVLink.
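To put rough numbers on that trade-off, here is a sketch comparing PCIe 4.0 x16 per-direction bandwidth against the A100's aggregate NVLink 3 bandwidth, using published spec figures; P2P over PCIe works, it just has far less headroom:

```python
# Per-direction PCIe 4.0 x16 bandwidth vs. A100 NVLink 3 aggregate.
# Both figures are from published specs, not measurements on this rig.
pcie4_x16 = 16 * 16 / 8 * (128 / 130)   # 16 GT/s * 16 lanes, 128b/130b encoding -> GB/s
nvlink3_total = 600                      # A100 NVLink 3 aggregate bandwidth, GB/s

print(f"PCIe 4.0 x16: ~{pcie4_x16:.1f} GB/s per direction")
print(f"NVLink 3 aggregate: {nvlink3_total} GB/s, "
      f"roughly {nvlink3_total / pcie4_x16:.0f}x more")
```

For inference, where inter-GPU traffic is mostly activations passed layer-to-layer, that gap is often tolerable; for training it bites much harder.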

It looks like the biggest thing he ran on this was Goliath at 8-bit in GGUF, which he says weirdly outperforms the EXL2 6-bit model; he's not sure if that has to do with how these GPUs are doing transfers, but he did manage to max out the whole system at about 12 tokens a second. And to think that you could be running 12 tokens a second on something this nutty is kind of crazy.
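As a quick sanity check on why 8-bit Goliath fits on this rig at all, here is a weight-only memory estimate. The ~118B parameter count and the effective bits-per-weight values are approximations on my part, and KV cache and activation overhead are ignored:

```python
# Rough weight-memory estimate: does Goliath fit on 5 x 40 GB A100s?
params_b = 118  # Goliath is roughly 118B parameters (approximate)

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """GB needed just to hold the quantized weights."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

q8 = weight_gb(params_b, 8.5)   # GGUF Q8_0 stores ~8.5 bits/weight with block scales
q6 = weight_gb(params_b, 6.0)   # EXL2 at a nominal 6 bits/weight
total_vram = 5 * 40
print(f"8-bit: ~{q8:.0f} GB, 6-bit: ~{q6:.0f} GB, available: {total_vram} GB")
```

Around 125 GB of weights against 200 GB of pooled VRAM leaves real room for KV cache, which is why a model this size is even runnable here.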

Then again, as I've mentioned, Christian Payne, or C-Payne: his risers were used here, and a lot of his hardware was used here. There are a lot of these extenders on eBay and AliExpress, but the original design came from this guy, and you should definitely buy directly from him if you can; he's the one who designed this whole thing. So this is what it looked like.

The funny thing is these Nvidia heatsinks are actually surprisingly large, and you can see the PCIe switch here: these two connectors go to an actual motherboard, and then there are these daughter boards where the SXM4 GPUs are actually connected, with a PCI Express ribbon going from each of those into these, which is kind of interesting. In another view you can see the actual host adapter, which is pretty cool. So he had about $40,000 of GPUs and another, I'd say, $6,000 of

PCI Express switching hardware sitting on what looks to be a bamboo shelf, which is very impressive, if I do say so myself. And if you're in the market for this much hardware, I don't know if he's sold it yet (it's been about two weeks), but you can definitely check if you're curious. The other funny thing is this example of where he bought one of these GPUs: at one point this was a wildly cheap way to get a lot of A100 SXM4 40GB GPUs, at just $1,750 apiece, which is pretty crazy. And it looks like these were actually the GPUs themselves, not just the coolers, although he had to fix some bent pins, which is a big risk when you're buying GPUs this expensive,

especially from China. So now that we know this is possible, if you want to try it, definitely get on Reddit and send Breakit Boris a DM. If you're planning on buying the 4070 or the 4080 Super, please let me know; I think they're pretty good options. I mostly use 3090s and 4090s, but because of my work I have a different budget I can use for this, which is pretty cool; it's for work, so I can do that. But yeah, the 3090 is really hard to beat, and it's so affordable on eBay now, so that's kind of my recommendation. The 4070 Super is also a really good option if you're just going to be doing inference and want to do just enough locally; the 16 gigs of RAM there is fast enough that it really enables a lot.
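One way to frame that recommendation is dollars per gigabyte of VRAM. The used 3090 and 4070 Ti Super prices below are rough street-price assumptions of mine (using the 16 GB Ti Super as the Super-series comparison point); only the $1,750 A100 figure comes from the Reddit post:

```python
# Dollars per GB of VRAM. Prices are rough assumptions (used/street),
# except the $1,750 A100, which is the price from the Reddit post.
cards = {
    "RTX 3090 (24 GB, used)": (700, 24),
    "RTX 4070 Ti Super (16 GB, new)": (800, 16),
    "A100 SXM4 (40 GB, used)": (1750, 40),
}
cost_per_gb = {name: price / vram for name, (price, vram) in cards.items()}
for name, cost in cost_per_gb.items():
    print(f"{name}: ${cost:.0f}/GB")
```

By that crude metric the used 3090 comes out cheapest per gigabyte, which lines up with it being so hard to beat for local inference.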

So those are my thoughts; if you disagree with me, please let me know in the comments. As always, if you learned something or you liked this video, please like, subscribe, and share; it helps us out a lot, and we'll see you in the next video.
