NVIDIA Reveals STUNNING Breakthroughs: Blackwell, Intelligence Factory, Foundation Agents [SUPERCUT]
Summary
TLDR: The transcript discusses the rapid growth of the AI industry, particularly in large language models since the invention of the Transformer. It highlights the computational demands of training such models, with the latest OpenAI model estimated at 1.8 trillion parameters trained on several trillion tokens. The introduction of Blackwell, a new GPU platform, is announced, promising to reduce the energy and cost of training next-generation AI models. The transcript also covers the importance of inference in AI and the potential of Nvidia's technologies for training humanoid robots through the Groot model and Isaac Lab.
Takeaways
- The AI industry has seen tremendous growth from scaling large language models, which have been doubling in size approximately every six months.
- Doubling the model size requires a proportional increase in training-token count, compounding the computational scale.
- State-of-the-art models like OpenAI's GPT require training on several trillion tokens, resulting in a massive total count of floating-point operations.
- Training such a model would take roughly a millennium on a single petaflop GPU, highlighting the need for more advanced hardware (a worked version of this arithmetic appears in the Q&A below).
- The development of multimodal models is the next step, incorporating text, images, graphs, and charts to give models a more grounded understanding of the world.
- Synthetic data generation and reinforcement learning will play crucial roles in training future AI models.
- The Blackwell GPU platform represents a significant leap in computational capability, offering a memory-coherent system for efficient AI training.
- Blackwell's design allows two dies to function as one chip, with 10 terabytes per second of data transfer between them.
- Blackwell aims to reduce the cost and energy consumption associated with training the next generation of AI models.
- The future of data centers is envisioned as AI Factories, focused on generating intelligence rather than electricity.
- Nvidia's Project Groot is an AI model designed for humanoid robots, capable of learning from human demonstrations and executing tasks with human-like movements.
Q & A
How did the invention of the Transformer model impact the scaling of language models?
-The invention of the Transformer model allowed for the scaling of large language models at an incredible rate, effectively doubling every six months. This scaling is due to the ability to increase the model size and parameter count, which in turn requires a proportional increase in training token count.
What is the computational scale required to train the state-of-the-art OpenAI model?
-The state-of-the-art OpenAI model, with approximately 1.8 trillion parameters, requires several trillion tokens to train. The combination of the parameter count and training token count results in a computation scale that demands high performance computing resources.
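As a sanity check on those figures, here is a back-of-the-envelope calculation in Python. The 6 × parameters × tokens rule of thumb for total training FLOPs comes from the scaling-law literature rather than the keynote, and the 3-trillion-token count is an assumed stand-in for "several trillion":

```python
# Rough training-compute estimate for a 1.8T-parameter model.
# Assumptions: the ~6 * params * tokens FLOPs rule of thumb (scaling-law
# literature, not the keynote) and 3e12 tokens for "several trillion".
params = 1.8e12                    # parameters, as stated in the talk
tokens = 3e12                      # training tokens (assumed)

total_flops = 6 * params * tokens  # ~3.2e25, in the quoted 30-50e24 range
petaflops_per_sec = 1e15           # throughput of a one-petaflop GPU

seconds = total_flops / petaflops_per_sec
years = seconds / (365 * 24 * 3600)
print(f"{total_flops:.2e} FLOPs -> ~{years:,.0f} years on one petaflop GPU")
```

This lands at roughly 3 × 10^25 total FLOPs and about a thousand years on a single petaflop GPU, matching the "30 billion seconds" arithmetic in the talk.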
What is the significance of doubling the size of a model in terms of computational requirements?
-Doubling the size of a model means that you need twice as much information to fill it. Consequently, every time the parameter count is doubled, the training token count must also be appropriately increased to support the computational scale needed for training.
How does the development of larger models affect the need for more data and computational resources?
-As models grow larger, they require more data for training and more powerful computational resources to handle the increased parameter count and token count. This leads to a continuous demand for bigger GPUs and higher energy efficiency to train the next generation of AI models.
What is the role of multimodality data in training AI models?
-Multimodality data, which includes text, images, graphs, and charts, is used to train AI models to give them a more comprehensive understanding of the world. This approach helps models develop common sense and knowledge grounded in physics, similar to how humans learn from watching TV and experiencing the world around them.
How does synthetic data generation contribute to the learning process of AI models?
-Synthetic data generation allows AI models to use simulated data for learning, similar to how humans use imagination to predict outcomes. This technique enhances the model's ability to learn and adapt to various scenarios without the need for extensive real-world data.
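To make the idea concrete, here is a minimal, purely illustrative sketch of synthetic data generation: a "teacher" function labels randomly generated inputs, and a "student" model is fit on that synthetic dataset alone. Nothing here reflects Nvidia tooling; the names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher(x):
    # Stand-in for a larger model or simulator that produces labels.
    return 3.0 * x - 1.0 + rng.normal(0.0, 0.1, size=x.shape)

x_synth = rng.uniform(-1.0, 1.0, size=1000)  # synthetic inputs
y_synth = teacher(x_synth)                   # synthetic labels

# "Student": fit a line to the synthetic dataset by least squares.
A = np.stack([x_synth, np.ones_like(x_synth)], axis=1)
slope, intercept = np.linalg.lstsq(A, y_synth, rcond=None)[0]
print(f"student learned y ~ {slope:.2f}x + {intercept:.2f}")
```

The student recovers the teacher's behavior without ever seeing real-world data, which is the essence of the technique the talk describes at vastly larger scale.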
What is the significance of the Blackwell GPU platform in the context of AI model training?
-The Blackwell GPU platform represents a significant advancement in AI model training by offering a highly efficient and energy-saving solution. It is designed to handle the computational demands of training large language models and can reduce the number of GPUs needed, as well as the energy consumption, compared to previous generations of GPUs.
How does the Blackwell system differ from traditional GPU designs?
-The Blackwell system is a platform that includes a chip with two dies connected in such a way that they function as one, with no memory locality or cache issues. It offers 10 terabytes per second of data transfer between the two sides, making it a highly integrated and coherent system for AI computations.
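For a rough feel of why a 10 TB/s die-to-die link matters, the sketch below compares moving a working set across that link versus a fast network. Only the 10 TB/s figure comes from the talk; the working-set size and the 400 Gb/s network speed are illustrative assumptions:

```python
working_set = 100e9            # bytes to move, e.g. 100 GB (assumed)

die_to_die = 10e12             # 10 TB/s, quoted in the talk
network = 400e9 / 8            # 400 Gb/s NIC -> 50 GB/s (assumed)

print(f"die-to-die: {working_set / die_to_die * 1e3:.0f} ms")  # ~10 ms
print(f"network:    {working_set / network:.1f} s")            # ~2.0 s
```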
What is the expected training time for a 1.8 trillion parameter GPT model with the Blackwell system?
-Using the Blackwell system, the training time for a 1.8 trillion parameter GPT model is expected to be the same as with Hopper, approximately 90 days, but with a significant reduction in the number of GPUs required (from 8,000 to 2,000) and a decrease in energy consumption from 15 megawatts to only four megawatts.
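Those quoted figures imply a simple energy comparison; the totals below follow directly from the stated power draws and the shared 90-day wall clock:

```python
hours = 90 * 24                 # 90-day training run

hopper_mwh = 15 * hours         # 15 MW on 8,000 Hopper GPUs -> 32,400 MWh
blackwell_mwh = 4 * hours       # 4 MW on 2,000 Blackwell GPUs -> 8,640 MWh

print(f"Hopper:    {hopper_mwh:,} MWh")
print(f"Blackwell: {blackwell_mwh:,} MWh")
print(f"Energy ratio: {hopper_mwh / blackwell_mwh:.2f}x")  # 3.75x less energy
```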
How does the Blackwell system enhance inference capabilities for large language models?
-The Blackwell system is designed for trillion-parameter generative AI and offers inference capability roughly 30 times that of Hopper for large language models. This is due to advanced features like the FP4 tensor core, the new Transformer engine, and the NVLink switch, which allows much faster communication between GPUs.
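One intuition for why a lower-precision format like FP4 helps inference: single-stream token generation tends to be memory-bandwidth-bound, since each new token streams the full set of weights through memory once. The sketch below uses an assumed aggregate bandwidth, not an actual Blackwell spec:

```python
params = 1.8e12      # model size from the talk
bandwidth = 8e12     # aggregate memory bandwidth in bytes/s (assumed)

for name, bytes_per_param in [("FP16", 2.0), ("FP4", 0.5)]:
    model_bytes = params * bytes_per_param
    tokens_per_sec = bandwidth / model_bytes  # one full weight pass per token
    print(f"{name}: ~{tokens_per_sec:.1f} tokens/s")
```

On this crude model, FP4 alone buys about a 4x speedup over FP16; the remainder of the quoted 30x would come from the larger chip, the Transformer engine, and faster GPU-to-GPU communication through the NVLink switch.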
What is the role of the Jetson Thor robotics chips in the future of AI-powered robotics?
-The Jetson Thor robotics chips are designed to power the next generation of AI-powered robots, enabling them to learn from human demonstrations and emulate human movements. These chips, along with technologies like Isaac Lab and Osmo, provide the building blocks for advanced AI-driven robotics that can assist with everyday tasks.
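As a purely hypothetical sketch of the observation-to-action loop described here (multimodal instructions plus past interactions in, the robot's next action out), consider the following; none of these class or method names are Nvidia's actual Groot or Isaac Lab APIs:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    instruction: str                                    # natural-language command
    camera_frames: list = field(default_factory=list)  # recent video frames
    joint_state: list = field(default_factory=list)    # current joint positions

class FoundationPolicy:
    """Hypothetical stand-in for a humanoid foundation model."""

    def next_action(self, history: list) -> list:
        # A real model would run inference over the full multimodal history;
        # this placeholder returns a hold-position action.
        return [0.0] * len(history[-1].joint_state)

policy = FoundationPolicy()
history = [Observation("give me a high five", joint_state=[0.0] * 7)]
action = policy.next_action(history)  # joint targets for the next control step
```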
Outlines
Growth of Large Language Models and Computational Requirements
This paragraph discusses the significant growth in the industry driven by the scaling of large language models following the invention of the Transformer. It highlights the exponential increase in computational requirements, with models doubling every six months and training-token counts rising in proportion to parameter counts. The state-of-the-art OpenAI model, with 1.8 trillion parameters, requires several trillion tokens for training, leading to an immense computational scale measured in total floating-point operations. The challenge of training such models is exemplified by the time and resources needed, which motivates breakthroughs like ChatGPT and the push toward even larger models trained on multimodal data.
Introducing Blackwell: The Next-Generation GPU Platform
The paragraph introduces Blackwell, a groundbreaking GPU platform named after David Blackwell, which represents a significant advancement in computational capabilities. It explains how Blackwell's design, with two chip dies acting as one, allows for 10 terabytes per second data transfer, eliminating memory locality and cache issues. The integration of Blackwell with existing Hopper infrastructure is discussed, along with its potential to drastically reduce the energy consumption and computational resources needed for training large AI models, such as GPT, from thousands of GPUs and megawatts to a more efficient and cost-effective solution.
Inference and the Future of AI: Blackwell's Impact
This section emphasizes the importance of training, inference, and generation in the evolution of AI. It points out the challenges of inference for large language models due to their size and the need for a system designed for generative AI like Blackwell. Blackwell's inference capability is highlighted as roughly 30 times that of Hopper, with the gains attributed to the new FP4 tensor core, Transformer engine, and NVLink switch. The potential of data centers as AI Factories, generating intelligence rather than electricity, is also discussed, along with the vision of pushing the boundaries of what AI can achieve.
Isaac and Groot: AI-Powered Robotics for the Future
The final paragraph introduces Isaac, an AI-powered robot-learning application, and Groot, a general-purpose foundation model for humanoid robot learning. It explains how these technologies enable robots to learn from human demonstrations and emulate human movements through observation. The use of Nvidia's technologies for understanding humans from videos, training models in simulation, and deploying them to physical robots is highlighted. The paragraph also discusses the Jetson Thor robotics chips designed for Groot, and how these components collectively provide the building blocks for the next generation of AI-powered robotics.
Keywords
Transformer
Parameter Count
Computational Requirements
Multimodality Data
Synthetic Data Generation
Reinforcement Learning
Hopper GPU
Blackwell GPU
AI Factory
Project Groot
Jetson Thor
Highlights
Large language models have benefited tremendously from scale, effectively doubling in size every six months.
Doubling the size of the model requires twice as much information to fill it, leading to an increase in training token count.
The computational requirements for training large language models have grown exponentially, with the latest OpenAI model estimated at 1.8 trillion parameters.
Training such large models necessitates several trillion tokens, resulting in an immense computational scale.
State-of-the-art AI models demand massive computational resources; the example given requires on the order of 30 to 50 billion quadrillion floating-point operations in total.
The development of larger models is ongoing, with plans to incorporate multimodality data including text, images, graphs, and charts.
Future models will be grounded in physics and common sense, understanding concepts like an arm not going through a wall.
The use of synthetic data generation and reinforcement learning will be key in training these models, similar to human learning processes.
A new, very large GPU named after David Blackwell is introduced, which is not a chip but a platform.
The Blackwell system features a unique design where two dies act as one chip, with 10 terabytes per second of data transfer between them.
Blackwell aims to reduce the cost and energy associated with computing, making it more efficient for training next-generation models.
Inference or generation is crucial for AI, with Nvidia GPUs in the cloud often used for token generation.
The Blackwell system has exceptional inference capability, roughly 30 times that of Hopper for large language models.
Blackwell's new Transformer engine and NVLink switch contribute to its superior performance.
Data centers will evolve into AI Factories, focused on generating intelligence rather than electricity.
Nvidia Project Groot is a general-purpose foundation model for humanoid robot learning, capable of taking multimodal instructions.
Isaac Lab and Osmo are tools developed by Nvidia for training and simulation, enabling robots to learn from human demonstrations.
The Jetson Thor robotics chips are designed for Groot, providing the building blocks for the next generation of AI-powered robotics.
Transcripts
One of the industries that benefited tremendously from scale, and you all know this one very well, is large language models. Basically, after the Transformer was invented, we were able to scale large language models at incredible rates, effectively doubling every six months. Now, how is it possible that by doubling every six months we have grown the industry, we have grown the computational requirements so far? The reason for that is quite simply this: if you double the size of the model, you double the size of your brain, and you need twice as much information to go fill it. So every time you double your parameter count, you also have to appropriately increase your training token count. The combination of those two numbers becomes the computation scale you have to support.

The latest, the state-of-the-art OpenAI model is approximately 1.8 trillion parameters. 1.8 trillion parameters required several trillion tokens to go train. So a few trillion parameters, on the order of a few trillion tokens: when you multiply the two of them together, approximately 30, 40, 50 billion quadrillion floating-point operations. Now we just have to do some CEO math right now, so just hang with me. You have 30 billion quadrillion; a quadrillion is like a peta, and so if you had a petaflop GPU, you would need 30 billion seconds to go compute, to go train that model. 30 billion seconds is approximately 1,000 years.
And here we are, as we see the miracle of ChatGPT emerge in front of us, and we also realize we have a long way to go. We need even larger models. We're going to train them with multimodality data, not just text on the internet: we're going to train them on text and images and graphs and charts, just as we learned from watching TV. And so there's going to be a whole bunch of watching video, so that these models can be grounded in physics and understand that an arm doesn't go through a wall. And so these models will have common sense, by watching a lot of the world's video combined with a lot of the world's languages. They'll use things like synthetic data generation, just as you and I do when we try to learn: we might use our imagination to simulate how it's going to end up, just as I did when I was preparing for this keynote. I was simulating it all along the way. We're sitting here using synthetic data generation. We're going to use reinforcement learning. We're going to practice it in our mind. We're going to have AI working with AI, training each other, just like student-teacher debaters. All of that is going to increase the size of our model, it's going to increase the amount of data that we have, and we're going to have to build even bigger GPUs. Hopper is fantastic, but we need bigger GPUs.
And so, ladies and gentlemen, I would like to introduce you to a very, very big GPU, named after David Blackwell. Blackwell is not a chip; Blackwell is the name of a platform. People think we make GPUs, and we do, but GPUs don't look the way they used to. Here is, if you will, the heart of the Blackwell system; inside the company it's not called Blackwell, it's just a number. And this is Blackwell sitting next to... oh, this is the most advanced GPU in the world in production today. This is Hopper. Hopper changed the world. This is Blackwell. It's okay, Hopper. You're very good. Good boy. Well, good girl.

208 billion transistors. And so you could see, I can see, there's a small line between two dies. This is the first time two dies have abutted like this together in such a way that the two dies think it's one chip. There are 10 terabytes of data between them, 10 terabytes per second, so that these two sides of the Blackwell chip have no clue which side they're on. There are no memory locality issues, no cache issues; it's just one giant chip. And so when we were told that Blackwell's ambitions were beyond the limits of physics, the engineers said, "So what?" And so this is what happened.

And so this is the Blackwell chip, and it goes into two types of systems. The first one is form-, fit-, and function-compatible with Hopper: you slide out Hopper and you push in Blackwell. That's the reason why the ramp is going to be so efficient. There are installations of Hoppers all over the world, and they could be the same infrastructure, same design; the power, the electricity, the thermals, the software, identical. Push it right back in. And so this is a Hopper version for the current HGX configuration, and this is what the other one, the second system, looks like. Now, this is a prototype board. Janine, could I just borrow... ladies and gentlemen, Janine Paul.

And so this is a fully functioning board, and I'll just be careful here. This right here is, I don't know, $10 billion. The second one's five. It gets cheaper after that, so any customers in the audience, it's okay. All right, but this one's quite expensive. This is a bring-up board, and the way it's going to go to production is like this one here.
Okay. And so you're going to take this: it has two Blackwell chips, four Blackwell dies, connected to a Grace CPU. The Grace CPU has a super-fast chip-to-chip link. What's amazing is that this computer is the first of its kind where, first of all, this much computation fits into this small of a place. Second, it's memory-coherent: they feel like they're just one big happy family working on one application together, and so everything is coherent within it. Just the amount of... you know, you saw the numbers, there's a lot of terabytes this and terabytes that. But this is a miracle. Let's see, what are some of the things on here? There's NVLink on top, PCI Express on the bottom, and on your left, one of them, it doesn't matter, one of them is the CPU chip-to-chip link. It's my left or yours, depending on which side; I was just trying to sort that out, and it just kind of doesn't matter. Hopefully it comes plugged in.

Okay, so this is the Grace Blackwell system. If you were to train a GPT model, a 1.8-trillion-parameter model, it took apparently about three to five months or so with 25,000 Amperes. If we were to do it with Hopper, it would probably take something like 8,000 GPUs, and it would consume 15 megawatts. 8,000 GPUs and 15 megawatts; it would take 90 days, about three months. And that would allow you to train something that is, you know, this groundbreaking AI model. It's obviously not as expensive as anybody would think, but it's 8,000 GPUs; it's still a lot of money. And so: 8,000 GPUs, 15 megawatts. If you were to use Blackwell to do it, it would only take 2,000 GPUs. 2,000 GPUs, same 90 days, but this is the amazing part: only four megawatts of power. So, from 15... yeah, that's right.

And that's our goal. Our goal is to continuously drive down the cost and the energy (they're directly proportional to each other) associated with the computing, so that we can continue to expand and scale up the computation that we have to do to train the next-generation models. Well, this is training.
Inference, or generation, is vitally important going forward. You know, probably some half of the time that Nvidia GPUs are in the cloud these days, they're being used for token generation. They're either doing Copilot this, or ChatGPT that, or all these different models that are being used when you're interacting with them, or generating images, or generating videos, generating proteins, generating chemicals. There's a bunch of generation going on. All of that is in the category of computing we call inference. But inference is extremely hard for large language models, because these large language models have several properties: one, they're very large, and so they don't fit on one GPU.

So now that you understand the basics, let's take a look at inference on Blackwell compared to Hopper. And this is the extraordinary thing: in one generation, because we created a system that's designed for trillion-parameter generative AI, the inference capability of Blackwell is off the charts. And in fact, it is some 30 times Hopper. Yeah. For large language models, for large language models like ChatGPT and others like it.

The blue line is Hopper. Imagine we didn't change the architecture of Hopper; we just made it a bigger chip. We just used the latest, greatest, you know, 10 terabytes per second; we connected the two chips together; we got this giant 208-billion-transistor chip. How would we have performed if nothing else changed? It turns out, quite wonderfully. Quite wonderfully, and that's the purple line, but not as great as it could be. And that's where the FP4 tensor core, the new Transformer engine, and, very importantly, the NVLink switch come in. And the reason for that is because all these GPUs have to share the results, partial products, whenever they do all-to-all, all-gather, whenever they communicate with each other. That NVLink switch is communicating almost 10 times faster than what we could do in the past using the fastest networks.
Okay, so Blackwell is going to be just an amazing system for generative AI. And in the future, data centers are going to be thought of, as I mentioned earlier, as an AI Factory. An AI Factory's goal in life is to generate revenues, to generate, in this case, intelligence in this facility, not generating electricity as the AC generators of the last Industrial Revolution did. In this Industrial Revolution, it's the generation of intelligence.

[Video narration] It's not enough for humans to imagine. We have to invent, and explore, and push beyond what's been done. [inaudible] We create, smarter and faster. We push it to fail, so it can learn. We teach it, then help it teach itself. We broaden its understanding to take on new challenges with absolute precision, and succeed. We make it perceive, and move, and even reason, so it can share our world with us.

This is where inspiration leads us: the next frontier. This is Nvidia Project Groot,
a general-purpose foundation model for humanoid robot learning. The Groot model takes multimodal instructions and past interactions as input and produces the next action for the robot to execute. We developed Isaac Lab, a robot-learning application, to train Groot on Omniverse Isaac Sim, and we scale out with Osmo, a new compute orchestration service that coordinates workflows across DGX systems for training and OVX systems for simulation. With these tools, we can train Groot in physically based simulation and transfer zero-shot to the real world.

The Groot model will enable a robot to learn from a handful of human demonstrations, so it can help with everyday tasks, and to emulate human movement just by observing us. This is made possible with Nvidia's technologies that can understand humans from videos, train models in simulation, and ultimately deploy them directly to physical robots. Connecting Groot to a large language model even allows it to generate motions by following natural-language instructions. "Hi, G1, can you give me a high five?" "Sure thing, let's high five." "Can you give us some cool moves?" "Sure, check this out."

All this incredible intelligence is powered by the new Jetson Thor robotics chips, designed for Groot, built for the future. With Isaac Lab, Osmo, and Groot, we're providing the building blocks for the next generation of AI-powered robotics.