OpenAI's 'AGI Robot' Develops SHOCKING NEW ABILITIES | Sam Altman Gives Figure 01 a Brain
Summary
TLDR: Figure AI, in partnership with OpenAI, presents a groundbreaking amalgamation of robotics and neural networks. The robot, Figure 01, demonstrates advanced capabilities such as holding full conversations, understanding and responding to visual cues, and executing complex tasks like picking up trash and handling objects. The robot operates through a combination of a large language model for reasoning and neural network policies for physical actions, showcasing a significant step towards general AI and practical applications in embodied AI.
Takeaways
- 🤖 Figure AI is a robotics company that has partnered with OpenAI to integrate robotics and AI expertise.
- 🔄 The collaboration aims to showcase the synergy of metal and neural networks, marking a significant advancement in the field.
- 🍎 Figure One, the robot, can engage in full conversations with people, demonstrating high-level visual and language intelligence.
- 🧠 The robot's actions are powered by neural networks, not pre-programmed scripts, allowing for adaptive and dexterous movements.
- 🎥 The video demonstrates the robot's ability to understand and execute tasks based on visual input and verbal commands.
- 🔄 The robot can multitask, such as picking up trash while responding to questions, showcasing its ability to handle complex tasks.
- 🤔 Figure AI's technology suggests the potential for general AI robots (AGI), with capabilities beyond scripted movements.
- 💻 The robot's neural networks are likely running on a model provided by OpenAI, possibly GPT-4 or a similar advanced model.
- 📈 The partnership between Figure AI and OpenAI highlights the potential for scaling up embodied AI technology.
- 🚀 The demonstration of the robot's capabilities, such as handling objects and understanding context, indicates significant progress in AI and robotics.
- 📌 The video serves as a clear example of the integration of AI and robotics, moving towards more autonomous and interactive machines.
Q & A
What is the significance of the partnership between Figure AI and OpenAI?
-The partnership between Figure AI and OpenAI is significant because it combines Figure AI's expertise in building robotics with OpenAI's advanced AI and neural network capabilities. This collaboration aims to create robots that can understand and interact with their environment in a more sophisticated and autonomous manner.
How does Figure AI's robot utilize neural networks?
-Figure AI's robot uses neural networks to perform a variety of tasks. These include low-level dexterous actions, such as picking up objects, as well as high-level visual and language intelligence to understand and respond to human commands and questions.
What does the term 'end-to-end neural networks' refer to in the context of the video?
-In the context of the video, 'end-to-end neural networks' refers to the complete system of neural networks that handle all aspects of a task, from perception (seeing or understanding the environment) to action (physically manipulating objects), without the need for human intervention or pre-programming.
How does the robot determine which objects are useful for a given task?
-The robot determines which objects are useful for a task by processing visual input from its cameras and understanding the context through a large multimodal model trained by OpenAI. This model uses common sense reasoning to decide which objects are appropriate for the task at hand.
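As a rough illustration of that selection step, the sketch below asks a stand-in vision-language model which visible object satisfies a request. The `query_vlm` function, the object list, and the matching rule are all hypothetical placeholders, since the video does not reveal how Figure actually wires this up.

```python
# Hypothetical sketch: asking a vision-language model which object in view is
# useful for a task. `query_vlm` is a stand-in for whatever multimodal API the
# real system uses (the video does not name the exact model).
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    graspable: bool

def query_vlm(prompt: str) -> str:
    """Stand-in for a multimodal model call; returns a canned answer here."""
    return "apple"  # a real system would reason over the camera image too

def choose_object_for_task(task: str, visible: list[SceneObject]) -> SceneObject | None:
    names = ", ".join(o.name for o in visible)
    prompt = (
        f"The user asked: '{task}'. "
        f"Visible objects: {names}. "
        "Which single object best satisfies the request? Answer with its name."
    )
    answer = query_vlm(prompt).strip().lower()
    return next((o for o in visible if o.name == answer and o.graspable), None)

scene = [SceneObject("apple", True), SceneObject("cup", True), SceneObject("drying rack", False)]
print(choose_object_for_task("Can I have something to eat?", scene))  # -> the apple
```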
What is the role of the 'speech to speech' component in the robot's operation?
-The 'speech to speech' component allows the robot to engage in full conversations with humans. It processes spoken commands, understands the context, and generates appropriate verbal responses through text-to-speech technology.
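A minimal sketch of such a loop, assuming three generic stages (speech recognition, a language model, and text-to-speech). The stub functions are placeholders, not Figure's or OpenAI's actual interfaces.

```python
# Hedged sketch of a speech-to-speech loop: audio in, reply audio out.
def transcribe(audio: bytes) -> str:
    """Speech -> text (e.g., a Whisper-style ASR model)."""
    return "Can I have something to eat?"

def generate_reply(history: list[dict], user_text: str) -> str:
    """Text (plus prior turns) -> reply text, via a large language model."""
    history.append({"role": "user", "content": user_text})
    reply = "Sure thing."  # a real model would condition on history and images
    history.append({"role": "assistant", "content": reply})
    return reply

def synthesize(text: str) -> bytes:
    """Text -> audio waveform via text-to-speech."""
    return text.encode()  # placeholder for real TTS audio

def speech_to_speech(audio_in: bytes, history: list[dict]) -> bytes:
    return synthesize(generate_reply(history, transcribe(audio_in)))

conversation: list[dict] = []
audio_out = speech_to_speech(b"<microphone frames>", conversation)
```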
How does the robot handle ambiguous or high-level requests?
-The robot handles ambiguous or high-level requests by using its AI model to interpret the request in the context of its surroundings and the objects available. It then selects and executes the most appropriate behavior to fulfill the command, such as handing a person an apple when they express hunger.
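For illustration, here is a toy version of mapping an ambiguous request onto one of a small library of learned behaviors. The skill names and the keyword matching are invented; the real system reportedly lets the language model make this choice.

```python
# Toy sketch: translating an ambiguous request ("I'm hungry") into one of a
# fixed set of learned behaviors. Everything here is illustrative.
SKILLS = {
    "hand_over_apple": "Pick up the apple and hand it to the person",
    "place_dishes_in_rack": "Move the plate and cup into the drying rack",
    "pick_up_trash": "Collect trash from the table and put it in the bin",
}

def select_behavior(request: str) -> str:
    """Very rough stand-in for LLM-based selection over the skill library."""
    text = request.lower()
    if "hungry" in text or "eat" in text:
        return "hand_over_apple"
    if "dishes" in text or "put them there" in text:
        return "place_dishes_in_rack"
    return "pick_up_trash"

print(select_behavior("I'm hungry"))  # -> hand_over_apple
```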
What is the significance of the robot's ability to reflect on its memory?
-The robot's ability to reflect on its memory allows it to understand and respond to questions about its past actions. This capability provides a richer interaction experience, as the robot can explain why it performed certain actions based on its memory of the situation.
How does the robot's whole body controller contribute to its stability and safety?
-The whole body controller ensures that the robot maintains balance and stability while performing actions, such as picking up objects or moving around. It manages the robot's dynamics across all its 'degrees of freedom', which include leg movements, arm positions, and other body adjustments necessary for safe and effective task execution.
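The sketch below shows the general idea at toy scale: a controller that tracks joint setpoints with a PD law while clamping commands to limits. The gains, limits, and 24-joint layout are assumptions, and a real whole-body controller additionally reasons about balance and contact forces.

```python
# Illustrative only: a toy "whole body" step that tracks joint setpoints with
# a PD law and clamps commands. Gains and limits are made-up numbers.
import numpy as np

KP, KD = 50.0, 2.0                      # assumed PD gains
JOINT_LIMIT = np.deg2rad(150)           # assumed symmetric joint limit

def whole_body_step(q, dq, q_setpoint):
    """One control tick: joint positions q, velocities dq, target q_setpoint."""
    torque = KP * (q_setpoint - q) - KD * dq       # PD tracking of the setpoint
    return np.clip(torque, -100.0, 100.0)          # crude actuator limit

q = np.zeros(24)                         # e.g., 24 degrees of freedom
dq = np.zeros(24)
target = np.clip(np.full(24, 0.3), -JOINT_LIMIT, JOINT_LIMIT)
tau = whole_body_step(q, dq, target)     # would run at a high rate (e.g., 1 kHz)
```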
What is the role of the 'visuomotor transformer policies' in the robot's learned behaviors?
-The 'visuomotor transformer policies' are neural network policies that map visual input directly to physical actions. They take in onboard images at a high frequency and generate detailed instructions for the robot's movements, such as wrist poses and finger joint angles, enabling the robot to perform complex manipulations and react quickly to its environment.
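A hedged, toy-scale stand-in for such a policy is sketched below: camera pixels in, a 24-dimensional action vector (e.g., wrist pose plus finger joint targets) out. The architecture, sizes, and rates are placeholders; only the overall pixels-to-actions shape mirrors the description.

```python
# Toy visuomotor policy: one camera frame -> a 24-dimensional action.
import torch
import torch.nn as nn

class ToyVisuomotorPolicy(nn.Module):
    def __init__(self, action_dim: int = 24, d_model: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(            # crude image encoder
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, d_model, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, action_dim)  # wrist pose + finger angles

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder(images).unsqueeze(1)   # (batch, 1, d_model)
        return self.head(self.transformer(tokens)[:, 0])

policy = ToyVisuomotorPolicy()
frame = torch.rand(1, 3, 224, 224)        # one onboard camera frame
action = policy(frame)                    # shape (1, 24): high-rate setpoints
```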
What are some of the challenges that Figure AI and OpenAI aim to overcome with their partnership?
-Figure AI and OpenAI aim to overcome challenges related to creating robots that can operate autonomously in complex, real-world environments. This includes developing AI that can understand and reason about its surroundings, learn from new experiences, and execute a wide range of physical tasks without human intervention.
Outlines
🤖 Introduction to Figure AI and OpenAI Partnership
The script introduces Figure AI, a robotics company that has partnered with OpenAI to combine their expertise in robotics with OpenAI's capabilities. The video showcases the first instance of the amalgamation of metal and neural networks, highlighting the significance of the collaboration. The robot, Figure One, is seen interacting with its environment, picking up trash, and engaging in conversation, demonstrating the integration of high-level visual and language intelligence with dextrous robotic actions.
🧠 Behind the Scenes: Neural Networks and Multitasking
This paragraph delves into the technical aspects of how Figure One operates, emphasizing the role of neural networks in its functions. It explains that the robot is not pre-programmed for specific motions or responses, but rather learns from a neural network. The video clarifies that everything shown is powered by neural networks, and the robot's actions are not scripted. It also compares Figure One's capabilities to Google DeepMind's RT2 robot, highlighting the advanced level of Figure One's learning from vision and language models.
🗣️ Conversational Abilities and Memory
The script discusses Figure One's ability to have full conversations with people, thanks to the partnership with OpenAI. It explains how the robot can describe its visual experience, plan future actions, and reflect on its memory, all verbally. The paragraph also touches on the technical aspects of how the robot processes conversation history and images to generate language responses and execute actions, highlighting the integration of large multimodal models trained by OpenAI.
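To make the memory idea concrete, here is a small sketch that keeps a rolling multimodal conversation history and resolves a follow-up like "can you put them there?" from context. The resolver is a trivial stand-in for the multimodal model; the real behavior is learned, not keyword-matched.

```python
# Hedged sketch of "short-term memory": keep the full conversation (text turns
# plus references to recent camera frames) and re-read it for every request.
from collections import deque

history: deque = deque(maxlen=50)          # rolling window of turns and frames

def add_turn(role: str, text: str, frame_id: str | None = None) -> None:
    history.append({"role": role, "text": text, "frame": frame_id})

def resolve_request(request: str) -> str:
    """Stand-in for the multimodal model reading the whole history."""
    recent = " ".join(t["text"] for t in history)
    if "them" in request and "dishes" in recent and "drying rack" in recent:
        return "place plate and cup on the drying rack"
    return "ask for clarification"

add_turn("user", "Where do the dishes in front of you go next?", frame_id="t0.jpg")
add_turn("assistant", "They are likely to go into the drying rack next.")
print(resolve_request("Can you put them there?"))
```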
🤹‍♂️ Advanced Robotics and Learning from Observation
This section of the script focuses on the advanced capabilities of Figure One, such as learning from observations and performing complex tasks like manipulating deformable objects. It explains how the robot's behaviors are learned and not teleoperated, and how it can multitask and respond to commands while carrying out actions. The script also discusses the robot's ability to understand and execute ambiguous requests, showcasing its advanced reasoning and manipulation skills.
🎉 Conclusion and Future Prospects
The script concludes with a reflection on the impressive advancements in robotics and AI, particularly the partnership between Figure AI and OpenAI. It acknowledges the skepticism around such technologies but emphasizes the significant progress made. The speaker expresses excitement about the future of embodied AI and the potential for Figure AI to scale up this technology. The script ends with a call to action for viewers interested in joining Figure AI's efforts in advancing robotics and AI.
Keywords
💡Figure AI
💡OpenAI
💡Neural Networks
💡Robotics
💡Conversations
💡Multitasking
💡AGI (Artificial General Intelligence)
💡Vision Language Model
💡Teleoperation
💡Embodied AI
💡Multimodal Model
Highlights
Figure AI is a robotics company that has partnered with OpenAI to combine their expertise in building robotics with AI.
The collaboration aims to showcase the amalgamation of metal and neural networks, marking a significant advancement in the field.
Figure One, a robot developed by Figure AI, is capable of having full conversations with people, demonstrating high-level visual and language intelligence.
The robot's actions are not scripted or pre-programmed, but rather determined by neural networks, allowing for independent and adaptive behavior.
The robot can describe its visual experience, plan future actions, reflect on its memory, and explain its reasoning verbally.
Figure One's capabilities are the result of a partnership with OpenAI, integrating their models to provide advanced visual and language processing.
The robot's behaviors are learned, not teleoperated, meaning it operates at normal speed without human mimicry or VR guidance.
Figure One learns from both web data and robotics data, watching videos and reading information to understand and apply knowledge.
The robot's system involves a large multimodal model trained by OpenAI that understands both images and text.
Figure One can process the entire history of a conversation, including past images, to come up with language responses.
The robot can decide which learned closed-loop behavior to run to fulfill a given command, loading neural network weights onto the GPU and executing a policy.
Figure One can multitask, explaining its actions in real-time as it performs them, such as picking up trash while responding to a question.
The robot's system runs a loop of observe, decide, act, allowing it to balance while performing tasks like picking up objects (a minimal sketch of this loop follows this list).
Figure One can reflect on memory, with a pre-trained model analyzing the conversation's image and text history, showing short-term memory capabilities.
The robot's low-level bimanual manipulation is driven by neural network visuomotor transformer policies, mapping pixels directly to actions.
Figure One's whole body controller ensures safe, stable dynamics, allowing it to maintain balance during tasks like grasping and lifting.
The partnership between Figure AI and OpenAI is expected to significantly advance the field of embodied AI, with Figure AI being a prime location for scaling up this technology.
The demonstration of Figure One's capabilities has shifted perceptions of advanced robotics, with even critics reconsidering their past skepticism.
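Referenced from the observe-decide-act highlight above, here is a simplified sketch of that loop, including on-demand loading of behavior policy weights before execution. Every name, path, and rule in it is illustrative rather than Figure's actual implementation.

```python
# Simplified observe-decide-act loop with lazy loading of behavior policies.
import time

BEHAVIORS = {"hand_over_apple": "weights/apple.pt", "pick_up_trash": "weights/trash.pt"}
loaded: dict[str, str] = {}

def observe() -> dict:
    return {"image": "<camera frame>", "speech": "Can I have something to eat?"}

def decide(obs: dict) -> str:
    # stand-in for the language model choosing a learned closed-loop behavior
    return "hand_over_apple" if "eat" in obs["speech"] else "pick_up_trash"

def act(behavior: str) -> None:
    if behavior not in loaded:                     # load policy weights once
        loaded[behavior] = BEHAVIORS[behavior]     # a real system would move them to the GPU
    print(f"running policy {behavior} from {loaded[behavior]}")

for _ in range(3):                                 # the robot loops continuously
    act(decide(observe()))
    time.sleep(0.1)                                # placeholder control period
```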
Transcripts
so figure AI is the robotics company
that partnered up with open AI to
combine their expertise in building
robotics with OpenAI's, well, AI and this is
the first time we can behold the
amalgamation of metal and neural Nets
now I'm exaggerating but this is kind of
a big deal take a
look
hey figure one what do you see right
now I see a red apple on a plate in the
center of the table a drying rack with
cups and a plate and you standing nearby
with your hand on the table great can I
have something to
eat sure
thing great
can you explain why you did what you
just did while you pick up this
trash on it so I gave you the Apple
because it's the only uh edible item I
could provide you with from the
table great so based on the scene right
now where do you think the dishes in
front of you go
next
the dishes on the table like that plate
and cup are likely to go into the drying
rack next great can you put them
[Music]
there of
[Music]
course
[Music]
all right so how do you think you
did I I think I did pretty well the
apple found its new owner the trash is
gone and the table wear is right where
it
belongs I agree thanks figure
one you're welcome if there's anything
else you need help with just let me
know
[Applause]
[Music]
so they're saying with OpenAI
figure one can now have full
conversations with people OpenAI
models provide high level visual and
language intelligence Figure's neural
networks deliver fast low-level dexterous
robot actions everything in this video
is a neural network this I think is as
clear as it gets previous videos I think
I maybe questioned like exactly what was
happening like I wish they kind of
spelled out I mean this is pretty pretty
clear in what they're saying here
everything here is a neural network mean
the robot isn't scripted it's not
pre-programmed for certain motions it's
not pre-programmed to answer in a
certain way that's you know ChatGPT
that's GPT-4 or higher maybe that's
answering the questions and then the
robot is moving based on the neural
networks that figure AI has it sounds
like already independently developed
just to make sure that the people that
are like asking questions kind of like
know so this is filmed at 1.0x speed right
end-to-end neural networks speech to
speech reasoning and this is the Figure
01 the robot plus OpenAI and look I
think it's fair that people are
skeptical we had somebody you know roll
a truck off a hill and then be like this
thing really functions and you know it
did not but at this point it's pretty
obvious I feel like to most people that
kind of been following these
developments that okay they they really
have the real deal you know for example
if Google DeepMind's RT-2 robot you know
they have they kind of explain what's
running it they have this Vision
language action model so almost kind of
like you can think of it of as like a
large language model with vision with
this action model kind of on top of it
that translates with the vision language
models kind of reasoning and seeing
translates that into movements of the
hand and the grippers ETC so text goes
in here and again this doesn't have
anything to do with figure this is
Google DeepMind but maybe figure has
something similar right so you have the
large language model here and all the
speech to speech right so here's what
it should do and the language model
like reasons through it you also have
like the um Vision input right that kind
of combines and the output is in terms
of the movements of the robots so kind
of it's movements in the physical space
and that's why you're able to say
somewhat weird things like you know move
banana right as you can see in this
image there's a couple flags on this
table and there's a banana and you get
the command move banana to Germany right
and it figures out you know kind of what
you mean so it puts it on top of the
flag on top of the correct flag right it
doesn't pick up the banana and head to
Frankfurt so it would be very curious to
know if they're doing something like
this now of course it's a small startup
they're not going to post all their
Insider secrets on the internet maybe
sometime in the future we'll find out
but it is getting pretty impressive like
when you tell it to pick up the trash
while responding to a question you know
you know you tell me to do that like
that's something that I might struggle
with myself multitasking hard so it
looks like this Google DeepMind RT-2
paper was published in uh July 2023 and
here's Corey Lynch so Corey Lynch works at
figure AI so he's describing how this
robot is able to do what it's doing so
Corey his background was robotics at
Google and looks like he joined figure
AI uh around July 2023 crazy so he's
saying we are now having full
conversations with figure 01 oh is this
him I wonder so that's Cor
that that might be him interacting with
the robot so he's saying we are not
having full conversations with figure
one thanks to our partnership with open
AI our robot can describe its visual
experience plan future actions reflect
on its memory explain its reasoning
verbally technical Deep dive thread so
he's saying let's break down what we see
in the video so all the behaviors are
learned not teleoperated so this is kind
of a sticking point with some of these
demonstrations we've seen some
incredible things done by robots for
fair like low price points but they're
done with tele operation so for example
you're seeing here he's a robot cooking
a three course uh Cantonese meal right
grilling up some delicious steaks or
chicken eggs vegetables cracking some
eggs like this is absolutely incredible
so this is teleoperated and what that
means is for example here's an example
of it so as you can see here there's a
person standing behind it and they're
kind of mimicking those gestures up in
the air and the robot kind of like
repeats them another way of doing it is
with virtual reality where you kind of
demonstrate those movements and the
robot kind of repeats it and the robots
can learn from this and can generalize
but that's quite different from
something like this this RT-2 from Google
DeepMind you can see here it's kind of
showing you is like sucking up all the
information right it's watching all
these videos and all the previous robots
and the Wikipedia and YouTube and
everything everything everything so as
they say here it learns from both web
and Robotics data so it just looks at
things and reads things and it knows
things right it it drinks and it knows
things so they're kind of specifying
here okay so it's not teleoperated and
it's running at normal speed so the
camera footage is not sped up and again
we all kind of appreciate that are
skeptical because they've been burnt
before so I I think it just helps to
have little disclaimers like this
somewhere because I mean for the people
that follow this stuff you know I think
to most people probably this is what
they think of like Peak Advanced
robotics right the dancing robots like
to somebody that's not really following
looks at this and goes okay this this is
probably the most advanced thing there
ever is look at the Motions but the
people that are following this stuff
they're like yeah cool but this thing
learning from watching videos how to
operate in the real world is much cooler
right it being able to generalize like
if you say I need to hammer a nail what
objects from this scene might be useful
it's like let's grab a rock and then
prints out the action of how to use that
rock to hammer whatever it needs to
hammer kind of using that Chain of
Thought reasoning which is something
that we know from interacting with large
language models like ChatGPT like you you
know that ChatGPT can probably figure out
the rock can do something like this
right and then it links to the execution
the code that manipulates that rock in
the physical world right so this idea of
vision language model you know it may
not be pretty but this is what Peak
Performance looks like and it seems like
figure is there it's you know there's
definitely competition for these general
robots so I think people begin to kind
of refer to them off hand as like AGI
robots as in to point out that these are
General robots which you know as cool as
these are they probably wouldn't say
this about the Boston Dynamic robots as
impressive as the the movements are the
movements are prescripted they're
following a routine so he continues we
feed images from the robots cameras and
transcribed text from speech captured by
the onboard microphones to a large
multimodal model trained by open AI that
understands both images and texts so
it's interesting that they don't spell
which model so they don't say GPT 4 you
know with vision they're like a model
maybe Jimmy Apples will clue us in so I
can't find the Tweet but somewhere a
while back he was saying that's weird
somewhere he was saying that um there's
a robot heading to your favorite AI lab
and I feel like maybe it was uh related
to this and so we feed images from the
robot's cameras and transcribe text from
the speech captured by onboard
microphone so this is a little detail
that's interesting so OpenAI has their
open source Whisper which is a you know
audio to text transcription I believe
it's open source as part of their kind
of Suite of products so in a previous
video I kind of try to paint AGI as a
series of parts that you're trying to
combine together and this is kind of a
good representation of it right cuz
think about what we have here we have
figure this robotics company likely
running with OpenAI's Whisper a
large language model it could be GPT-4
could be something else something
specific right something of vision that
can understand images and text and all
that is hooked into the robot so all the
pieces are working together to create
this robot with a general ability which
is kind of what we think of AGI as we
don't have a specific definition but you
know obviously this is moving in that
direction if it can like walk around and
learn and reason like if it can hear
commands and then respond to them like
that's like Are We There Are we almost
there at the very least close the model
processed the entire history of the
conversation including past images to
come up with language responses which
are spoken back to the human via text to
speech the same model is responsible for
deciding which learned closed loop
Behavior to run on the robot to fulfill
a given command loading particular
neural network weights uh onto the GPU
and executing a policy and so this last
part that to me reads as different from
what Google Deep Mind has with their
little like they kind of toonize those
movements so here's from Google Deep
Mind to control a robot it must be
trained to Output actions we address
this challenge by representing actions
as tokens in the model's output similar
to language tokens and describe actions
as strings that can be processed by
standard natural language tokenizers
shown here so if somebody super smart in
the comments can maybe unpack this a
little bit just put the word GPU
somewhere in the comment and I'll kind
of search by that but you know what does
this tell us so this is something
different from the rt2 and I'll do my
best to post an update if we get more
information but to me I'm reading this
as there's a maybe like a finite
pre-trained number of actions that it
has like pick up or or whatever I
don't know you know pour some liquid
into this thing or push a button and
then the GPT model the open AI model
kind of selects which specific thing to
run and then runs it I mean I don't know
if that's true I I might be misreading
it but we'll we'll know more soon
hopefully so speech to text this is
where the person says can I have
something to eat right that feeds into
the open AI model Common Sense reasoning
from images it responds sure thing then
it goes into Behavior selection so it
has sort of these preexisting seems like
neural network policies for fast
dextrous manipulation and so those lead
into the whole body controller of safe
stable Dynamics so that's whole body
movement from legs and arms and Etc so
balancing while it's you know picking
stuff up Etc and then at the same time
the vision the robot as it's seeing
things right so it's seeing what it's
doing that kind of feeds back into you
know the GPT open AI model and the
neural network policies which is kind of
I guess an OODA loop orient observe
decide act all right so here's the
command here's like the acknowledgement
sure and then it goes into this Loop to
complete the task but interestingly it
sounds like it it can multitask so it
can like tell you what it's doing as
it's doing it and so the continue
connecting figure1 to a large
pre-trained multimodal model gives us
some interesting new capabilities so
again that's the big Point here that we
have a company good at building metal
robots figure one we have a company
that's excellent at building AI neural
Nets open AI so one plus the other now
it's able to describe its surroundings
use common sense reasoning when making
decisions for example the dishes on the
table like that plate and cup are likely
to go into the drying rack next
translate ambiguous high-level requests like
I'm hungry to some context appropriate
behavior like hand the person an apple
so that was the big test with RT-2 slash RT-X
whatever you want to call you know the
RT series of models you could give it a
general request I think one of the
examples was they said I'm tired and
it's like okay you know we have a Red
Bull would you like a Red Bull which you
know shows that we make them in our
image don't we and then describe why it
executed a particular action in plain
English for example it was the only
edible item I could provide you with
from the table and a large pre-trained
model that understands conversation
history gives figure one a powerful
short-term memory consider the question
can you put them there where does them
refer to and where is there answering
correctly requires the ability to uh
reflect on memory with a pre-trained
model analyzing the conversation's image
and text history figure one quickly
forms and carries out the plan place the
cup on the drying rack place the plate
on the drying rack so one and two steps
one and two so I'm rewinding this to see
so when GPT 4 Vision just came out there
was a report testing a lot of its
abilities and so one thing that they
found is it's really good at figuring
out what we're pointing at so if you're
watching on video like if I do this like
I'm pointing to something like you know
which word I'm pointing to right and so
the vision models also pick up on that
very well like if I have an arrow
pointing to something like you know what
I'm pointing to so do they like the the
models they understand it very well
instead of for example needing to show
it coordinates or having to Circle
something like or or anything like that
basically just an arrow or something
pointing at something works very very
well so I was curious if he had to if he
was pointing at all or it was just sort
of an in context so I'm not seeing any
sort of visual Clues other than the
verbal command but I'm going to make my
prediction now we as a society as humans
we will be pointing a lot more I'm just
guessing but like if you point at
something and you state an action that
might be like the easiest way to give
the most amount of information about
what needs to be done so next they
continue finally let's talk about the
learned low-level bimanual manipulation
so all behaviors are driven by neural
network visuomotor transformer policies
mapping pixels directly to actions these
networks take in onboard images at 10
Hertz and generate 24-DOF actions so in
robotics this is degrees of freedom
so for example wrist poses and finger
joint angles at 200 Hertz these actions
serve as high rate set points for the
even higher rate whole body controller
to track this is a useful separation of
concerns internet pre-trained models do
common sense reasoning over images and
text to come up with a high-level plan
learned visuomotor policies execute the
plan performing fast reactive behaviors
that are hard to specify manually like
manipulating a deformable bag in any
position so like if you're picking up a
bag of potato chips right it's not hard
it's not solid so it will deform as you
grab it right you're not going to be
able to grab it if you don't provide
enough pressure If you provide too much
pressure you'll pop it so those look
like trash bags of potato chips so it's
grabbing it and even though it's kind of
crumpling under the weight the robot's
still able to grasp it and throw it in
the trash they continue meanwhile a
whole body controller ensures safe
stable Dynamics for example maintaining
balance even just a few years ago I
would have thought having a full
conversation with a humanoid robot while
it plans and carries out its own fully
learned behaviors would be something we
would have to wait decades to see
obviously a lot has changed in my mind
figure is the best place in the world
right now to scale up embodied Ai and so
they are recruiting recruiting hard so
you can go to figure.ai/careers yeah I
mean it's very exciting even more so
with the new partnership with openai
which they announced so looks like the
first time I think we've heard of it was
this was February 29th so they're saying
openai plus humanoid robots and you know
at the time they posted this you know
where the robot makes a cup of coffee
and you know I think it was well
received although there were some
criticism there was a lot about this
demonstration that is fairly simple like
the coffee maker that they've used the
specific model is often used for
robotic demonstrations because it's made
to be very simple to use you don't need
multiple digits you can have just one
claw like appendage you basically need
one finger to operate it you know
assuming that the Keurig cup is in there
basically you just need to push the
handle down push a button it's done and
I think I even at the time said oh maybe
you know just I wasn't sure how
impressive it was just kind of that
demonstration alone kind of out of
context but I mean now it's definitely
shaping up and getting a lot more
exciting also I find myself regretting
more and more every time that I've said
anything negative about robots in the
past now as it's getting better and it's
talking you know you start reconsidering
some of your past words like maybe I
should not be talking crap about this
thing so with that said I think Figure AI
has excellent robots amazing robots I
love the robots maybe now Boston
Dynamics will uh be motivated enough to
learn a new TikTok dance crap I got to
stop doing that Boston Dynamics is great
Boston Dynamics makes good robots
anyways my name is Wes Roth thank you
for watching