OpenAI's 'AGI Robot' Develops SHOCKING NEW ABILITIES | Sam Altman Gives Figure 01 a Brain
Summary
TLDR: Figure 01, a robot developed through a collaboration between Figure AI and OpenAI, is drawing attention. The robot runs high-level neural networks that understand vision and language, and it can hold full conversations with people. It recognizes food on a table, tidies away trash, and moves dishes to a drying rack, all while describing what it is doing. Every action is learned by neural networks, and the robot moves naturally in response to human instructions.
Takeaways
- 🤖 Figure AI and OpenAI have partnered, combining expertise in robotics with expertise in neural networks.
- 🔧 The robot is a fusion of metal and neural networks, with advanced vision and language capabilities.
- 🍎 Figure 01 can pick out food and tidy up based on the scene it is given.
- 🗑️ The robot can move items on the table to where they belong, and it can multitask.
- 🌐 All of its movements are learned by neural networks rather than pre-programmed.
- 📚 Figure AI uses technology it developed independently, and the details may not be made public.
- 🤔 The robot invites comparison with Google DeepMind's RT-2 robot, though the learning approach may differ.
- 💬 Figure 01 processes the conversation history, giving it short-term memory and letting it respond to general requests.
- 🔄 The robot combines visual information with a language model to form and carry out high-level plans.
- 📈 Figure AI positions itself as the best place to scale up embodied AI and is actively recruiting.
Q & A
What kind of company is Figure AI?
-Figure AI is a company specializing in robotics that has partnered with OpenAI to combine AI with robot technology.
What is the purpose of the partnership between Figure AI and OpenAI?
-The goal of the partnership is to combine OpenAI's high-level visual and language intelligence with Figure AI's fast, precise robot actions.
How can Figure AI's robot interact with people?
-The robot understands visual information, converses through speech, and can answer questions.
How does Figure AI's robot carry out tasks?
-The robot visually perceives its environment, decides what to do based on that information, and completes tasks by manipulating objects.
What is the difference between Google DeepMind's RT-2 robot and Figure AI's robot?
-Google DeepMind's RT-2 robot uses its own "vision-language-action model," while Figure AI's robot is integrated with OpenAI's technology.
What information is there about how the robot learns its movements?
-The robot learns from web and robotics data, combining visual information with actions so it can act appropriately for the situation.
How does Figure AI's robot handle multitasking?
-The robot can handle multiple tasks at once, carrying out physical tasks while holding a conversation.
What characterizes the technical integration between Figure AI and OpenAI?
-Through the integration, the robot understands visual information and language and can carry out complex tasks on its own.
What does teleoperation of a robot mean?
-Teleoperation is the process of a human operating a robot remotely and controlling its movements.
What is the outlook for Figure AI's robot?
-Figure AI's robot represents the progress of AI and robotics, and it is expected to be able to perform an even wider range of tasks in the future.
Outlines
🤖 The Figure AI and OpenAI Partnership
This paragraph explains the first result of Figure AI's partnership with OpenAI, which combines robotics with neural networks. The robot, Figure 01, is shown recognizing objects on a table, picking up trash and placing dishes in a drying rack while conversing with a person and explaining its actions. This is presented as the result of joining metal with neural networks.
🧠 OpenAI's Visual and Language Intelligence
This paragraph explains that OpenAI's models provide high-level visual and language intelligence, while Figure's neural networks deliver fast, precise robot actions. Because everything in the video is a neural network, the robot's movements and answers are not scripted; they follow from what it has learned. The paragraph also compares the robot with Google DeepMind's RT-2 and suggests how far Figure AI has come.
🗣️ Integrating Conversation and Action
This paragraph describes Figure 01's ability to describe its visual experience mid-conversation, plan future actions, and explain its reasoning by reflecting on its memory. In a technical deep dive, it is stressed that all of the behaviors are learned rather than teleoperated. Differences from Google DeepMind's RT-2 are pointed out, highlighting Figure AI's new approach.
🔄 Integrating the Vision-Language Model
This paragraph explains how a vision-language model that understands images and text produces a high-level plan, which learned visuomotor policies then carry out. It also details the differences from Google DeepMind's approach and shows concretely how Figure AI's robot operates.
👋 Thanks and Outlook
In the final paragraph the speaker, Wes Roth, thanks the viewers, notes how the collaboration between Figure AI and OpenAI has progressed since the partnership was announced, and closes with a joke about Boston Dynamics.
Keywords
💡Figure AI
💡OpenAI
💡Neural Networks
💡Robotics
💡Conversations
💡Multitasking
💡Dexterity
💡Vision Language Action model
💡Teleoperation
💡Embodied AI
💡AGI (Artificial General Intelligence)
Highlights
Figure AI, a robotics company, has partnered with OpenAI to combine expertise in robotics with AI.
The collaboration aims to showcase the amalgamation of metal and neural networks, marking a significant technological advancement.
Figure 01, the robot, is capable of having full conversations with people, thanks to the integration of OpenAI models.
The robot's actions are not scripted or pre-programmed, but rather determined by neural networks, indicating a high level of autonomy.
Figure AI's robot demonstrates the ability to pick up trash, respond to questions, and perform multitasking.
The robot's neural networks enable fast, low-level, dextrous actions without the need for explicit programming.
The robot can describe its visual experience, plan future actions, reflect on its memory, and explain its reasoning verbally.
All behaviors exhibited in the demonstration video are learned, not teleoperated, and occur at normal speed.
The robot learns from both web data and robotics data, incorporating a wide range of information into its operations.
Figure AI's robot can understand and execute complex commands, such as moving a banana to a specific location.
The robot's capabilities are a result of the partnership between Figure AI's robotics expertise and OpenAI's AI models.
The robot's system involves a large multimodal model trained by OpenAI that understands both images and text.
The robot processes the entire history of the conversation, including past images, to come up with language responses.
Figure AI's robot can reflect on memory and conversation history to inform its actions and decisions.
The robot's neural network policies enable it to perform fast, dextrous manipulation and stable dynamics for whole body movement.
The robot's vision system is adept at understanding pointing gestures and other visual cues from humans.
The robot's behaviors are driven by neural network visuomotor Transformer policies, mapping pixels directly to actions (see the sketch after this list).
Figure AI is a leading place for scaling up embodied AI, and they are actively recruiting talent to further advance the technology.
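Taken together, the items above describe a two-level system: an internet-pretrained multimodal model reasons over the conversation's text and images and decides what to do, while learned visuomotor policies handle the fast low-level control. Below is a minimal, hypothetical Python sketch of that loop; neither Figure nor OpenAI has published code for this, and every name in it (transcribe, plan_with_multimodal_model, run_learned_behavior, and so on) is an illustrative stand-in rather than a real API.

```python
# Hypothetical sketch of the control flow described above. All names are
# illustrative stand-ins; neither Figure AI nor OpenAI has published this code.
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str                    # "human" or "robot"
    text: str
    image: bytes | None = None   # camera frame captured alongside this turn


@dataclass
class Conversation:
    turns: list[Turn] = field(default_factory=list)


def transcribe(audio: bytes) -> str:
    """Stand-in for onboard speech-to-text (the video speculates something like Whisper)."""
    return "can I have something to eat"


def plan_with_multimodal_model(history: Conversation) -> tuple[str, str]:
    """Stand-in for the OpenAI multimodal model: reasons over the whole
    conversation (text plus past images) and returns a spoken reply and the
    name of a learned closed-loop behavior to run."""
    return "Sure thing.", "hand_over_apple"


def speak(text: str) -> None:
    print(f"[text-to-speech] {text}")


def run_learned_behavior(name: str) -> None:
    """Stand-in for a fast visuomotor policy mapping pixels to actions,
    tracked by an even faster whole-body controller."""
    print(f"[executing learned behavior] {name}")


def interaction_step(history: Conversation, audio: bytes, frame: bytes) -> None:
    text = transcribe(audio)
    history.turns.append(Turn("human", text, frame))
    reply, behavior = plan_with_multimodal_model(history)
    speak(reply)                     # the robot can talk while it acts
    run_learned_behavior(behavior)   # low-level control happens here
    history.turns.append(Turn("robot", reply))


interaction_step(Conversation(), audio=b"", frame=b"")
```

The point of the sketch is the division of labor: one slow, general model decides what to do and explains why, while separate fast policies decide how to move.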
Transcripts
so figure AI is the robotics company
that partnered up with open AI to
combine their expertise in building
robotics with OpenAI's, well, AI and this is
the first time we can behold the
amalgamation of metal and neural Nets
now I'm exaggerating but this is kind of
a big deal take a
look
hey figure one what do you see right
now I see a red apple on a plate in the
center of the table a drying rack with
cups and a plate and you standing nearby
with your hand on the table great can I
have something to
eat sure
thing great
can you explain why you did what you
just did while you pick up this
trash on it so I gave you the Apple
because it's the only uh edible item I
could provide you with from the
table great so based on the scene right
now where do you think the dishes in
front of you go
next
the dishes on the table like that plate
and cup are likely to go into the drying
rack next great can you put them
[Music]
there of
[Music]
course
[Music]
all right so how do you think you
did I I think I did pretty well the
apple found its new owner the trash is
gone and the tableware is right where
it
belongs I agree thanks figure
one you're welcome if there's anything
else you need help with just let me
know
[Applause]
[Music]
so they're saying with OpenAI Figure 01 can now have full conversations with people OpenAI
models provide high level Visual and
language intelligence figure neural
networks deliver fast low-level dextrous
robot actions everything in this video
is a neural network this I think is as
clear as it gets previous videos I think
I maybe questioned like exactly what was
happening like I wish they kind of
spelled out I mean this is pretty pretty
clear in what they're saying here
everything here is a neural network meaning
the robot isn't scripted it's not
pre-programmed for certain motions it's
not pre-programmed to answer in a
certain way that's you know ChatGPT that's GPT-4 or higher maybe that's
answering the questions and then the
robot is moving based on the neural
networks that figure AI has it sounds
like already independently developed
just to make sure that the people that
are like asking questions kind of like
know so this is filmed at 1.0x speed right end-to-end neural networks speech-to-speech reasoning and this is the Figure 01 robot plus OpenAI and look I
think it's fair that people are
skeptical we had somebody you know roll
a truck off a hill and then be like this
thing really functions and you know it
did not but at this point it's pretty
obvious I feel like to most people that
kind of been following these
developments that okay they they really
have the real deal you know for example
if Google DeepMind's RT-2 robot you know
they have they kind of explain what's
running it they have this Vision
language action model so almost kind of
like you can think of it of as like a
large language model with vision with
this action model kind of on top of it
that translates with the vision language
models kind of reasoning and seeing
translates that into movements of the
hand and the grippers ETC so text goes
in here and again this doesn't have
anything to do with figure this is
Google DeepMind but maybe figure has
something similar right so you have the
large language model here and all the
speech to speech right so here's what
the robot should do and the language model
like reasons through it you also have
like the um Vision input right that kind
of combines and the output is in terms
of the movements of the robots so kind
of it's movements in the physical space
and that's why you're able to say
somewhat weird things like you know move
banana right as you can see in this
image there's a couple flags on this
table and there's a banana and you get
the command move banana to Germany right
and it figures out you know kind of what
you mean so it puts it on top of the
flag on top of the correct flag right it
doesn't pick up the banana and head to
Frankfurt so it would be very curious to
know if they're doing something like
this now of course it's a small startup
they're not going to post all their
Insider secrets on the internet maybe
sometime in the future we'll find out
but it is getting pretty impressive like
when you tell it to pick up the trash
while responding to a question you know
you know you tell me to do that like
that's something that I might struggle
with myself, multitasking is hard so it looks like this Google DeepMind RT-2 paper was published in July 2023
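As a concrete illustration of that idea, here is a toy sketch of what a vision-language-action interface could look like: a single model takes a camera image plus a plain-English command and emits a low-level action. This is only my illustration of the concept; vla_model and the EndEffectorAction fields are invented, not RT-2's or Figure's actual interface.

```python
# Toy illustration of a vision-language-action (VLA) model's interface:
# image + natural-language command in, low-level action out. The names and
# numbers are invented for illustration; this is not RT-2 or Figure code.
from dataclasses import dataclass


@dataclass
class EndEffectorAction:
    dx: float       # translation deltas in metres
    dy: float
    dz: float
    gripper: float  # 0.0 = open, 1.0 = closed


def vla_model(image: bytes, command: str) -> EndEffectorAction:
    # A real VLA model would run one forward pass of a large vision-language
    # model fine-tuned to also emit actions; here we just return a
    # plausible-looking pick motion.
    return EndEffectorAction(dx=0.12, dy=-0.30, dz=-0.05, gripper=1.0)


# Grounding "Germany" to the German flag on the table is done by the model itself.
print(vla_model(image=b"", command="move the banana to Germany"))
```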
here's Corey Lynch so Corey Lynch works at
figure AI so he's describing how this
robot is able to do what it's doing so
Corey his background was robotics at
Google and looks like he joined figure
AI uh around July 2023 crazy so he's
saying we are now having full
conversations with figure 01 oh is this
him I wonder so that's Corey that might be him interacting with the robot so he's saying we are now having full conversations with Figure 01 thanks to our partnership with OpenAI our robot can describe its visual
experience plan future actions reflect
on its memory explain its reasoning
verbally technical Deep dive thread so
he's saying let's break down what we see
in the video so all the behaviors are
learned not teleoperated so this is kind
of a sticking point with some of these
demonstrations we've seen some
incredible things done by robots for
fairly low price points but they're
done with tele operation so for example
you're seeing here, here's a robot cooking a three course Cantonese meal right
grilling up some delicious steaks or
chicken eggs vegetables cracking some
eggs like this is absolutely incredible
so this is teleoperated and what that
means is for example here's an example
of it so as you can see here there's a
person standing behind it and they're
kind of mimicking those gestures up in
the air and the robot kind of like
repeats them another way of doing it is
with virtual reality where you kind of
demonstrate those movements and the
robot kind of repeats it and the robots
can learn from this and can generalize
but that's quite different from
something like this RT-2 from Google DeepMind you can see here it's kind of
showing you is like sucking up all the
information right it's watching all
these videos and all the previous robots
and the Wikipedia and YouTube and
everything everything everything so as
they say here it learns from both web
and Robotics data so it just looks at
things and reads things and it knows
things right it it drinks and it knows
things so they're kind of specifying
here okay so it's not teleoperated and
it's running at normal speed so the
camera footage is not sped up and again
we all kind of appreciate that people are skeptical because they've been burnt before so I think it just helps to
have little disclaimers like this
somewhere because I mean for the people
that follow this stuff you know I think
to most people probably this is what
they think of like Peak Advanced
robotics right the dancing robots like
to somebody that's not really following
looks at this and goes okay this this is
probably the most advanced thing there
ever is look at the Motions but the
people that are following this stuff
they're like yeah cool but this thing
learning from watching videos how to
operate in the real world is much cooler
right it being able to generalize like
if you say I need to hammer a nail what
objects from this scene might be useful
it's like let's grab a rock and then
prints out the action of how to use that
rock to hammer whatever it needs to
hammer kind of using that Chain of
Thought reasoning which is something
that we know from interacting with large
language models like ChatGPT like you know that ChatGPT can probably figure out
the rock can do something like this
right and then it links to the execution
the code that manipulates that rock in
the physical world right so this idea of
vision language model you know it may
not be pretty but this is what Peak
Performance looks like and it seems like
figure is there it's you know there's
definitely competition for these General
r robots so I think people begin to kind
of refer to them off hand as like AGI
robots as in to point out that these are
General robots which you know as cool as
these are they probably wouldn't say
this about the Boston Dynamics robots as
impressive as the the movements are the
movements are prescripted they're
following a routine so he continues we
feed images from the robots cameras and
transcribed text from speech captured by
the onboard microphones to a large
multimodal model trained by open AI that
understands both images and texts so
it's interesting that they don't spell out which model so they don't say GPT-4 you know with vision they just say a model maybe Jimmy Apples will clue us in so I
can't find the Tweet but somewhere a
while back he was saying that's weird
somewhere he was saying that um there's
a robot heading to your favorite AI lab
and I feel like maybe it was uh related
to this and so we feed images from the
robot's cameras and transcribe text from
the speech captured by onboard
microphone so this is a little detail
that's interesting so OpenAI has their open-source Whisper which is a, you know, audio-to-text transcription model, I believe it's open source as part of their kind of suite of products
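The video does not confirm which speech-to-text system Figure actually runs on the robot, but for reference, this is roughly what the audio-to-text step looks like with OpenAI's open-source whisper package; the audio file name is made up.

```python
# Example use of OpenAI's open-source Whisper package (pip install openai-whisper).
# Figure has not said this is what runs on the robot; it only illustrates the
# audio-to-text step the video is referring to. Requires ffmpeg on the system.
import whisper

model = whisper.load_model("base")              # small pretrained checkpoint
result = model.transcribe("onboard_mic.wav")    # file name is illustrative
print(result["text"])                           # e.g. "can I have something to eat"
```

so in a previous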
video I kind of try to paint AGI as a
series of parts that you're trying to
combine together and this is kind of a
good representation of it right cuz
think about what we have here we have
Figure this robotics company likely running with OpenAI's Whisper, a large language model, maybe GPT-4, could be something else, something specific right something with vision that
can understand images and text and all
that is hooked into the robot so all the
pieces are working together to create
this robot with a general ability which
is kind of what we think of AGI as we
don't have a specific definition but you
know obviously this is moving in that
direction if it can like walk around and
learn and reason like if it can hear
commands and then respond to them like
that's like Are We There Are we almost
there at the very least close the model
processed the entire history of the
conversation including past images to
come up with language responses which
are spoken back to the human via text to
speech the same model is responsible for
deciding which learned closed loop
Behavior to run on the robot to fulfill
a given command loading particular
neural network weights uh onto the GPU
and executing a policy and so this last part to me reads as different from what Google DeepMind has where they kind of tokenize those movements so here's from Google DeepMind: to control a robot it must be trained to output actions we address this challenge by representing actions as tokens in the model's output similar to language tokens and describe actions as strings that can be processed by standard natural language tokenizers, shown here
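To unpack that quote: each dimension of the robot action is clamped, discretized into a small number of bins (256 in the paper, if I recall correctly), and the bin indices are written out as an ordinary string of integers that a standard language tokenizer can treat like any other text. A rough sketch of the idea, my illustration rather than DeepMind's code:

```python
# Sketch of the "actions as tokens" idea quoted above: clamp each action
# dimension, discretize it into (say) 256 bins, and write the bin indices as a
# plain string a standard language tokenizer can consume. Illustration only,
# not Google DeepMind's actual implementation.
def action_to_token_string(action, low=-1.0, high=1.0, bins=256):
    indices = []
    for value in action:
        value = min(max(value, low), high)                      # clamp to range
        idx = round((value - low) / (high - low) * (bins - 1))  # bin index 0..255
        indices.append(str(idx))
    return " ".join(indices)


# e.g. terminate flag, xyz translation, xyz rotation, gripper opening
print(action_to_token_string([0.0, 0.1, -0.2, 0.0, 0.0, 0.0, 0.5, 1.0]))
# prints eight space-separated bin indices such as "128 140 102 ..."
```

Decoding just runs the mapping in reverse: parse the integers back out of the generated text and de-quantize them into end-effector or joint commands.

so if somebody super smart in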
the comments can maybe unpack this a
little bit just put the word GPU
somewhere in the comment and I'll kind
of search by that but you know what does
this tell us so this is something
different from the rt2 and I'll do my
best to post an update if we get more
information but to me I'm reading this as there's maybe like a finite pre-trained number of actions that it has like pick up or whatever, I don't know, you know, pour some liquid into this thing or push a button and then the GPT model, the OpenAI model, kind of selects which specific thing to run and then runs it I mean I don't know if that's true I might be misreading it but we'll know more soon hopefully
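If that reading is right, the skeleton would be something like a fixed library of pretrained skills, with the OpenAI model only choosing which one to load and run. A hedged sketch of that guess; the skill names, file paths, and functions are all invented, not anything Figure has published.

```python
# Hedged sketch of the "model picks a pretrained skill" reading above.
# The skill names, file paths, and load/run functions are all invented.
SKILL_LIBRARY = {
    "hand_over_apple":      "weights/hand_over_apple.pt",
    "pick_up_trash":        "weights/pick_up_trash.pt",
    "place_in_drying_rack": "weights/place_in_drying_rack.pt",
}


def load_policy(weights_path: str):
    print(f"[loading policy weights onto the GPU] {weights_path}")
    return lambda observation: "low-level action"   # stand-in for a real policy


def execute_command(skill_name: str) -> None:
    policy = load_policy(SKILL_LIBRARY[skill_name])  # skill chosen by the language model
    for _ in range(3):                               # closed loop until the skill reports done
        action = policy("camera frame")
        print(f"[acting] {action}")


execute_command("place_in_drying_rack")
```

so, speech to text: this is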
where the person says can I have
something to eat right that feeds into
the open AI model Common Sense reasoning
from images it responds sure thing then
it goes into Behavior selection so it
has sort of these preexisting seems like
neural network policies for fast
dextrous manipulation and so those lead
into the whole body controller of safe
stable Dynamics so that's whole body
movement from legs and arms and Etc so
balancing while it's you know picking
stuff up Etc and then at the same time
the vision the robot as it's seeing
things right so it's seeing what it's
doing that kind of feeds back into you
know the GPT open AI model and the
neural network policies which is kind of
I guess an OODA loop, observe orient decide act, all right so here's the
command here's like the acknowledgement
sure and then it goes into this Loop to
complete the task but interestingly it
sounds like it it can multitask so it
can like tell you what it's doing as
it's doing it and so they continue: connecting Figure 01 to a large
pre-trained multimodal model gives us
some interesting new capabilities so
again that's the big Point here that we
have a company good at building metal
robots figure one we have a company
that's excellent at building AI neural
Nets open AI so one plus the other now
it's able to describe its surroundings
use common sense reasoning when making
decisions for example the dishes on the table like that plate and cup are likely to go into the drying rack next
translate ambiguous high-level requests like I'm hungry to some context-appropriate behavior like hand the person an apple so that was the big test with RT-2 slash RT-X, whatever you want to call, you know, the RT series of models you could give it a
general request I think one of the
examples was they said I'm tired and
it's like okay you know we have a Red
Bull would you like a Red Bull which you
know shows that we make them in our
image don't we and then describe why it
executed a particular action in plain
English for example it was the only
edible item I could provide you with
from the table and a large pre-trained model that understands conversation history gives Figure 01 a powerful short-term memory consider the question can you put them there where does them refer to and where is there answering correctly requires the ability to reflect on memory with a pre-trained model analyzing the conversation's image and text history Figure 01 quickly forms and carries out the plan place the cup on the drying rack place the plate on the drying rack so one and two, steps one and two
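The "short-term memory" being described is essentially the conversation history itself: every past utterance and camera frame is kept and handed back to the model on each turn, which is what lets it resolve "them" and "there". A minimal sketch of that data structure; query_multimodal_model is a placeholder, not a real API call.

```python
# Minimal sketch of conversation history as short-term memory: keep every turn
# (text plus any camera frame) and pass the whole list to the model each time,
# so references like "them" and "there" can be resolved from context.
# "query_multimodal_model" is a placeholder, not a real API call.
history = []


def add_turn(role, text, image=None):
    history.append({"role": role, "text": text, "image": image})


def query_multimodal_model(turns):
    # A real system would send the text and images to the vision-language model here.
    return "Placing the plate and cup on the drying rack."


add_turn("human", "where do you think the dishes in front of you go next?", image="frame_0413.jpg")
add_turn("robot", "they are likely to go into the drying rack next.")
add_turn("human", "great, can you put them there?")   # "them" = dishes, "there" = drying rack
print(query_multimodal_model(history))
```

so I'm rewinding this to see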
so when GPT 4 Vision just came out there
was a report testing a lot of its
abilities and so one thing that they
found is it's really good at figuring
out what we're pointing at so if you're
watching on video like if I do this like
I'm pointing to something like you know
which word I'm pointing to right and so
the vision models also pick up on that
very well like if I have an arrow
pointing to something like you know what
I'm pointing to so do they like the the
models they understand it very well
instead of for example needing to show
it coordinates or having to Circle
something like or or anything like that
basically just an arrow or something
pointing at something works very very
well so I was curious if he had to if he
was pointing at all or it was just sort
of an in context so I'm not seeing any
sort of visual Clues other than the
verbal command but I'm going to make my
prediction now we as a society as humans
we will be pointing a lot more I'm just
guessing but like if you point at
something and you state an action that
might be like the easiest way to give
the most amount of information about
what needs to be done so next they
continue finally let's talk about the
learned low-level bimanual manipulation so all behaviors are driven by neural network visuomotor Transformer policies mapping pixels directly to actions these networks take in onboard images at 10 Hertz and generate 24-DoF actions so in robotics this is degrees of freedom so for example wrist poses and finger joint angles at 200 Hertz these actions serve as high rate setpoints for the even higher rate whole body controller to track this is a useful separation of concerns internet pre-trained models do common sense reasoning over images and text to come up with a high-level plan learned visuomotor policies execute the plan performing fast reactive behaviors that are hard to specify manually like manipulating a deformable bag in any position so like if you're picking up a bag of potato chips right it's not hard it's not solid so it will deform as you grab it right you're not going to be able to grab it if you don't provide enough pressure if you provide too much pressure you'll pop it so those look like trash, bags of potato chips, so it's grabbing it and even though it's kind of crumpling under the weight the robot's still able to grasp it and throw it in the trash
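The numbers in that passage describe a separation by rate: camera images arrive at roughly 10 Hz, the visuomotor policy emits 24-DoF setpoints at about 200 Hz, and an even faster whole-body controller tracks those setpoints while keeping the robot balanced. A toy sketch of that layering, purely illustrative and not Figure's control stack:

```python
# Toy sketch of the rate separation described above: 10 Hz images in, 200 Hz
# action setpoints out, tracked by an even faster whole-body controller.
# Purely illustrative pseudocode made runnable; not Figure's control stack.
POLICY_HZ = 200                           # visuomotor policy output rate
IMAGE_HZ = 10                             # onboard camera rate
STEPS_PER_IMAGE = POLICY_HZ // IMAGE_HZ   # 20 policy steps per camera frame


def visuomotor_policy(image, step):
    # Stand-in: would map pixels to 24 degrees of freedom (wrist poses, finger
    # joint angles, ...) in a single forward pass.
    return [0.0] * 24


def whole_body_controller(setpoint):
    # Stand-in: runs at a higher rate than the policy, keeping the robot
    # balanced while tracking the 24-DoF setpoints.
    pass


for frame_idx in range(2):               # two camera frames, i.e. 0.2 s of control
    image = f"camera frame {frame_idx}"  # arrives at ~10 Hz
    for step in range(STEPS_PER_IMAGE):  # ~200 Hz setpoints from the latest image
        setpoint = visuomotor_policy(image, step)
        whole_body_controller(setpoint)
```

they continue: meanwhile, a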
whole body controller ensures safe
stable Dynamics for example maintaining
balance even just a few years ago I
would have thought having a full
conversation with a humanoid robot while
it plans and carries out its own fully
learned behaviors would be something we
would have to wait decades to see
obviously a lot has changed in my mind
figure is the best place in the world
right now to scale up embodied Ai and so
they are recruiting recruiting hard so
you can go to figure.ai/careers yeah I
mean it's very exciting even more so
with the new partnership with openai
which they announced so looks like the
first time I think we've heard of it was
this was February 29th so they're saying
openai plus humanoid robots and you know
at the time they posted this you know
where the robot makes a cup of coffee
and you know I think it was well
received although there were some
criticism there was a lot about this demonstration that is fairly simple like
the coffee maker that they've used is
often the specific model is used for
robotic demonstrations because it's made
to be very simple to use you don't need
multiple digits you can have just one
claw like appendage you basically need
one finger to operate it you know
assuming that the K-Cup is in there
basically you just need to push the
handle down push a button it's done and
I think I even at the time said oh maybe
you know just I wasn't sure how
impressive it was just kind of that
demonstration alone kind of out of
context but I mean now it's definitely
shaping up and getting a lot more
exciting also I find myself regretting
more and more every time that I've said
anything negative about robots in the
past now as it's getting better and it's
talking you know you start reconsidering
some of your past words like maybe I
should not be talking crap about this
thing so with that said I think Figure AI has excellent robots amazing robots I love the robots maybe now Boston Dynamics will be motivated enough to learn a new TikTok dance crap I got to stop doing that Boston Dynamics is great Boston Dynamics makes good robots anyways my name is Wes Roth thank you
for watching