OpenAI's 'AGI Robot' Develops SHOCKING NEW ABILITIES | Sam Altman Gives Figure 01 a Brain

AI Unleashed - The Coming Artificial Intelligence Revolution and Race to AGI
13 Mar 2024 · 20:09

Summary

TLDR: Figure 01, a robot developed through a collaboration between Figure AI and OpenAI, is attracting attention. The robot runs high-level neural networks that understand vision and language, and it can hold full conversations with people. It can recognize the food on a table, clear away trash, and move dishes to a drying rack. All of its movements are learned by neural networks, and it acts naturally in response to human instructions.

Takeaways

  • 🤖 Figure AI and OpenAI have partnered, combining their expertise in robotics and neural networks.
  • 🔧 The robot is an amalgamation of metal and neural networks, with advanced vision and language capabilities.
  • 🍎 Based on the scene in front of it, Figure 01 can pick out food to offer and clean up the table.
  • 🗑️ The robot can move items that need clearing from the table to the right place, and it can multitask while doing so.
  • 🌐 All of its behaviors are learned by neural networks rather than pre-programmed.
  • 📚 Figure AI uses technology it developed in-house, and the details may not be made public.
  • 🤔 The robot invites comparison with Google DeepMind's RT-2 robot, though its learning approach may differ.
  • 💬 Figure 01 processes the conversation history and has a short-term memory, so it can respond to open-ended requests.
  • 🔄 The robot combines visual information with a language model to form and execute high-level plans.
  • 📈 Figure AI positions itself as the best place to scale up embodied AI and is actively recruiting.

Q & A

  • What kind of company is Figure AI?

    -Figure AI is a company specializing in robotics; it is partnering with OpenAI to combine AI with robotics technology.

  • What is the purpose of the partnership between Figure AI and OpenAI?

    -The purpose of the partnership is to combine OpenAI's high-level visual and language intelligence with Figure AI's fast, precise robot actions.

  • How can Figure AI's robot interact with humans?

    -The robot understands visual information, holds conversations through speech, and can answer questions.

  • How does Figure AI's robot carry out tasks?

    -The robot visually perceives its environment, decides on actions based on that information, and carries out tasks by manipulating objects (a minimal pseudocode-style sketch of this loop follows this Q&A list).

  • What is the difference between Google DeepMind's RT-2 robot and Figure AI's robot?

    -Google DeepMind's RT-2 uses its own "vision-language-action model," while Figure AI's robot is integrated with OpenAI's technology.

  • What do we know about how the robot learns its behaviors?

    -The robot learns from web and robotics data, combining visual information with actions so it can behave appropriately for the situation.

  • How does Figure AI's robot handle multitasking?

    -The robot can handle multiple tasks at once, carrying out physical tasks while holding a conversation.

  • What characterizes the integration of Figure AI's and OpenAI's technology?

    -Through this integration, the robot understands visual information and language and can carry out complex tasks on its own.

  • What does teleoperation of a robot mean?

    -Teleoperation is the process of a human operating a robot remotely and controlling its movements.

  • What is the outlook for Figure AI's robot?

    -Figure AI's robot represents progress in AI and robotics, and it is expected to be able to carry out an even wider range of tasks in the future.
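
To make the task-execution answer above concrete, here is a minimal, hypothetical sketch of a perceive-decide-act loop in Python. None of this is Figure AI's actual code; the camera, planner, and controller objects and their methods are illustrative placeholders.

```python
# Hypothetical sketch of a perceive -> decide -> act loop, not Figure AI's real stack.
# Camera, planner, and controller are placeholder objects supplied by the caller.
import time

class PerceiveDecideActLoop:
    def __init__(self, camera, planner, controller):
        self.camera = camera          # returns the latest RGB frame
        self.planner = planner        # maps (image, instruction) -> a named behavior
        self.controller = controller  # executes a named behavior on the hardware

    def run(self, instruction, hz=10):
        period = 1.0 / hz
        while True:
            frame = self.camera.capture()                       # perceive the scene
            behavior = self.planner.decide(frame, instruction)  # decide what to do next
            if behavior == "done":
                break
            self.controller.execute(behavior)                   # act on the environment
            time.sleep(period)
```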

Outlines

00:00

🤖 The Figure AI and OpenAI partnership

This section explains the first result of Figure AI's partnership with OpenAI, which combines robotics with neural networks. The robot, Figure 01, is shown recognizing objects on a table, picking up trash while conversing with a person, and placing dishes in a drying rack while explaining its actions. This is presented as the amalgamation of metal and neural networks.

05:02

🧠 OpenAI's visual and language intelligence

This section explains that OpenAI's models provide high-level visual and language intelligence, while Figure's neural networks deliver fast, precise robot actions. Because everything in the video is a neural network, the robot's movements and its answers to questions are not scripted; they are based on what it has learned. The section also compares the system with Google DeepMind's RT-2 robot and suggests how Figure AI has advanced.

10:02

🗣️ Integrating conversation and action

This section describes Figure 01's ability to describe its visual experience during a conversation, plan future actions, and explain its reasoning by reflecting on its memory. In the technical deep dive it is emphasized that all behaviors are learned, not teleoperated. It also points out the differences from Google DeepMind's RT-2 and presents Figure AI's newer approach.

15:03

🔄 Integrating the vision-language model

This section explains how the vision-language model understands images and text and forms high-level plans, and how learned visuomotor policies execute those plans. It also explains in detail how this differs from Google DeepMind's approach and shows concretely how Figure AI's system works.

20:04

👋 Thanks and outlook

In the final section, the speaker, Wes Roth, thanks the viewers. He also touches on how the collaboration between Figure AI and OpenAI has progressed since the partnership was announced, and jokes about Boston Dynamics and its dancing robots.

Keywords

💡Figure AI

Figure AI is a robotics company that is working with OpenAI to combine robotics with AI expertise. The theme of the video is how Figure AI integrates the robot with neural networks to achieve advanced conversation and behavior.

💡OpenAI

OpenAI is an organization that researches and develops advanced AI technology and is collaborating with Figure AI on the robot. The video explains that the neural networks OpenAI provides enable the robot's visual and language understanding and play the key role in making natural conversation possible.

💡Neural Networks

Neural networks are a form of AI that controls the robot's actions; by learning from data and recognizing patterns, they enable the robot to carry out complex tasks. The video emphasizes that neural networks are the central element behind the robot's fast, precise movements.

💡Robotics

Robotics is the field of mechanical engineering and computer science concerned with the design, construction, and application of robots. The video shows how Figure AI's robotics expertise, combined with OpenAI's AI technology, produces an advanced robot.

💡Conversations

A conversation is the process by which two or more parties communicate. The video highlights how the combination of Figure AI's and OpenAI's technology lets the robot understand human speech and respond in a natural flow.

💡Multitasking

Multitasking is the ability to handle several tasks at once. The video shows the robot answering questions while also picking up trash and tidying the dishes, which conveys the flexibility it needs to cope with varied real-world situations.
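
As a rough illustration of the multitasking described above (answering a question while acting), the following Python sketch runs a speech coroutine and a manipulation coroutine concurrently. The speak and pick_up_trash functions are hypothetical stand-ins, not anything from Figure's stack.

```python
# Minimal sketch of "talking while acting"; speak() and pick_up_trash() are stand-ins.
import asyncio

async def speak(text):
    print(f"[speech] {text}")
    await asyncio.sleep(2)          # stand-in for text-to-speech playback

async def pick_up_trash():
    for step in ["reach", "grasp", "lift", "drop in bin"]:
        print(f"[motion] {step}")
        await asyncio.sleep(1)      # stand-in for running a low-level motion policy

async def main():
    # Run the verbal answer and the physical task concurrently.
    await asyncio.gather(
        speak("I'll put the trash in the bin while I answer your question."),
        pick_up_trash(),
    )

asyncio.run(main())
```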

💡Dexterity

Dexterity refers to the precision of hand and finger movements. The video explains that neural networks provide the low-level dexterity the robot needs for fine manipulation, allowing it to handle objects accurately and carry out complex tasks.

💡Vision Language Action model

A vision-language-action model refers to the pipeline by which a robot takes in visual information, understands language, and acts on that basis. The video presents this kind of model as central to giving the robot visual and language understanding and turning it into action in the physical world.
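
For contrast with Figure's approach, the video describes Google DeepMind's RT-2 as representing robot actions as tokens the language model can emit, like text. A minimal sketch of that general idea (discretizing continuous action values into integer bins and back) is shown below; the ranges and bin count are made-up illustration values, not RT-2's actual scheme.

```python
# Sketch of the general "actions as text tokens" idea: clip each continuous action
# value, discretize it into an integer bin, and write the bins out as a string.

def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Map a list of continuous action values to a space-separated string of bin indices."""
    tokens = []
    for a in action:
        a = max(low, min(high, a))                        # clip to the valid range
        idx = int((a - low) / (high - low) * (bins - 1))  # discretize into one of `bins` buckets
        tokens.append(str(idx))
    return " ".join(tokens)

def tokens_to_action(token_str, low=-1.0, high=1.0, bins=256):
    """Invert the mapping: bin indices back to approximate continuous values."""
    return [low + int(t) / (bins - 1) * (high - low) for t in token_str.split()]

print(action_to_tokens([0.1, -0.5, 0.9]))   # "140 63 242"
```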

💡Teleoperation

Teleoperation means remote operation: a human remotely drives the robot and directly controls its movements. The video emphasizes that no teleoperation is involved here and that the robot acts based on its own learning and judgment.

💡Embodied AI

Embodied AI refers to AI built into a robot with a physical body, designed so the robot can interact directly with its real environment and act appropriately for the situation. The video presents Figure AI's robot as an example of embodied AI and demonstrates its capabilities.

💡AGI (Artificial General Intelligence)

AGI refers to the goal, in the field of artificial intelligence, of general intelligence that can handle a wide range of situations rather than being limited to a specific task or environment. The video suggests that Figure AI's robot points in the direction of AGI: by learning and adapting to diverse tasks, it hints at a future in which such robots assist people in daily life.

Highlights

Figure AI, a robotics company, has partnered with OpenAI to combine expertise in robotics with AI.

The collaboration aims to showcase the amalgamation of metal and neural networks, marking a significant technological advancement.

Figure 01, the robot, is capable of having full conversations with people, thanks to the integration of OpenAI's models.

The robot's actions are not scripted or pre-programmed, but rather determined by neural networks, indicating a high level of autonomy.

Figure AI's robot demonstrates the ability to pick up trash, respond to questions, and perform multitasking.

The robot's neural networks enable fast, low-level, dexterous actions without the need for explicit programming.

The robot can describe its visual experience, plan future actions, reflect on its memory, and explain its reasoning verbally.

All behaviors exhibited in the demonstration video are learned, not teleoperated, and occur at normal speed.

The robot learns from both web data and robotics data, incorporating a wide range of information into its operations.

Figure AI's robot can understand and execute complex commands, such as moving a banana to a specific location.

The robot's capabilities are a result of the partnership between Figure AI's robotics expertise and OpenAI's AI models.

The robot's system involves a large multimodal model trained by OpenAI that understands both images and text.

The robot processes the entire history of the conversation, including past images, to come up with language responses.
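
A minimal sketch of what "processing the entire conversation history, including past images" could look like in code: a buffer of text-plus-image turns that the model conditions on to resolve references such as "them" and "there". This is a generic illustration, not the interface Figure or OpenAI actually use.

```python
# Generic multimodal conversation buffer; not Figure's or OpenAI's real interface.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    speaker: str                    # "human" or "robot"
    text: str                       # transcribed speech or generated reply
    image: Optional[bytes] = None   # camera frame captured at this point, if any

@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)

    def add(self, speaker, text, image=None):
        self.turns.append(Turn(speaker, text, image))

    def context(self, max_turns=20):
        """Return the most recent turns; the model conditions on these to resolve
        pronouns like 'them' against earlier images and utterances."""
        return self.turns[-max_turns:]

history = Conversation()
history.add("human", "Can you put them there?", image=b"<camera frame>")
print(len(history.context()))   # 1
```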

Figure AI's robot can reflect on memory and conversation history to inform its actions and decisions.

The robot's neural network policies enable fast, dexterous manipulation, while a whole-body controller provides safe, stable dynamics for whole-body movement.

The robot's vision system is adept at understanding pointing gestures and other visual cues from humans.

The robot's behaviors are driven by neural-network visuomotor transformer policies that map pixels directly to actions.
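
A toy PyTorch sketch of a visuomotor transformer policy in the spirit described here: camera frames go in, patch features pass through a small transformer, and a head emits a 24-DOF action setpoint per frame. Layer sizes and structure are invented for illustration and are not Figure's architecture.

```python
# Toy visuomotor policy: image -> patch tokens -> transformer -> 24-DOF action setpoint.
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    def __init__(self, num_dof=24, dim=256):
        super().__init__()
        # Turn the image into a grid of patch embeddings (a crude stand-in for a ViT stem).
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.action_head = nn.Linear(dim, num_dof)  # e.g. wrist poses and finger joint angles

    def forward(self, image):
        # image: (batch, 3, 224, 224) -> patches: (batch, tokens, dim)
        patches = self.patchify(image).flatten(2).transpose(1, 2)
        features = self.encoder(patches)
        return self.action_head(features.mean(dim=1))  # one 24-DOF setpoint per frame

policy = VisuomotorPolicy()
frame = torch.randn(1, 3, 224, 224)
print(policy(frame).shape)  # torch.Size([1, 24])
```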

Figure AI is a leading place for scaling up embodied AI, and they are actively recruiting talent to further advance the technology.

Transcripts

play00:00

so figure AI is the robotics company

play00:02

that partnered up with open AI to

play00:04

combine their expertise in building

play00:06

robotics with openi well Ai and this is

play00:10

the first time we can behold the

play00:13

amalgamation of metal and neural Nets

play00:16

now I'm exaggerating but this is kind of

play00:19

a big deal take a

play00:28

look

play00:30

hey figure one what do you see right

play00:35

now I see a red apple on a plate in the

play00:38

center of the table a drying rack with

play00:41

cups and a plate and you standing nearby

play00:44

with your hand on the table great can I

play00:46

have something to

play00:50

eat sure

play00:58

thing great

play01:00

can you explain why you did what you

play01:02

just did while you pick up this

play01:08

trash on it so I gave you the Apple

play01:11

because it's the only uh edible item I

play01:13

could provide you with from the

play01:21

table great so based on the scene right

play01:25

now where do you think the dishes in

play01:27

front of you go

play01:28

next

play01:33

the dishes on the table like that plate

play01:36

and cup are likely to go into the drying

play01:38

rack next great can you put them

play01:41

[Music]

play01:44

there of

play01:55

[Music]

play01:58

course

play02:01

[Music]

play02:03

all right so how do you think you

play02:07

did I I think I did pretty well the

play02:11

apple found its new owner the trash is

play02:13

gone and the table wear is right where

play02:15

it

play02:16

belongs I agree thanks figure

play02:21

one you're welcome if there's anything

play02:23

else you need help with just let me

play02:28

know

play02:29

[Applause]

play02:34

[Music]

play02:55

so they're saying with opening eye

play02:57

figure one can now have full

play02:59

conversations with with people open ey

play03:01

models provide high level Visual and

play03:02

language intelligence figure neural

play03:05

networks deliver fast low-level dextrous

play03:07

robot actions everything in this video

play03:10

is a neural network this I think is as

play03:14

clear as it gets previous videos I think

play03:16

I maybe questioned like exactly what was

play03:18

happening like I wish they kind of

play03:19

spelled out I mean this is pretty pretty

play03:22

clear in what they're saying here

play03:24

everything here is a neural network mean

play03:27

the robot isn't scripted it's not

play03:30

pre-programmed for certain motions it's

play03:32

not pre-programmed to answer in a

play03:33

certain way that's you know Chad GPT

play03:35

that's GPT 4 or higher maybe that's

play03:38

answering the questions and then the

play03:40

robot is moving based on the neural

play03:41

networks that figure AI has it sounds

play03:44

like already independently developed

play03:46

just to make sure that the people that

play03:47

are like asking questions kind of like

play03:49

know so this is film speed 1.0 right

play03:52

endtoend neural networks speech to

play03:54

speech reasoning and this is the figure

play03:56

01 the robot Plus open Ai and look I

play03:59

think it's fair that people are

play04:00

skeptical we had somebody you know roll

play04:03

a truck off a hill and then be like this

play04:06

thing really functions and you know it

play04:08

did not but at this point it's pretty

play04:11

obvious I feel like to most people that

play04:13

kind of been following these

play04:14

developments that okay they they really

play04:16

have the real deal you know for example

play04:19

if Google deep Minds rt2 robot you know

play04:22

they have they kind of explain what's

play04:24

running it they have this Vision

play04:26

language action model so almost kind of

play04:29

like you can think of it of as like a

play04:31

large language model with vision with

play04:34

this action model kind of on top of it

play04:37

that translates with the vision language

play04:39

models kind of reasoning and seeing

play04:42

translates that into movements of the

play04:44

hand and the grippers ETC so text goes

play04:47

in here and again this doesn't have

play04:48

anything to do with figure this is

play04:50

Google deep mine but maybe figure has

play04:53

something similar right so you have the

play04:55

large language model here and all the

play04:57

speech to speech right so here's what

play04:59

the should do and the language model

play05:01

like reasons through it you also have

play05:03

like the um Vision input right that kind

play05:05

of combines and the output is in terms

play05:08

of the movements of the robots so kind

play05:11

of it's movements in the physical space

play05:14

and that's why you're able to say

play05:17

somewhat weird things like you know move

play05:19

banana right as you can see in this

play05:21

image there's a couple flags on this

play05:23

table and there's a banana and you get

play05:24

the command move banana to Germany right

play05:27

and it figures out you know kind of what

play05:30

you mean so it puts it on top of the

play05:32

flag on top of the correct flag right it

play05:34

doesn't pick up the banana and head to

play05:37

Frankfurt so it would be very curious to

play05:39

know if they're doing something like

play05:41

this now of course it's a small startup

play05:43

they're not going to post all their

play05:44

Insider secrets on the internet maybe

play05:46

sometime in the future we'll find out

play05:48

but it is getting pretty impressive like

play05:50

when you tell it to pick up the trash

play05:52

while responding to a question you know

play05:54

you know you tell me to do that like

play05:56

that's something that I might struggle

play05:58

with myself multitasking hard so it

play06:00

looks like this Google Deep Mind rt2

play06:02

paper was published in uh July 2023 and

play06:06

here's Cory Lynch so Cory Lynch works at

play06:08

figure AI so he's describing how this

play06:12

robot is able to do what it's doing so

play06:13

Corey his background was robotics at

play06:17

Google and looks like he joined figure

play06:19

AI uh around July 2023 crazy so he's

play06:23

saying we are now having full

play06:25

conversations with figure 01 oh is this

play06:27

him I wonder so that's Cor

play06:30

that that might be him interacting with

play06:31

the robot so he's saying we are not

play06:33

having full conversations with figure

play06:34

one thanks to our partnership with open

play06:36

AI our robot can describe its visual

play06:39

experience plan future actions reflect

play06:42

on its memory explain its reasoning

play06:44

verbally technical Deep dive thread so

play06:47

he's saying let's break down what we see

play06:48

in the video so all the behaviors are

play06:51

learned not teleoperated so this is kind

play06:53

of a sticking point with some of these

play06:55

demonstrations we've seen some

play06:57

incredible things done by robots for

play06:59

fair like low price points but they're

play07:01

done with tele operation so for example

play07:04

you're seeing here he's a robot cooking

play07:06

a three course uh Cantonese meal right

play07:09

grilling up some delicious steaks or

play07:12

chicken eggs vegetables cracking some

play07:15

eggs like this is absolutely incredible

play07:18

so this is Telly operated and what that

play07:20

means is for example here's an example

play07:23

of it so as you can see here there's a

play07:24

person standing behind it and they're

play07:26

kind of mimicking those gestures up in

play07:28

the air and the robot kind of like

play07:30

repeats them another way of doing it is

play07:32

with virtual reality where you kind of

play07:34

demonstrate those movements and the

play07:36

robot kind of repeats it and the robots

play07:39

can learn from this and can generalize

play07:42

but that's quite different from

play07:43

something like this this rt2 from Google

play07:45

deep mins you can see here it's kind of

play07:47

showing you is like sucking up all the

play07:48

information right it's watching all

play07:50

these videos and all the previous robots

play07:52

and the Wikipedia and YouTube and

play07:54

everything everything everything so as

play07:56

they say here it learns from both web

play07:58

and Robotics data so it just looks at

play08:01

things and reads things and it knows

play08:03

things right it it drinks and it knows

play08:06

things so they're kind of specifying

play08:07

here okay so it's not teleoperated and

play08:09

it's running at normal speed so the

play08:11

camera footage is not sped up and again

play08:13

we all kind of appreciate that are

play08:15

skeptical because they've been burnt

play08:17

before so I I think it just helps to

play08:19

have little disclaimers like this

play08:20

somewhere because I mean for the people

play08:22

that follow this stuff you know I think

play08:23

to most people probably this is what

play08:25

they think of like Peak Advanced

play08:28

robotics right the dancing robots like

play08:30

to somebody that's not really following

play08:32

looks at this and goes okay this this is

play08:33

probably the most advanced thing there

play08:35

ever is look at the Motions but the

play08:37

people that are following this stuff

play08:39

they're like yeah cool but this thing

play08:41

learning from watching videos how to

play08:43

operate in the real world is much cooler

play08:46

right it being able to generalize like

play08:48

if you say I need to hammer a nail what

play08:50

objects from this scene might be useful

play08:53

it's like let's grab a rock and then

play08:54

prints out the action of how to use that

play08:57

rock to hammer whatever it needs to

play08:58

hammer kind of using that Chain of

play09:00

Thought reasoning which is something

play09:01

that we know from interacting with large

play09:03

language models like chpt like you you

play09:06

know that chpt can probably figure out

play09:08

the rock can do something like this

play09:10

right and then it links to the execution

play09:13

the code that manipulates that rock in

play09:15

the physical world right so this idea of

play09:17

vision language model you know it may

play09:19

not be pretty but this is what Peak

play09:21

Performance looks like and it seems like

play09:23

figure is there it's you know there's

play09:26

definitely competition for these General

play09:29

r robots so I think people begin to kind

play09:31

of refer to them off hand as like AGI

play09:33

robots as in to point out that these are

play09:35

General robots which you know as cool as

play09:38

these are they probably wouldn't say

play09:41

this about the Boston Dynamic robots as

play09:43

impressive as the the movements are the

play09:45

movements are prescripted they're

play09:47

following a routine so he continues we

play09:49

feed images from the robots cameras and

play09:51

transcribed text from speech captured by

play09:54

the onboard microphones to a large

play09:56

multimodal model trained by open AI that

play09:59

understands both images and texts so

play10:02

it's interesting that they don't spell

play10:04

which model so they don't say GPT 4 you

play10:07

know with vision they're like a model

play10:09

maybe Jimmy apples will clue Us in so I

play10:12

can't find the Tweet but somewhere a

play10:14

while back he was saying that's weird

play10:16

somewhere he was saying that um there's

play10:18

a robot heading to your favorite AI lab

play10:22

and I feel like maybe it was uh related

play10:23

to this and so we feed images from the

play10:25

robot's cameras and transcribe text from

play10:29

the speech captured by onboard

play10:31

microphone so this is a little detail

play10:33

that's interesting so open ey has their

play10:35

open source whisper which is a You Know

play10:37

audio to text transcription I believe

play10:40

it's open source as part of their kind

play10:42

of Suite of products so in a previous

play10:44

video I kind of try to paint AGI as a

play10:47

series of parts that you're trying to

play10:49

combine together and this is kind of a

play10:52

good representation of it right cuz

play10:53

think about what we have here we have

play10:56

figure this robotics company likely

play10:58

running with opening eyes whisper a

play11:01

large language model it's called a GPT 4

play11:03

could be something else something

play11:04

specific right something of vision that

play11:06

can understand images and text and all

play11:09

that is hooked into the robot so all the

play11:12

pieces are working together to create

play11:14

this robot with a general ability which

play11:18

is kind of what we think of AGI as we

play11:19

don't have a specific definition but you

play11:21

know obviously this is moving in that

play11:23

direction if it can like walk around and

play11:25

learn and reason like if it can hear

play11:28

commands and then respond to them like

play11:31

that's like Are We There Are we almost

play11:34

there at the very least close the model

play11:37

processed the entire history of the

play11:38

conversation including past images to

play11:40

come up with language responses which

play11:42

are spoken back to the human via text to

play11:44

speech the same model is responsible for

play11:46

deciding which learned closed loop

play11:49

Behavior to run on the robot to fulfill

play11:51

a given command loading particular

play11:53

neural network weights uh onto the GPU

play11:57

and executing a policy and so this last

play12:00

part that to me reads as different from

play12:03

what Google Deep Mind has with their

play12:05

little like they kind of toonize those

play12:07

movements so here's from Google Deep

play12:09

Mind to control a robot it must be

play12:11

trained to Output actions we address

play12:13

this challenge by representing actions

play12:15

as tokens in the model's output similar

play12:17

to language tokens and describe actions

play12:19

as strings that can be processed by

play12:21

standard natural language tokenizers

play12:23

shown here so if somebody super smart in

play12:26

the comments can maybe unpack this a

play12:28

little bit just put the word GPU

play12:30

somewhere in the comment and I'll kind

play12:31

of search by that but you know what does

play12:34

this tell us so this is something

play12:35

different from the rt2 and I'll do my

play12:38

best to post an update if we get more

play12:40

information but to me I'm reading this

play12:41

as there's a maybe like a finite

play12:44

pre-trained number of actions that it

play12:47

has like pick up or or whatever I

play12:50

don't know you know pour some liquid

play12:52

into this thing or push a button and

play12:54

then the GPT model the open AI model

play12:57

kind of selects which specific thing to

play12:59

run and then runs it I mean I don't know

play13:01

if that's true I I might be misreading

play13:02

it but we'll we'll know more soon

play13:05

hopefully so speech to text this is

play13:08

where the person says can I have

play13:09

something to eat right that feeds into

play13:10

the open AI model Common Sense reasoning

play13:13

from images it responds sure thing then

play13:16

it goes into Behavior selection so it

play13:18

has sort of these preexisting seems like

play13:21

neural network policies for fast

play13:24

dextrous manipulation and so those lead

play13:26

into the whole body controller of safe

play13:29

stable Dynamics so that's whole body

play13:31

movement from legs and arms and Etc so

play13:34

balancing while it's you know picking

play13:36

stuff up Etc and then at the same time

play13:38

the vision the robot as it's seeing

play13:41

things right so it's seeing what it's

play13:42

doing that kind of feeds back into you

play13:45

know the GPT open AI model and the

play13:48

neural network policies which is kind of

play13:50

I guess an UDA Loop Orient observe

play13:52

decide act all right so here's the

play13:55

command here's like the acknowledgement

play13:57

sure and then it goes into this Loop to

play13:59

complete the task but interestingly it

play14:01

sounds like it it can multitask so it

play14:04

can like tell you what it's doing as

play14:05

it's doing it and so the continue

play14:07

connecting figure1 to a large

play14:08

pre-trained multimodal model gives us

play14:10

some interesting new capabilities so

play14:12

again that's the big Point here that we

play14:13

have a company good at building metal

play14:16

robots figure one we have a company

play14:18

that's excellent at building AI neural

play14:19

Nets open AI so one plus the other now

play14:22

it's able to describe its surroundings

play14:25

use common sense reasoning when making

play14:27

decisions for example the ishes on the

play14:29

table like that plate and cup are likely

play14:32

to go into the drawing rack next

play14:35

translate ambiguous highle request like

play14:37

I'm hungry to some context appropriate

play14:39

behavior like hand the person and Apple

play14:42

so that was the big test with rt2 SL RTX

play14:46

whatever you want to call you know the

play14:47

RT series of models you could give it a

play14:50

general request I think one of the

play14:52

examples was they said I'm tired and

play14:54

it's like okay you know we have a Red

play14:56

Bull would you like a Red Bull which you

play14:58

know shows that we make them in our

play15:00

image don't we and then describe why it

play15:03

executed a particular action in plain

play15:05

English for example it was the only

play15:07

edible item I could provide you with

play15:09

from the table and a large pre-trained

play15:11

model that understands conversation

play15:13

history gives figure one a powerful

play15:15

short-term memory consider the question

play15:17

can you put them there where does them

play15:19

refer to and where is there answering

play15:23

correctly requires the ability to uh

play15:25

reflect on memory with a pre-trained

play15:28

model analog in the conversation's image

play15:30

and text history figure one quickly

play15:32

forms and carries out the plan place the

play15:34

cup on the drawing rack place the plate

play15:37

on the drawing rack so one and two steps

play15:40

one and two so I'm rewinding this to see

play15:43

so when GPT 4 Vision just came out there

play15:45

was a report testing a lot of its

play15:47

abilities and so one thing that they

play15:49

found is it's really good at figuring

play15:51

out what we're pointing at so if you're

play15:52

watching on video like if I do this like

play15:54

I'm pointing to something like you know

play15:57

which word I'm pointing to right and so

play15:59

the vision models also pick up on that

play16:01

very well like if I have an arrow

play16:03

pointing to something like you know what

play16:04

I'm pointing to so do they like the the

play16:07

models they understand it very well

play16:09

instead of for example needing to show

play16:12

it coordinates or having to Circle

play16:14

something like or or anything like that

play16:16

basically just an arrow or something

play16:18

pointing at something works very very

play16:19

well so I was curious if he had to if he

play16:21

was pointing at all or it was just sort

play16:24

of an in context so I'm not seeing any

play16:26

sort of visual Clues other than the

play16:28

verbal command but I'm going to make my

play16:30

prediction now we as a society as humans

play16:33

we will be pointing a lot more I'm just

play16:36

guessing but like if you point at

play16:37

something and you stay in action that

play16:39

might be like the easiest way to give

play16:41

the most amount of information about

play16:42

what needs to be done so next they

play16:44

continue finally let's talk about the

play16:46

Learned low-level B manual manipulation

play16:50

so all behaviors are driven by neural

play16:52

network Vis motor Transformer policies

play16:55

mapping pixels directly to actions these

play16:57

networks taken onboard images at 10

play17:00

Hertz and generate 24 doof actions so in

play17:04

robotics this is directions of Freedom

play17:06

so for example wrist poses and finger

play17:09

joint angles at 200 htz these actions

play17:13

serve as high rate set points for the

play17:16

even higher rate whole body controller

play17:18

to track this is a useful separation of

play17:21

concerns internet pre-trained models do

play17:23

common sense reasoning over images and

play17:25

text to come up with a highlevel plan

play17:28

learned VIs motor policies execute the

play17:30

plan performing fast reactive behaviors

play17:33

that are hard to specify manually like

play17:35

manipulating a deformable bag in any

play17:38

position so like if you're picking up a

play17:40

bag of potato chips right it's not hard

play17:43

it's not solid so it will deform as you

play17:46

grab it right you're not going to be

play17:48

able to grab it if you don't provide

play17:50

enough pressure If you provide too much

play17:52

pressure you'll pop it so those look

play17:54

like trash bags of potato chips so it's

play17:56

grabbing it and even though it's kind of

play17:58

crumbling out of the weight the robot's

play18:00

still able to grasp it and throw it in

play18:02

the trash they continue meanwhile a

play18:04

whole body controller ensures safe

play18:06

stable Dynamics for example maintaining

play18:09

balance even just a few years ago I

play18:11

would have thought having a full

play18:12

conversation with a humanoid robot while

play18:14

it plans and carries out its own fully

play18:16

learned behaviors would be something we

play18:18

would have to wait decades to see

play18:20

obviously a lot has changed in my mind

play18:22

figure is the best place in the world

play18:24

right now to scale up embodied Ai and so

play18:27

they are recruiting recruiting hard so

play18:30

you can go to figure. A/C careers yeah I

play18:33

mean it's very exciting even more so

play18:35

with the new partnership with openai

play18:37

which they announced so looks like the

play18:39

first time I think we've heard of it was

play18:41

this was February 29th so they're saying

play18:43

openai plus humanoid robots and you know

play18:46

at the time they posted this you know

play18:48

where the robot makes a cup of coffee

play18:51

and you know I think it was well

play18:53

received although there were some

play18:54

criticism there was a lot about this

play18:56

demonstration that is fairly simp like

play18:59

the coffee maker that they've used is

play19:01

often the specific model is used for

play19:03

robotic demonstrations because it's made

play19:05

to be very simple to use you don't need

play19:08

multiple digits you can have just one

play19:10

claw like appendage you basically need

play19:13

one finger to operate it you know

play19:14

assuming that the Cure cup is in there

play19:17

basically you just need to push the

play19:19

handle down push a button it's done and

play19:21

I think I even at the time said oh maybe

play19:23

you know just I wasn't sure how

play19:26

impressive it was just kind of that

play19:28

demonstration alone kind of out of

play19:30

context but I mean now it's definitely

play19:33

shaping up and getting a lot more

play19:35

exciting also I find myself regretting

play19:37

more and more every time that I've said

play19:39

anything negative about robots in the

play19:41

past now as it's getting better and it's

play19:42

talking you know you start reconsidering

play19:45

some of your past words like maybe I

play19:46

should not be talking crap about this

play19:48

thing so with that said I think figurei

play19:50

has excellent robots amazing robots I

play19:53

love the robots maybe now Boston

play19:55

Dynamics will uh be motivated enough to

play19:58

learn a new Tik Tock dance crap I got to

play20:00

stop doing that Boston Dynamics is great

play20:03

Boston Dynamics makes good robots

play20:06

anyways my name is Wes Roth thank you

play20:08

for watching


Related Tags
Figure AI, OpenAI, robotics, neural networks, visual experience, future planning, memory reflection, reasoning explanation, human-robot interaction, technological innovation, emotion