OpenAI's 'AGI Robot' Develops SHOCKING NEW ABILITIES | Sam Altman Gives Figure 01 a Brain

AI Unleashed - The Coming Artificial Intelligence Revolution and Race to AGI
13 Mar 2024 · 20:09

Summary

TLDR: Figure AI, in partnership with OpenAI, presents a groundbreaking amalgamation of robotics and neural networks. The robot, Figure 01, demonstrates advanced capabilities such as full conversations, understanding and responding to visual cues, and executing complex tasks like picking up trash and handling objects. The robot operates through a combination of a large language model for reasoning and neural network policies for physical actions, showcasing a significant step towards general AI and practical applications in embodied AI.

Takeaways

  • 🤖 Figure AI is a robotics company that has partnered with OpenAI to integrate robotics and AI expertise.
  • 🔄 The collaboration aims to showcase the synergy of metal and neural networks, marking a significant advancement in the field.
  • 🍎 Figure One, the robot, can engage in full conversations with people, demonstrating high-level visual and language intelligence.
  • 🧠 The robot's actions are powered by neural networks, not pre-programmed scripts, allowing for adaptive and dexterous movements.
  • 🎥 The video demonstrates the robot's ability to understand and execute tasks based on visual input and verbal commands.
  • 🔄 The robot can multitask, such as picking up trash while responding to questions, showcasing its ability to handle complex tasks.
  • 🤔 Figure AI's technology suggests the potential for general AI robots (AGI), with capabilities beyond scripted movements.
  • 💻 The robot's visual and language reasoning likely runs on a model provided by OpenAI, possibly GPT-4 or a similar advanced model.
  • 📈 The partnership between Figure AI and OpenAI highlights the potential for scaling up embodied AI technology.
  • 🚀 The demonstration of the robot's capabilities, such as handling objects and understanding context, indicates significant progress in AI and robotics.
  • 📌 The video serves as a clear example of the integration of AI and robotics, moving towards more autonomous and interactive machines.

Q & A

  • What is the significance of the partnership between Figure AI and OpenAI?

    -The partnership between Figure AI and OpenAI is significant because it combines Figure AI's expertise in building robotics with OpenAI's advanced AI and neural network capabilities. This collaboration aims to create robots that can understand and interact with their environment in a more sophisticated and autonomous manner.

  • How does Figure AI's robot utilize neural networks?

    -Figure AI's robot uses neural networks to perform a variety of tasks. These include low-level dexterous actions, such as picking up objects, as well as high-level visual and language intelligence to understand and respond to human commands and questions.

  • What does the term 'end-to-end neural networks' refer to in the context of the video?

    -In the context of the video, 'end-to-end neural networks' refers to the complete system of neural networks that handle all aspects of a task, from perception (seeing or understanding the environment) to action (physically manipulating objects), without the need for human intervention or pre-programming.

  • How does the robot determine which objects are useful for a given task?

    -The robot determines which objects are useful for a task by processing visual input from its cameras and understanding the context through a large multimodal model trained by OpenAI. This model uses common sense reasoning to decide which objects are appropriate for the task at hand.

  • What is the role of the 'speech to speech' component in the robot's operation?

    -The 'speech to speech' component allows the robot to engage in full conversations with humans. It processes spoken commands, understands the context, and generates appropriate verbal responses through text-to-speech technology. (A minimal sketch of such a pipeline appears after this Q&A section.)

  • How does the robot handle ambiguous or high-level requests?

    -The robot handles ambiguous or high-level requests by using its AI model to interpret the request in the context of its surroundings and the objects available. It then selects and executes the most appropriate behavior to fulfill the command, such as handing a person an apple when they express hunger.

  • What is the significance of the robot's ability to reflect on its memory?

    -The robot's ability to reflect on its memory allows it to understand and respond to questions about its past actions. This capability provides a richer interaction experience, as the robot can explain why it performed certain actions based on its memory of the situation.

  • How does the robot's whole body controller contribute to its stability and safety?

    -The whole body controller ensures that the robot maintains balance and stability while performing actions, such as picking up objects or moving around. It manages the robot's dynamics across all its 'degrees of freedom', which include leg movements, arm positions, and other body adjustments necessary for safe and effective task execution.

  • What is the role of the 'visuomotor transformer policies' in the robot's learned behaviors?

    -The visuomotor transformer policies are neural network policies that map visual input directly to physical actions. They take in onboard images at a high frequency and generate detailed instructions for the robot's movements, such as wrist poses and finger joint angles, enabling the robot to perform complex manipulations and react quickly to its environment.

  • What are some of the challenges that Figure AI and OpenAI aim to overcome with their partnership?

    -Figure AI and OpenAI aim to overcome challenges related to creating robots that can operate autonomously in complex, real-world environments. This includes developing AI that can understand and reason about its surroundings, learn from new experiences, and execute a wide range of physical tasks without human intervention.
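
The 'speech to speech' answer above describes a chain of speech recognition, language-model reasoning, and text-to-speech. Figure and OpenAI have not published their exact stack, so the snippet below is only a minimal sketch of that general pattern using OpenAI's public Python SDK; the model names and the respond_to_speech helper are illustrative assumptions, not Figure's implementation.

```python
# Hypothetical sketch of a speech-to-speech loop (STT -> LLM -> TTS).
# Assumes the OpenAI Python SDK (openai >= 1.0) and OPENAI_API_KEY in the environment.
# Model names are illustrative; the video never says which models Figure 01 actually runs.
from openai import OpenAI

client = OpenAI()

def respond_to_speech(audio_path: str, history: list[dict]) -> str:
    # 1) Speech to text (Whisper).
    with open(audio_path, "rb") as f:
        heard = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2) Reason over the transcript plus the prior conversation history.
    history.append({"role": "user", "content": heard})
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumption: any multimodal/chat model would slot in here
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    # 3) Text back to speech, to be played through the robot's speaker.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.write_to_file("reply.mp3")
    return reply

history = [{"role": "system", "content": "You are a helpful humanoid robot."}]
print(respond_to_speech("command.wav", history))
```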

Outlines

00:00

🤖 Introduction to Figure AI and OpenAI Partnership

The script introduces Figure AI, a robotics company that has partnered with OpenAI to combine its robotics expertise with OpenAI's AI capabilities. The video showcases the first instance of the amalgamation of metal and neural networks, highlighting the significance of the collaboration. The robot, Figure One, is seen interacting with its environment, picking up trash, and engaging in conversation, demonstrating the integration of high-level visual and language intelligence with dexterous robotic actions.

05:02

🧠 Behind the Scenes: Neural Networks and Multitasking

This paragraph delves into the technical aspects of how Figure One operates, emphasizing the role of neural networks in its functions. It explains that the robot is not pre-programmed for specific motions or responses, but rather learns from a neural network. The video clarifies that everything shown is powered by neural networks, and the robot's actions are not scripted. It also compares Figure One's capabilities to Google DeepMind's RT-2 robot, highlighting the advanced level of Figure One's learning from vision and language models.

10:02

🗣️ Conversational Abilities and Memory

The script discusses Figure One's ability to have full conversations with people, thanks to the partnership with OpenAI. It explains how the robot can describe its visual experience, plan future actions, and reflect on its memory, all verbally. The paragraph also touches on the technical aspects of how the robot processes conversation history and images to generate language responses and execute actions, highlighting the integration of large multimodal models trained by OpenAI.

15:03

🤹‍♂️ Advanced Robotics and Learning from Observation

This section of the script focuses on the advanced capabilities of Figure One, such as learning from observations and performing complex tasks like manipulating deformable objects. It explains how the robot's behaviors are learned and not teleoperated, and how it can multitask and respond to commands while carrying out actions. The script also discusses the robot's ability to understand and execute ambiguous requests, showcasing its advanced reasoning and manipulation skills.

20:04

🎉 Conclusion and Future Prospects

The script concludes with a reflection on the impressive advancements in robotics and AI, particularly the partnership between Figure AI and OpenAI. It acknowledges the skepticism around such technologies but emphasizes the significant progress made. The speaker expresses excitement about the future of embodied AI and the potential for Figure AI to scale up this technology. The script ends with a call to action for viewers interested in joining Figure AI's efforts in advancing robotics and AI.

Keywords

💡Figure AI

Figure AI is a robotics company that has partnered with OpenAI to integrate their expertise in building robotics with advanced AI capabilities. In the context of the video, this partnership has resulted in the creation of a robot that can engage in full conversations with people, understand visual inputs, and perform complex tasks. The collaboration is highlighted as a significant development in the field of robotics and AI.

💡OpenAI

OpenAI is an AI research lab that collaborates with Figure AI to enhance the robot's capabilities with high-level visual and language intelligence. OpenAI's models provide the robot with the ability to understand and process both images and text, enabling it to engage in conversations, reason, and perform tasks based on the inputs it receives.

💡Neural Networks

Neural networks are a series of algorithms modeled loosely after the human brain, designed to recognize patterns and interpret data. In the video, neural networks are used to deliver fast, low-level, dexterous robot actions, mapping pixels directly to physical movements. They are a crucial component of the robot's ability to learn and perform tasks autonomously.

💡Robotics

Robotics refers to the branch of technology that deals with the design, construction, operation, and use of robots. In the video, Figure AI is a company that specializes in robotics, combining their mechanical expertise with OpenAI's AI capabilities to create advanced humanoid robots capable of understanding and interacting with their environment.

💡Conversations

Conversations in the context of the video refer to the ability of the robot to engage in interactive, multi-turn dialogues with humans. This involves understanding language, processing inputs, and generating appropriate responses, which is facilitated by the integration of AI models from OpenAI.

💡Multitasking

Multitasking refers to the ability to perform multiple tasks simultaneously. In the context of the video, the robot demonstrates multitasking by engaging in conversation while also carrying out tasks such as picking up trash or handling objects.

💡AGI (Artificial General Intelligence)

AGI, or Artificial General Intelligence, refers to AI systems that possess the ability to understand, learn, and apply knowledge across a wide range of tasks, much like a human being. The video suggests that the robot's capabilities, such as learning from observations and performing complex tasks, are indicative of progress towards AGI.

💡Vision Language Model

A vision language model is a type of AI model that can process and understand both visual data (like images) and language data (like text). In the video, the robot uses a vision language model to interpret its surroundings and generate appropriate physical actions based on the combined understanding of visual and textual inputs.

💡Teleoperation

Teleoperation refers to the remote control of robots or other mechanical systems. In the context of the video, teleoperation is contrasted with the robot's autonomous capabilities, emphasizing that the robot's actions are not controlled by a human operator but are instead learned and executed independently.

💡Embodied AI

Embodied AI refers to AI systems that are integrated with a physical body, allowing them to interact with the world in a more human-like manner. In the video, the partnership between Figure AI and OpenAI has led to the development of embodied AI, where the robot's physical form is enhanced by AI capabilities, enabling it to perform tasks and engage in conversations.

💡Multimodal Model

A multimodal model is an AI model that can process and understand multiple types of inputs, such as images, text, and audio. In the video, the robot utilizes a large multimodal model trained by OpenAI to understand both visual and textual information, which enables it to engage in conversations and perform tasks based on the context.

Highlights

Figure AI is a robotics company that has partnered with OpenAI to combine their expertise in building robotics with AI.

The collaboration aims to showcase the amalgamation of metal and neural networks, marking a significant advancement in the field.

Figure One, a robot developed by Figure AI, is capable of having full conversations with people, demonstrating high-level visual and language intelligence.

The robot's actions are not scripted or pre-programmed, but rather determined by neural networks, allowing for independent and adaptive behavior.

The robot can describe its visual experience, plan future actions, reflect on its memory, and explain its reasoning verbally.

Figure One's capabilities are the result of a partnership with OpenAI, integrating their models to provide advanced visual and language processing.

The robot's behaviors are learned, not teleoperated: no human is guiding it through mimicry or VR, and the footage runs at normal speed.

Google DeepMind's RT-2, discussed for comparison, learns from both web data and robotics data, absorbing videos and text to understand and apply knowledge in the real world.

The robot's system involves a large multimodal model trained by OpenAI that understands both images and text.

Figure One can process the entire history of a conversation, including past images, to come up with language responses.

The robot can decide which learned closed-loop behavior to run to fulfill a given command, loading neural network weights onto the GPU and executing a policy.
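
Figure has not published how this works internally; the snippet below is only a hedged sketch of what "load the chosen behavior's weights onto the GPU and run it closed-loop" could look like. The skill registry, file paths, and PolicyNet architecture are made up for illustration.

```python
# Hypothetical sketch: a registry of learned behaviors, each stored as policy weights.
# The multimodal model picks a behavior name; we load its weights onto the GPU and
# run it closed-loop until it reports completion. Names and paths are invented.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim=512, act_dim=24):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, obs):
        return self.net(obs)

BEHAVIOR_WEIGHTS = {                      # hypothetical skill library
    "hand_object_to_person": "weights/hand_object.pt",
    "place_dishes_in_rack": "weights/place_dishes.pt",
    "pick_up_trash": "weights/pick_trash.pt",
}

def run_behavior(name: str, get_observation, send_action, done) -> None:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    policy = PolicyNet().to(device).eval()
    policy.load_state_dict(torch.load(BEHAVIOR_WEIGHTS[name], map_location=device))
    with torch.no_grad():
        while not done():                              # closed loop: observe -> act
            obs = torch.as_tensor(get_observation(), dtype=torch.float32, device=device)
            send_action(policy(obs).cpu().numpy())     # e.g. joint and finger setpoints
```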

Figure One can multitask, explaining its actions in real-time as it performs them, such as picking up trash while responding to a question.

The robot's system involves a loop of observe, decide, act, allowing it to balance while performing tasks like picking up objects.

Figure One can reflect on memory, with a pre-trained model analyzing the conversation's image and text history, showing short-term memory capabilities.
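
This short-term memory is plausibly just the accumulated conversation, text plus earlier camera frames, being re-sent to the model on every turn so that "them" and "there" can be resolved from context. A minimal sketch of that idea follows, assuming the OpenAI chat API's image-content format; it is not Figure's published code.

```python
# Minimal sketch: keep the whole multimodal conversation and resend it each turn,
# so a follow-up like "can you put them there?" can be resolved from history.
# Assumes the OpenAI Python SDK; base64 image content format per the public API docs.
import base64
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a humanoid robot describing and planning actions."}]

def add_turn(user_text: str, camera_jpeg: bytes | None = None) -> str:
    content = [{"type": "text", "text": user_text}]
    if camera_jpeg is not None:
        b64 = base64.b64encode(camera_jpeg).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    history.append({"role": "user", "content": content})
    reply = client.chat.completions.create(model="gpt-4o", messages=history)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return text

# The first turn establishes what "them" and "there" refer to; the second relies on memory.
add_turn("Where do the dishes in front of you go next?",
         camera_jpeg=open("frame.jpg", "rb").read())
print(add_turn("Great - can you put them there?"))
```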

The robot's low-level manual manipulation is driven by neural network visuomotor transformer policies, mapping pixels directly to actions.
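
Figure has not released its policy architecture, but the description (onboard images in at roughly 10 Hz, 24-DOF actions out, consumed as setpoints by a faster controller) maps naturally onto a small vision transformer with an action head. The sketch below is an illustrative stand-in with arbitrary dimensions, not the real network.

```python
# Illustrative stand-in for a visuomotor transformer policy: pixels in, 24-DOF actions out.
# Patch size, width, and depth are arbitrary; only the interface follows the video's
# description (camera frames -> wrist poses / finger joint angles as setpoints).
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, act_dim=24):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # ViT-style patch embed
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(dim, act_dim)   # 24 degrees of freedom

    def forward(self, image):                        # image: (B, 3, 224, 224)
        tokens = self.patchify(image).flatten(2).transpose(1, 2) + self.pos
        features = self.encoder(tokens).mean(dim=1)  # pool over patch tokens
        return self.action_head(features)            # setpoints for the whole-body controller

policy = VisuomotorPolicy().eval()
with torch.no_grad():
    action = policy(torch.randn(1, 3, 224, 224))     # one camera frame -> one 24-DOF action
print(action.shape)                                   # torch.Size([1, 24])
```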

Figure One's whole body controller ensures safe, stable dynamics, allowing it to maintain balance during tasks like grasping and lifting.
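
The whole-body controller is described only as tracking the policy's setpoints at a higher rate while keeping the robot stable. A real humanoid controller handles dynamics, contact forces, and balance optimization; the toy below shows only the simplest piece of the idea, a high-rate PD loop tracking the latest 24-DOF setpoints, as an assumption for illustration.

```python
# Toy illustration of the rate split: a policy updates 24-DOF setpoints at ~10 Hz,
# while a faster loop (here 200 Hz) tracks them with PD control. A real whole-body
# controller also handles balance, contacts, and torque limits; this omits all of that.
import numpy as np

DOF = 24
KP, KD = 40.0, 2.0          # illustrative gains
DT = 1.0 / 200.0            # 200 Hz inner loop

q = np.zeros(DOF)           # joint positions
dq = np.zeros(DOF)          # joint velocities
setpoint = np.zeros(DOF)

def pd_step(q, dq, target):
    torque = KP * (target - q) - KD * dq
    dq = dq + torque * DT   # pretend unit inertia, no gravity or contacts
    q = q + dq * DT
    return q, dq

for tick in range(1000):            # 5 seconds of simulated control
    if tick % 20 == 0:              # every 20 ticks = 10 Hz: new setpoint from the policy
        setpoint = np.random.uniform(-0.5, 0.5, DOF)   # stand-in for policy output
    q, dq = pd_step(q, dq, setpoint)

print(np.round(q[:6], 3))           # first few joint angles after tracking
```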

The partnership between Figure AI and OpenAI is expected to significantly advance the field of embodied AI, with Figure AI being a prime location for scaling up this technology.

The demonstration of Figure One's capabilities has shifted perceptions of advanced robotics, with even critics reconsidering their past skepticism.

Transcripts

00:00

So Figure AI is the robotics company that partnered up with OpenAI to combine their expertise in building robots with OpenAI's, well, AI. This is the first time we can behold the amalgamation of metal and neural nets. Now I'm exaggerating, but this is kind of a big deal. Take a look.

00:30

Human: Hey, Figure 01, what do you see right now?
Figure 01: I see a red apple on a plate in the center of the table, a drying rack with cups and a plate, and you standing nearby with your hand on the table.
Human: Great. Can I have something to eat?
Figure 01: Sure thing.
Human: Great. Can you explain why you did what you just did, while you pick up this trash?
Figure 01: On it. So, I gave you the apple because it's the only, uh, edible item I could provide you with from the table.
Human: Great. So, based on the scene right now, where do you think the dishes in front of you go next?
Figure 01: The dishes on the table, like that plate and cup, are likely to go into the drying rack next.
Human: Great. Can you put them there?
Figure 01: Of course.
Human: All right, so how do you think you did?
Figure 01: I... I think I did pretty well. The apple found its new owner, the trash is gone, and the tableware is right where it belongs.
Human: I agree. Thanks, Figure 01.
Figure 01: You're welcome. If there's anything else you need help with, just let me know.

02:55

So they're saying that, with OpenAI, Figure 01 can now have full conversations with people. OpenAI models provide high-level visual and language intelligence; Figure's neural networks deliver fast, low-level, dexterous robot actions. Everything in this video is a neural network. This, I think, is as clear as it gets. In previous videos I maybe questioned exactly what was happening, like I wished they had spelled it out, but this is pretty clear in what they're saying here: everything here is a neural network, meaning the robot isn't scripted. It's not pre-programmed for certain motions, and it's not pre-programmed to answer in a certain way. That's, you know, ChatGPT, that's GPT-4 or higher maybe, that's answering the questions, and then the robot is moving based on the neural networks that Figure AI has, it sounds like, already independently developed. And just to make sure the people asking questions know: this is filmed at 1.0x speed, end-to-end neural networks, speech-to-speech reasoning, and this is the Figure 01 robot plus OpenAI. Look, I think it's fair that people are skeptical. We had somebody, you know, roll a truck off a hill and then claim "this thing really functions," and it did not. But at this point it's pretty obvious, I feel, to most people who have been following these developments that, okay, they really have the real deal.

04:19

For example, with Google DeepMind's RT-2 robot, they kind of explain what's running it: they have this vision-language-action model. You can almost think of it as a large language model with vision, with an action model on top that takes the vision-language model's reasoning and seeing and translates it into movements of the hand and the grippers, et cetera. So text goes in here, and again, this doesn't have anything to do with Figure, this is Google DeepMind, but maybe Figure has something similar. You have the large language model here, and all the speech-to-speech, so here's what it should do, and the language model reasons through it. You also have the vision input that combines with it, and the output is in terms of the movements of the robot, movements in physical space. That's why you're able to say somewhat weird things like, you know, "move banana": as you can see in this image, there are a couple of flags on the table and there's a banana, and you give the command "move banana to Germany," and it figures out what you mean, so it puts it on top of the correct flag. It doesn't pick up the banana and head to Frankfurt. So I would be very curious to know if they're doing something like this. Now, of course, it's a small startup; they're not going to post all their insider secrets on the internet. Maybe sometime in the future we'll find out. But it is getting pretty impressive, like when you tell it to pick up the trash while responding to a question. You tell me to do that, and that's something I might struggle with myself; multitasking is hard.
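
RT-2 itself is trained end to end to emit action tokens, so the sketch below is not how RT-2 (or Figure) actually works. It only illustrates the grounding half of the "move banana to Germany" idea: hand a general multimodal chat model the camera image plus the odd instruction and let it name the target. The OpenAI SDK and model name here are assumptions for illustration.

```python
# Simplified sketch of instruction grounding with a generic vision-language chat model:
# image + instruction in, a named target out. A downstream policy would then turn that
# target into actual gripper motion. Not DeepMind's or Figure's code.
import base64
from openai import OpenAI

client = OpenAI()

def ground_instruction(instruction: str, image_path: str) -> str:
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Instruction: {instruction}. "
                         "Which object or location in the image should the gripper move to? "
                         "Answer with a short phrase."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(ground_instruction("move banana to Germany", "table_with_flags.jpg"))
# Expected style of answer: "the German flag on the table".
```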

06:00

So it looks like this Google DeepMind RT-2 paper was published in July 2023. And here's Corey Lynch. Corey Lynch works at Figure AI, and he's describing how this robot is able to do what it's doing. Corey's background was robotics at Google, and it looks like he joined Figure AI around July 2023. Crazy. So he's saying, "We are now having full conversations with Figure 01." Oh, is this him, I wonder? That might be him interacting with the robot. He's saying: we are now having full conversations with Figure 01, thanks to our partnership with OpenAI. Our robot can describe its visual experience, plan future actions, reflect on its memory, and explain its reasoning verbally. Technical deep-dive thread.

06:47

So he's saying, let's break down what we see in the video. All the behaviors are learned, not teleoperated. This is kind of a sticking point with some of these demonstrations: we've seen some incredible things done by robots at fairly low price points, but they're done with teleoperation. For example, here you're seeing a robot cooking a three-course Cantonese meal, grilling up some delicious steaks or chicken, eggs, vegetables, cracking some eggs. This is absolutely incredible, but it is teleoperated, and what that means is, for example, as you can see here, there's a person standing behind it mimicking those gestures up in the air, and the robot repeats them. Another way of doing it is with virtual reality, where you demonstrate the movements and the robot repeats them. The robots can learn from this and can generalize, but that's quite different from something like RT-2 from Google DeepMind. You can see here what it's showing you: it's sucking up all the information, watching all these videos and all the previous robots and Wikipedia and YouTube and everything. As they say here, it learns from both web and robotics data. So it just looks at things and reads things and it knows things; it drinks and it knows things.

08:06

So they're specifying here: okay, it's not teleoperated, and it's running at normal speed, so the camera footage is not sped up. And again, we all appreciate that; people are skeptical because they've been burnt before, so I think it helps to have little disclaimers like this somewhere. Because for the people who follow this stuff... I think to most people this is probably what they think of as peak advanced robotics, right, the dancing robots. Somebody who isn't really following looks at this and goes, okay, this is probably the most advanced thing there ever is, look at the motions. But the people who are following this stuff are like, yeah, cool, but this thing learning from watching videos how to operate in the real world is much cooler. It being able to generalize: if you say "I need to hammer a nail, what objects from this scene might be useful," it says "let's grab a rock," and then it prints out the action of how to use that rock to hammer whatever it needs to hammer, kind of using that chain-of-thought reasoning, which is something we know from interacting with large language models like ChatGPT. You know that ChatGPT can probably figure out that the rock can do something like this, and then it links that to the execution, the code that manipulates that rock in the physical world. So this idea of a vision-language model, you know, it may not be pretty, but this is what peak performance looks like, and it seems like Figure is there. There's definitely competition for these general robots, so I think people are beginning to refer to them offhand as "AGI robots," as in, to point out that these are general robots. As cool as these are, they probably wouldn't say this about the Boston Dynamics robots: as impressive as the movements are, the movements are pre-scripted; they're following a routine.

09:47

So he continues: we feed images from the robot's cameras and transcribed text from speech captured by the onboard microphones to a large multimodal model trained by OpenAI that understands both images and text. It's interesting that they don't spell out which model; they don't say GPT-4 with vision, they just say "a model." Maybe Jimmy Apples will clue us in. I can't find the tweet, but somewhere a while back he was saying something like "there's a robot heading to your favorite AI lab," and I feel like maybe it was related to this. So, we feed images from the robot's cameras and transcribed text from the speech captured by the onboard microphones: this is a little detail that's interesting, because OpenAI has their open-source Whisper, which is an audio-to-text transcription model; I believe it's open source as part of their suite of products. In a previous video I tried to paint AGI as a series of parts that you're trying to combine together, and this is kind of a good representation of it. Think about what we have here: we have Figure, this robotics company, likely running with OpenAI's Whisper, a large language model, call it GPT-4, could be something else, something specific with vision that can understand images and text, and all of that is hooked into the robot. So all the pieces are working together to create this robot with a general ability, which is kind of what we think of AGI as. We don't have a specific definition, but obviously this is moving in that direction. If it can walk around and learn and reason, if it can hear commands and then respond to them, are we there? Are we almost there? At the very least, close.

11:37

The model processes the entire history of the conversation, including past images, to come up with language responses, which are spoken back to the human via text-to-speech. The same model is responsible for deciding which learned, closed-loop behavior to run on the robot to fulfill a given command, loading particular neural network weights onto the GPU and executing a policy. And this last part, to me, reads as different from what Google DeepMind has, where they kind of tokenize the movements. Here's from Google DeepMind: to control a robot, it must be trained to output actions. We address this challenge by representing actions as tokens in the model's output, similar to language tokens, and describe actions as strings that can be processed by standard natural language tokenizers, shown here. So if somebody super smart in the comments can unpack this a little bit, just put the word GPU somewhere in the comment and I'll search by that. But what does this tell us? This is something different from RT-2, and I'll do my best to post an update if we get more information. To me, I'm reading this as: there's maybe a finite, pre-trained number of actions that it has, like "pick up" or, I don't know, "pour some liquid into this thing" or "push a button," and then the GPT model, the OpenAI model, selects which specific thing to run and then runs it. I don't know if that's true, I might be misreading it, but we'll know more soon, hopefully.
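
The DeepMind quote above says actions are emitted as strings of tokens that a standard tokenizer can handle. In the RT-2 paper, an action is eight integers (a terminate flag, position and rotation deltas, and gripper extension), each discretized into 256 bins. Below is a rough, illustrative decoder for such a string; the example string and the bin-to-unit ranges are invented for the sketch, not taken from the paper.

```python
# Rough illustration of RT-2-style action de-tokenization: the model prints a string of
# eight integers (terminate, dx, dy, dz, droll, dpitch, dyaw, gripper), each a bin in 0-255.
# The ranges below are invented for the example; the paper defines its own scaling.
RANGES = {
    "dx": (-0.05, 0.05), "dy": (-0.05, 0.05), "dz": (-0.05, 0.05),            # metres per step
    "droll": (-0.25, 0.25), "dpitch": (-0.25, 0.25), "dyaw": (-0.25, 0.25),   # radians per step
    "gripper": (0.0, 1.0),                                                     # open fraction
}

def unbin(token: int, lo: float, hi: float, bins: int = 256) -> float:
    return lo + (token / (bins - 1)) * (hi - lo)

def decode_action(action_string: str) -> dict:
    tokens = [int(t) for t in action_string.split()]
    terminate, *rest = tokens
    action = {"terminate": bool(terminate)}
    for name, token in zip(RANGES, rest):
        action[name] = round(unbin(token, *RANGES[name]), 4)
    return action

# Illustrative action string in the same spirit as the paper's figure.
print(decode_action("0 128 91 241 5 101 127 255"))
```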

13:05

So: speech-to-text. This is where the person says "can I have something to eat," and that feeds into the OpenAI model, which does common-sense reasoning from images; it responds "sure thing." Then it goes into behavior selection, so it has these pre-existing, it seems, neural network policies for fast, dexterous manipulation, and those lead into the whole-body controller for safe, stable dynamics, meaning whole-body movement from the legs and arms and so on, balancing while it's picking stuff up. At the same time the vision, what the robot is seeing as it acts, feeds back into the GPT/OpenAI model and the neural network policies, which is, I guess, kind of an OODA loop: observe, orient, decide, act. All right, so here's the command, here's the acknowledgement, "sure," and then it goes into this loop to complete the task. But interestingly, it sounds like it can multitask, so it can tell you what it's doing as it's doing it.
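
Pulling that walkthrough together, here is a compact, purely hypothetical sketch of the loop being described: observe (microphone and cameras), decide (the multimodal model picks what to say and which learned behavior to run), act (the policy's setpoints are tracked by the whole-body controller). Every name below is a placeholder, not Figure's actual API.

```python
# Hypothetical top-level loop for the architecture sketched in the video.
# observe -> decide -> act, with new observations feeding back in each cycle.
# All objects and methods here are placeholders standing in for real components.

def robot_main_loop(stt, multimodal_model, tts, behavior_library, whole_body_controller, sensors):
    history = []                                     # text + image conversation memory
    while True:
        frame = sensors.latest_camera_frame()        # observe
        utterance = stt.transcribe(sensors.latest_audio())
        history.append({"image": frame, "heard": utterance})

        decision = multimodal_model.decide(history)  # decide: what to say, which skill to run
        if decision.speech:
            tts.say(decision.speech)                 # talk while acting (multitasking)
        if decision.behavior:
            policy = behavior_library.load(decision.behavior)   # e.g. "pick_up_trash"
            while not policy.done():
                setpoints = policy.step(sensors.latest_camera_frame())  # ~10 Hz
                whole_body_controller.track(setpoints)                  # ~200 Hz, keeps balance
```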

14:07

So they continue: connecting Figure 01 to a large pre-trained multimodal model gives us some interesting new capabilities. Again, that's the big point here: we have a company good at building metal robots, Figure, and a company that's excellent at building AI neural nets, OpenAI, and one plus the other. Now it's able to describe its surroundings; use common-sense reasoning when making decisions, for example "the dishes on the table, like that plate and cup, are likely to go into the drying rack next"; and translate ambiguous, high-level requests like "I'm hungry" into some context-appropriate behavior, like handing the person an apple. That was the big test with RT-2 / RT-X, whatever you want to call the RT series of models: you could give it a general request. I think one of the examples was they said "I'm tired," and it replied, okay, we have a Red Bull, would you like a Red Bull, which, you know, shows that we make them in our image, don't we. It can also describe why it executed a particular action in plain English, for example, "it was the only edible item I could provide you with from the table." And a large pre-trained model that understands conversation history gives Figure 01 a powerful short-term memory. Consider the question "can you put them there?": what does "them" refer to, and where is "there"? Answering correctly requires the ability to reflect on memory. With a pre-trained model analyzing the conversation's image and text history, Figure 01 quickly forms and carries out the plan: place the cup on the drying rack, place the plate on the drying rack; so, steps one and two.

15:40

So I'm rewinding this to see. When GPT-4 Vision first came out, there was a report testing a lot of its abilities, and one thing they found is that it's really good at figuring out what we're pointing at. If you're watching on video, if I do this, like I'm pointing at something, you know which word I'm pointing to, and the vision models also pick up on that very well. If I have an arrow pointing to something, you know what I'm pointing to, and so do the models; they understand it very well, instead of, for example, needing to be shown coordinates or having something circled, or anything like that. Basically, just an arrow or something pointing at something works very, very well. So I was curious whether he was pointing at all or whether it was just in context, and I'm not seeing any visual clues other than the verbal command. But I'm going to make my prediction now: we as a society, as humans, will be pointing a lot more. I'm just guessing, but if you point at something and you state an action, that might be the easiest way to give the most information about what needs to be done.

16:44

So next they continue: finally, let's talk about the learned, low-level bimanual manipulation. All behaviors are driven by neural network visuomotor transformer policies, mapping pixels directly to actions. These networks take in onboard images at 10 Hz and generate 24-DOF actions (in robotics that's degrees of freedom, for example wrist poses and finger joint angles) at 200 Hz. These actions serve as high-rate setpoints for the even higher-rate whole-body controller to track. This is a useful separation of concerns: internet pre-trained models do common-sense reasoning over images and text to come up with a high-level plan, and learned visuomotor policies execute the plan, performing fast, reactive behaviors that are hard to specify manually, like manipulating a deformable bag in any position. So, like, if you're picking up a bag of potato chips, it's not hard, it's not solid, so it will deform as you grab it. You're not going to be able to grab it if you don't apply enough pressure, and if you apply too much pressure you'll pop it. Those look like trash, maybe bags of potato chips, and it's grabbing them; even though the bag is kind of crumpling under the weight, the robot is still able to grasp it and throw it in the trash.

18:02

They continue: meanwhile, a whole-body controller ensures safe, stable dynamics, for example maintaining balance. Even just a few years ago, I would have thought that having a full conversation with a humanoid robot while it plans and carries out its own fully learned behaviors would be something we would have to wait decades to see. Obviously, a lot has changed. In my mind, Figure is the best place in the world right now to scale up embodied AI. And so they are recruiting, recruiting hard; you can go to figure.ai/careers.

18:33

Yeah, I mean, it's very exciting, even more so with the new partnership with OpenAI, which they announced; it looks like the first time we heard of it was February 29th. They're saying OpenAI plus humanoid robots, and at the time they posted this, you know, the video where the robot makes a cup of coffee. I think it was well received, although there was some criticism: there was a lot about that demonstration that is fairly simple. The coffee maker they used, that specific model, is often used for robotic demonstrations because it's made to be very simple to operate. You don't need multiple digits, you can have just one claw-like appendage; you basically need one finger to operate it. Assuming the K-Cup is already in there, you just need to push the handle down, push a button, and it's done. I think even at the time I said I wasn't sure how impressive it was, just from that demonstration alone, kind of out of context. But now it's definitely shaping up and getting a lot more exciting. Also, I find myself regretting more and more every time I've said anything negative about robots in the past; now that it's getting better and it's talking, you start reconsidering some of your past words, like, maybe I should not be talking crap about this thing. So with that said, I think Figure AI has excellent robots, amazing robots, I love the robots. Maybe now Boston Dynamics will be motivated enough to learn a new TikTok dance... crap, I've got to stop doing that. Boston Dynamics is great; Boston Dynamics makes good robots. Anyways, my name is Wes Roth, thank you for watching.


Related tags
Robotics, AI Partnership, Neural Networks, Humanoid Robots, Conversational AI, Multitasking Robots, Vision Language Model, Embodied AI, OpenAI Collaboration, Tech Advancements