LLMs are not superintelligent | Yann LeCun and Lex Fridman

Lex Clips
10 Mar 2024 | 26:20

Summary

TL;DR: The transcript discusses the limitations of large language models (LLMs) in achieving superhuman intelligence. It highlights that while LLMs can process vast amounts of text, they lack the ability to understand the physical world, possess persistent memory, reason, and plan effectively. The speaker argues that intelligence requires grounding in reality and that most knowledge is acquired through sensory input and interaction with the world, not just language. They also touch on the challenges of creating AI systems that can build a comprehensive world model and the current methods being explored to improve AI's understanding and interaction with the physical environment.

Takeaways

  • 🤖 Large language models (LLMs) like GPT-4 and LLaMa 2/3 are not sufficient for achieving superhuman intelligence due to their limitations in understanding, memory, reasoning, and planning.
  • 🧠 Human and animal intelligence involves understanding the physical world, persistent memory, reasoning, and planning, which current LLMs lack.
  • 📚 LLMs are trained on vast amounts of text data, but this is still less than the sensory data a four-year-old processes, highlighting the importance of non-linguistic learning.
  • 📈 Language is a compressed form of information, but it is an approximate representation of our mental models and percepts, suggesting that more than language is needed for true intelligence.
  • 🚀 There is a debate among philosophers and cognitive scientists about whether intelligence needs to be grounded in reality, with the speaker advocating for a connection to physical or simulated reality.
  • 🤔 The complexity of the world is difficult to represent, and current LLMs are not trained to handle the intricacies of intuitive physics or common-sense reasoning about physical space.
  • 🛠️ LLMs are trained using an autoregressive prediction method, which differs from human thought processes that are not strictly tied to language.
  • 🌐 Building a complete world model requires more than just predicting words; it involves observing and understanding the world's evolution and predicting the consequences of actions.
  • 🔍 Current methods for training systems to learn representations of images by reconstruction from corrupted versions have largely failed, indicating a need for alternative approaches.
  • 🔗 Joint embedding predictive architecture (JEPA) is a promising alternative to reconstruction-based training: it trains a predictor to predict the representation of the full input from the representation of a corrupted one.

Q & A

  • What are the key characteristics of intelligent behavior mentioned in the transcript?

    -The key characteristics of intelligent behavior mentioned are the capacity to understand the world, the ability to remember and retrieve things (persistent memory), the ability to reason, and the ability to plan.

  • Why are Large Language Models (LLMs) considered insufficient for achieving superhuman intelligence?

    -LLMs are considered insufficient because they either lack, or possess only in a primitive form, the essential characteristics of intelligence: understanding the physical world, persistent memory, reasoning, and planning.

  • How does the amount of data a four-year-old processes visually compare to the data used to train LLMs?

    -A four-year-old processes approximately 10^15 bytes of visual data, significantly more than the roughly 2 * 10^13 bytes of text used to train LLMs, an amount of text that would take a person about 170,000 years to read at eight hours a day.

  • What is the argument against the idea that language alone contains enough wisdom and knowledge to construct a world model?

    -The argument is that language is a compressed and approximate representation of our percepts and mental models. It lacks the richness of the environment and most of our knowledge comes from observation and interaction with the real world, not just language.

  • What is the debate among philosophers and cognitive scientists regarding the grounding of intelligence?

    -The debate is whether intelligence needs to be grounded in reality, with some arguing that intelligence cannot appear without some grounding, whether physical or simulated, while others may not necessarily agree with this.

  • Why are tasks like driving a car or clearing a dishwasher more challenging for AI compared to passing a bar exam?

    -These tasks are more challenging because they require intuitive physics and common-sense reasoning about the physical world, which LLMs currently lack. They are trained on text and do not understand intuitive physics as well as humans do.

  • How do LLMs generate text?

    -LLMs generate text through an autoregressive prediction process where they predict the next word based on the previous words in a text, using a probability distribution over possible words.

  • What is the difference between the autoregressive prediction of LLMs and human speech planning?

    -Human speech planning involves thinking about what to say independent of the language used, while LLMs generate text one word at a time based on the previous words without an overarching plan.

  • What is the fundamental limitation of generative models in video prediction?

    -The fundamental limitation is that the world is incredibly complex and rich in information compared to text. Video is high-dimensional and continuous, making it difficult to represent distributions over all possible frames in a video (see the latent-variable sketch after this Q&A list for one workaround that was tried).

  • What is the concept of joint embedding and how does it differ from traditional image reconstruction methods?

    -Joint embedding involves encoding both the full and corrupted versions of an image and training a predictor to predict the representation of the full image from the representation of the corrupted one. This differs from traditional methods that try to reconstruct a good image from a corrupted version in pixel space, which has proven ineffective for learning good generic image features.
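
Referring back to the video-prediction question above: one approach the conversation mentions (and describes as a failure) is to give a frame predictor a latent variable meant to carry everything about the scene it cannot yet perceive. The sketch below only illustrates the shape of that idea; the dimensions, layer sizes, and pixel-space loss are arbitrary choices of ours, not FAIR's actual models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentVideoPredictor(nn.Module):
    """Toy latent-variable frame predictor: next_frame = f(frame, z), where z
    stands for the unobserved details (textures, off-screen content)."""
    def __init__(self, frame_dim=1024, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frame, z):
        return self.net(torch.cat([frame, z], dim=-1))

predictor = LatentVideoPredictor()
frame = torch.randn(4, 1024)               # a batch of flattened frames
z = torch.randn(4, 32)                     # sampled latents
pred_next = predictor(frame, z)
target_next = torch.randn(4, 1024)         # real next frames would go here
loss = F.mse_loss(pred_next, target_next)  # pixel-space loss: the part that,
                                           # per the conversation, did not work well
```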

Outlines

00:00

🤖 Limitations of Large Language Models (LLMs)

The speaker discusses the limitations of autoregressive LLMs like GPT-4 and Llama 2/3 in achieving superhuman intelligence. They lack essential characteristics of intelligence such as understanding the physical world, persistent memory, reasoning, and planning. Despite their inability to fully understand or interact with the world, LLMs are useful and can support an ecosystem of applications. The speaker also compares the amount of data LLMs are trained on to the sensory input a four-year-old receives, highlighting that most knowledge comes from observation of and interaction with the real world, not language.
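
The figures quoted here can be checked with quick back-of-the-envelope arithmetic. The sketch below uses the numbers from the conversation (a 20 MB/s optic-nerve rate, 16,000 waking hours, 10^13 tokens at two bytes each); the reading speed and words-per-token values are our own assumptions.

```python
# Rough arithmetic behind the sensory-data vs. text-data comparison.

optic_nerve_bytes_per_s = 20e6          # ~20 MB/s, figure quoted in the talk
awake_hours = 16_000                    # a four-year-old's waking hours
visual_bytes = optic_nerve_bytes_per_s * awake_hours * 3600
print(f"visual input: {visual_bytes:.1e} bytes")    # ~1.2e15, i.e. ~10^15

tokens = 1e13                           # LLM training corpus, in tokens
text_bytes = tokens * 2                 # ~2 bytes per token
print(f"text corpus:  {text_bytes:.1e} bytes")      # 2e13

# Reading time, assuming ~250 words/min and ~0.75 words per token (our guess).
words = tokens * 0.75
minutes = words / 250
years = minutes / 60 / 8 / 365          # at 8 hours of reading per day
print(f"reading time: {years:,.0f} years")          # ~170,000 years
```

With those inputs the visual total lands near 10^15 bytes and the reading time near 170,000 years, matching the figures quoted above.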

05:01

🌐 The Debate on Grounding Intelligence in Reality

The speaker explores the debate on whether intelligence needs to be grounded in reality, arguing that it does. They point out that language is an approximate representation of our mental models and that much of our knowledge comes from physical interaction with the world. The speaker also touches on the challenges of representing the complexities of the real world in AI and the limitations of current LLMs in understanding intuitive physics and common sense reasoning.

10:03

📈 The Training Process of LLMs

The speaker explains the training process of LLMs, which involves predicting missing words in a text. This autoregressive prediction method allows the model to generate text one word at a time. The speaker contrasts this with human thought processes, which are not tied to language and involve planning and mental models. They argue that LLMs lack this higher level of abstraction and planning, which is crucial for true intelligence.
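
As a concrete illustration of the loop described above, here is a minimal sketch of autoregressive sampling; `next_token_distribution` is a hypothetical stand-in for a trained model, not any real library's API.

```python
import numpy as np

VOCAB_SIZE = 50_000
rng = np.random.default_rng(0)

def next_token_distribution(context: list[int]) -> np.ndarray:
    """Hypothetical stand-in for a trained LLM: given the tokens so far,
    return a probability distribution over the whole vocabulary."""
    logits = rng.normal(size=VOCAB_SIZE)          # a real model would compute these
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(prompt: list[int], n_new: int) -> list[int]:
    """Autoregressive generation: sample a token from the predicted
    distribution, append it to the input, and repeat."""
    tokens = list(prompt)
    for _ in range(n_new):
        probs = next_token_distribution(tokens)
        token = int(rng.choice(VOCAB_SIZE, p=probs))
        tokens.append(token)                      # the sampled token is fed back in
    return tokens

print(generate([101, 2023, 2003], n_new=5))
```

The property the speaker stresses is visible in the loop body: each sampled token is simply appended and fed back in, with no separate planning stage.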

15:04

🚀 Building World Models and Predicting Actions

The speaker discusses the concept of building world models for AI, which involves understanding and predicting how the world evolves as a consequence of actions. They argue that a world model can probably be built through prediction, but most likely not by predicting words, because language is a low-bandwidth medium that carries too little information. The speaker also mentions the challenge of representing distributions over high-dimensional continuous spaces, which is necessary for video and image understanding.
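
The formulation "state at time t plus a candidate action gives a predicted state at time t+1" suggests a simple interface, sketched below. This is illustrative only; the random linear dynamics and the brute-force search over candidate action sequences are our simplifications, not the speaker's proposal.

```python
import numpy as np

class ToyWorldModel:
    """Toy abstract world model in the spirit of the description above:
    s_next = f(s, a). The random linear map stands in for a learned predictor."""
    def __init__(self, state_dim: int, action_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(state_dim, state_dim))
        self.B = rng.normal(scale=0.1, size=(state_dim, action_dim))

    def predict(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Predicted (abstract) state of the world at time t+1.
        return np.tanh(self.A @ state + self.B @ action)

def plan(model: ToyWorldModel, state: np.ndarray, candidate_sequences, cost_fn):
    """Planning by imagination: roll every candidate action sequence through
    the model and keep the one whose predicted end state has the lowest cost."""
    best_seq, best_cost = None, float("inf")
    for seq in candidate_sequences:
        s = state
        for a in seq:
            s = model.predict(s, a)
        c = cost_fn(s)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq

model = ToyWorldModel(state_dim=8, action_dim=2)
state = np.zeros(8)
candidates = [[np.random.randn(2) for _ in range(3)] for _ in range(5)]
goal = np.ones(8)
best = plan(model, state, candidates, cost_fn=lambda s: np.sum((s - goal) ** 2))
```

Note that the state here is an abstract representation, not a full description of the world, which matches the point that the model only needs to capture what is relevant for planning.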

20:04

🔍 The Failure of Self-Supervised Image Reconstruction

The speaker addresses the failure of self-supervised methods in learning good image representations by reconstructing corrupted images. They compare this to the success of LLMs in text prediction and argue that the same approach does not work for images due to the high dimensionality and complexity of visual data. The speaker then introduces the concept of joint embedding, which involves training a system to predict the representation of a full image from a corrupted version, as a potential solution to this problem.
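
To make the contrast concrete, here is a rough sketch of the two recipes discussed in this segment: reconstruction-based self-supervision (mask the input, regress the original pixels) versus a joint-embedding predictive setup (predict the representation of the full input from the representation of the corrupted one). This is illustrative only, not FAIR's actual MAE or JEPA implementations; the layer sizes, random masking, and stop-gradient on the target branch are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(in_dim=784, embed_dim=128):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, embed_dim))

class ReconstructionSSL(nn.Module):
    """What the conversation says does NOT yield good features:
    corrupt the image, then regress the original pixels."""
    def __init__(self, in_dim=784, embed_dim=128):
        super().__init__()
        self.encoder = make_encoder(in_dim, embed_dim)
        self.decoder = nn.Linear(embed_dim, in_dim)

    def loss(self, x, mask):
        recon = self.decoder(self.encoder(x * mask))
        return F.mse_loss(recon, x)                  # loss lives in pixel space

class JEPASketch(nn.Module):
    """The joint-embedding alternative: encode both views and predict the
    representation of the full view from that of the corrupted view."""
    def __init__(self, in_dim=784, embed_dim=128):
        super().__init__()
        self.encoder = make_encoder(in_dim, embed_dim)
        self.target_encoder = make_encoder(in_dim, embed_dim)
        self.predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                       nn.Linear(embed_dim, embed_dim))

    def loss(self, x, mask):
        pred = self.predictor(self.encoder(x * mask))
        with torch.no_grad():                        # no gradient through the target
            target = self.target_encoder(x)          # branch, one anti-collapse trick
        return F.mse_loss(pred, target)              # loss lives in representation space

x = torch.randn(32, 784)
mask = (torch.rand_like(x) > 0.75).float()           # zero out ~75% of the input
print(ReconstructionSSL().loss(x, mask).item())
print(JEPASketch().loss(x, mask).item())
```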

25:05

🔄 Contrastive and Non-Contrastive Learning Methods

The speaker discusses the limitations of contrastive learning methods, which involve training representations to be similar for similar images and dissimilar for different images. They mention the emergence of non-contrastive methods that do not require negative samples and rely on other techniques to prevent system collapse. The speaker highlights the development of several new methods over the past few years that can improve the training of such systems.
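
The two families of methods summarized above can be sketched as loss functions: a contrastive loss that needs both matched and mismatched pairs, and a non-contrastive loss that uses only matched pairs plus regularization of the embedding statistics to prevent collapse (in the spirit of variance/covariance regularization). Both are simplified illustrations with arbitrary coefficients, not the loss of any specific published method.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, same, margin=1.0):
    """z_a, z_b: embeddings of paired inputs; same[i] = 1 if pair i shows two
    views of the same image, 0 if it shows two different images."""
    d = F.pairwise_distance(z_a, z_b)
    pull = same * d.pow(2)                           # matched pairs: shrink distance
    push = (1 - same) * F.relu(margin - d).pow(2)    # mismatched pairs: enforce a margin
    return (pull + push).mean()

def noncontrastive_loss(z_a, z_b, eps=1e-4):
    """Only matched pairs; collapse is discouraged by keeping per-dimension
    variance up and decorrelating embedding dimensions."""
    invariance = F.mse_loss(z_a, z_b)                # the two views should agree

    def variance(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(1.0 - std).mean()              # hinge: keep each std >= 1

    def covariance(z):
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d             # push off-diagonal terms to 0

    return invariance + variance(z_a) + variance(z_b) + covariance(z_a) + covariance(z_b)

z_a, z_b = torch.randn(16, 64), torch.randn(16, 64)
same = (torch.rand(16) > 0.5).float()
print(contrastive_loss(z_a, z_b, same).item())
print(noncontrastive_loss(z_a, z_b).item())
```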

Keywords

💡Artificial Intelligence (AI)

AI refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. In the video, the speaker discusses the limitations of current AI systems, particularly Large Language Models (LLMs), in achieving superhuman intelligence due to their inability to understand the physical world, reason, and plan effectively.

💡Large Language Models (LLMs)

LLMs are a class of AI models designed to process and generate human-like text based on the data they have been trained on. The video highlights that despite their vast knowledge from training on large datasets, LLMs like GPT-4 and LLaMa lack the essential characteristics of intelligence, such as persistent memory, reasoning, and planning.

💡Autoregressive Prediction

This term refers to a training method where AI models learn to predict the next element of a sequence, such as the next token in a text, from the elements that precede it. The speaker argues that autoregressive prediction, as used in LLMs, is not sufficient for creating truly intelligent systems because it does not involve understanding or interacting with the physical world.

💡Persistent Memory

Persistent memory is the ability to retain and recall information over time. The video points out that LLMs lack this capability, which is crucial for intelligent behavior, as they cannot remember or retrieve information in the same way humans or animals can.

💡Reasoning

Reasoning is the process of forming conclusions or making judgments based on available information. The speaker emphasizes that LLMs do not possess true reasoning abilities, which are fundamental to intelligent systems as they cannot make inferences or predictions beyond the patterns they have been trained on.

💡Planning

Planning involves the ability to think ahead and organize actions to achieve a desired outcome. The video discusses how LLMs are incapable of planning, which is a key aspect of intelligent behavior, as they cannot anticipate the consequences of actions or create strategies for future events.

💡Sensory Input

Sensory input refers to the information received through the senses, such as sight, hearing, touch, etc. The speaker compares the amount of sensory input a child receives to the data processed by LLMs, highlighting that our understanding of the world is largely derived from sensory experiences rather than language alone.

💡Embodied AI

Embodied AI is an approach to AI development that focuses on creating systems with physical bodies that can interact with the environment. The video suggests that true intelligence may require grounding in reality, and some researchers believe that AI systems need to be embodied to achieve human-level intelligence.

💡Intuitive Physics

Intuitive physics refers to the innate understanding of the physical world's principles, such as gravity and motion. The speaker argues that LLMs lack this understanding, which is a significant barrier to creating AI systems capable of navigating and interacting with the physical world effectively.

💡Joint Embedding Predictive Architecture (JEPA)

JEPA is an approach to training AI systems in which both a full input and a corrupted or transformed version are encoded, and a predictor is trained to predict the representation of the full input from the representation of the corrupted one. The video discusses this as a potential solution to the limitations of current AI systems, particularly in the context of image and video processing.

Highlights

Large language models (LLMs) like GPT-4 and LLaMa 2/3 are not the path to superhuman intelligence due to their limitations in understanding the physical world, memory, reasoning, and planning.

LLMs are trained on vast amounts of text data, but this is not as rich as the sensory input a human experiences, especially during early childhood.

A four-year-old's visual cortex receives more information than 170,000 years of reading text, indicating that most learning comes from observation and interaction with the real world.

Language is a compressed form of information, but it is an approximate representation of our percepts and mental models.

Intelligence needs to be grounded in reality, whether physical or simulated, to truly understand and interact with the world.

The complexity of the world is difficult to represent, and current LLMs are not trained to handle the intuitive physics or common sense reasoning required for such understanding.

LLMs are trained using an autoregressive prediction method, which is different from human thought processes that are not strictly tied to language.

There is a debate among philosophers and cognitive scientists about whether intelligence can exist without grounding in reality.

Current LLMs lack the ability to construct a world model and understand the physical world, which is a significant limitation for achieving human-level intelligence.

The training process of LLMs involves predicting missing words in a text, which is a simplistic approach compared to the complexity of the world and its representation.

Attempts to train models to predict video frames have been unsuccessful, highlighting the difficulty of representing high-dimensional continuous spaces.

Joint embedding predictive architecture (JEPA) is a promising approach that involves training a predictor to predict the representation of the full input from the representation of a corrupted version.

Contrastive learning methods have been developed to improve the training of image representations, but they have limitations.

Non-contrastive methods have emerged in recent years, allowing for training without negative samples, which could potentially improve the quality of learned representations.

The failure of self-supervised reconstruction methods for images suggests that simply reconstructing from corrupted data does not lead to good generic features for image recognition tasks.

Supervised learning with labeled data produces better image representations and recognition performance compared to self-supervised reconstruction methods.

The transcript discusses the limitations of current AI models and the potential of new methods like JEPA and non-contrastive learning to advance the field of artificial intelligence.

Transcripts

[00:02] You've had some strong statements, technical statements, about the future of artificial intelligence, throughout your career actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2 and soon 3, and so on. How do they work? Why are they not going to take us all the way?

[00:31] For a number of reasons. The first is that there is a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world; the ability to remember and retrieve things, persistent memory; the ability to reason; and the ability to plan. Those are four essential characteristics of intelligent systems or entities, humans, animals. LLMs can do none of those, or they can only do them in a very primitive way. They don't really understand the physical world, they don't really have persistent memory, they can't really reason, and they certainly can't plan. And so, you know, if you expect the system to become intelligent without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive LLMs are not useful, they're certainly useful, or that they're not interesting, or that we can't build a whole ecosystem of applications around them. Of course we can. But as a path towards human-level intelligence, they're missing essential components.

[01:54] And then there is another tidbit or fact that I think is very interesting. Those LLMs are trained on enormous amounts of text, basically the entirety of all publicly available text on the internet, right? That's typically on the order of 10^13 tokens. Each token is typically two bytes, so that's 2 * 10^13 bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. So it seems like an enormous amount of knowledge that those systems can accumulate. But then you realize it's really not that much data. If you talk to developmental psychologists, they tell you a four-year-old has been awake for 16,000 hours in his or her life, and the amount of information that has reached the visual cortex of that child in four years is about 10^15 bytes. You can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly. So 10^15 bytes for a four-year-old versus 2 * 10^13 bytes for 170,000 years' worth of reading. What it tells you is that through sensory input we see a lot more information than we do through language, and that, despite our intuition, most of what we learn and most of our knowledge is through our observation and interaction with the real world, not through language. Everything that we learn in the first few years of life, and certainly everything that animals learn, has nothing to do with language.

[03:42] So it would be good to maybe push against some of the intuition behind what you're saying. It is true there's several orders of magnitude more data coming into the human mind, much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. Somebody might argue with your comparison between sensory data and language: that language is already very compressed, it already contains a lot more information than the bytes it takes to store it, if you compare it to visual data. So there's a lot of wisdom in language, there's words and the way we stitch them together, it already contains a lot of information. So is it possible that language alone already has enough wisdom and knowledge in there to be able, from that language, to construct a world model, an understanding of the world, an understanding of the physical world that you're saying LLMs lack?

[04:41] So it's a big debate among philosophers and also cognitive scientists, whether intelligence needs to be grounded in reality. I'm clearly in the camp that, yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality, it could be simulated, but the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models, right? I mean, there's a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language. Everything that's physical, mechanical, whatever. When we build something, when we accomplish a task, you know, grabbing something, etc., we plan our action sequences, and we do this by essentially imagining the result of the outcome of a sequence of actions that we might imagine. And that requires mental models that don't have much to do with language. And that's, I would argue, where most of our knowledge is derived from: that interaction with the physical world. So a lot of my colleagues who are more interested in things like computer vision are really in that camp, that AI needs to be embodied, essentially. And then other people, coming from the NLP side, or maybe some other motivation, don't necessarily agree with that. And philosophers are split as well.

[06:24] And the complexity of the world is hard to imagine. It's hard to represent all the complexities that we take completely for granted in the real world, that we don't even imagine require intelligence, right? This is the old Moravec paradox, from the pioneer of robotics, Hans Moravec, who said: how is it that with computers it seems to be easy to do high-level, complex tasks like playing chess and solving integrals and doing things like that, whereas the things we take for granted that we do every day, like, I don't know, learning to drive a car or grabbing an object, we can't do with computers? And, you know, we have LLMs that can pass the bar exam, so they must be smart. But then they can't learn to drive in 20 hours, like any 17-year-old; they can't learn to clear up the dinner table and fill up the dishwasher, like any 10-year-old can learn in one shot. Why is that? What are we missing? What type of learning or reasoning architecture, or whatever, are we missing that basically prevents us from having level-five self-driving cars and domestic robots?

[07:43] Can a large language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time, so it can operate in a space of concepts?

[08:03] So yeah, that's what a lot of people are working on. The short answer is no, and the more complex answer is you can use all kinds of tricks to get an LLM to basically digest visual representations of images, or video, or audio for that matter. And a classical way of doing this is: you train a vision system in some way, and we have a number of ways to train vision systems, either supervised, semi-supervised, self-supervised, all kinds of different ways, that will turn any image into a high-level representation, basically a list of tokens that are really similar to the kind of tokens that a typical LLM takes as an input. And then you just feed that to the LLM, in addition to the text, and you just expect the LLM, during training, to be able to use those representations to help make decisions. People have been working along those lines for quite a long time, and now you see those systems, right? I mean, there are LLMs that have some vision extension. But they're basically hacks, in the sense that those things are not trained end to end to really understand the world. They're not trained with video, for example. They don't really understand intuitive physics, at least not at the moment.

[09:36] So you don't think there's something special, to you, about intuitive physics, about sort of common-sense reasoning about the physical space, about physical reality, that's a giant leap that LLMs are just not able to do?

[09:47] We're not going to be able to do this with the type of LLMs that we are working with today, and there's a number of reasons for this, but the main reason is the way LLMs are trained: you take a piece of text, you remove some of the words in that text, you mask them, you replace them by blank markers, and you train a gigantic neural net to predict the words that are missing. And if you build this neural net in a particular way, so that it can only look at words that are to the left of the one it's trying to predict, then what you have is a system that basically is trying to predict the next word in a text, right? So then you can feed it a text, a prompt, and you can ask it to predict the next word. It can never predict the next word exactly, and so what it's going to do is produce a probability distribution over all the possible words in your dictionary. In fact, it doesn't predict words, it predicts tokens that are kind of sub-word units, and so it's easy to handle the uncertainty in the prediction there, because there's only a finite number of possible words in the dictionary, and you can just compute a distribution over them. Then what the system does is it picks a word from that distribution. Of course, there's a higher chance of picking words that have a higher probability within that distribution, so you sample from that distribution to actually produce a word, and then you shift that word into the input. And so that allows the system now to predict the second word, right? And once you do this, you shift it into the input, etc. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs, but we just call them LLMs.

[11:31] And there is a difference between this kind of process and a process by which, before producing a word, when you talk, when you and I talk, you and I are bilingual, we think about what we're going to say, and it's relatively independent of the language in which we're going to say it. When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking that we're doing and the answer that we're planning to produce is not linked to whether we're going to say it in French or Russian or English.

[12:03] Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction that goes before language and maps onto language?

[12:15] Right. It's certainly true for a lot of thinking that we do.

[12:19] Is that obvious? Like, you're saying your thinking is the same in French as it is in English?

[12:24] Yeah, pretty much.

[12:27] Pretty much? Or is this, like, how flexible are you? Like, if there's a probability distribution...

[12:34] Well, it depends what kind of thinking, right? If it's producing puns, I get much better in French than English about that.

[12:41] No, but so is there an abstract representation of puns? Like, is your humor abstract? Like, when you tweet, and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?

[12:57] There is an abstract representation of imagining the reaction of a reader to that text.

[13:03] You start with laughter and then figure out how to make that happen?

[13:07] Or you figure out a reaction you want to cause and then figure out how to say it, right, so that it causes that reaction. But that's really close to language. But think about a mathematical concept, or imagining something you want to build out of wood, or something like this, right? The kind of thinking you're doing has absolutely nothing to do with language, really. It's not like you necessarily have an internal monologue in any particular language. You're imagining mental models of the thing, right? I mean, if I ask you to imagine what this water bottle will look like if I rotate it 90 degrees, that has nothing to do with language. And so clearly there is a more abstract level of representation in which we do most of our thinking, and we plan what we're going to say, if the output is uttered words as opposed to an output being muscle actions, right? We plan our answer before we produce it. And LLMs don't do that; they just produce one word after the other, instinctively, if you want. It's a bit like subconscious actions, where you're distracted, you're doing something, you're completely concentrated, and someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention. You sort of respond automatically. That's kind of what an LLM does, right? It doesn't think about its answer, really. It retrieves it, because it's accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other without planning the answer.

[14:57] But you're making it sound simplistic. One-token-at-a-time generation is bound to be simplistic, but if the world model is sufficiently sophisticated, then the most likely sequence of tokens it generates one token at a time is going to be a deeply profound thing.

[15:24] Okay, but then that assumes that those systems actually possess a world model. So it really goes to, I think, the fundamental question: can you build a really complete world model? Not complete, but one that has a deep understanding of the world. So, can you build this, first of all, by prediction? And the answer is probably yes. Can you build it by predicting words? And the answer is most probably no, because language is very poor, or weak, or low-bandwidth if you want; there's just not enough information there. So building world models means observing the world and understanding why the world is evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take. So a world model really is: here is my idea of the state of the world at time t, here is an action I might take, what is the predicted state of the world at time t+1? Now, that state of the world does not need to represent everything about the world; it just needs to represent enough that's relevant for this planning of the action, but not necessarily all the details.

[16:53] Now here is the problem. You're not going to be able to do this with generative models. So, a generative model that's trained on video, and we've tried to do this for ten years: you take a video, show a system a piece of video, and then ask it to predict the remainder of the video, basically predict what's going to happen, one frame at a time. It's the same thing as the autoregressive LLMs do, but for video, right? Either one frame at a time or a group of frames at a time. A large video model, if you want. The idea of doing this has been floating around for a long time, and at FAIR some colleagues and I have been trying to do this for about ten years. And you can't really do the same trick as with LLMs because, as I said, you can't predict exactly which word is going to follow a sequence of words, but you can predict a distribution over words. Now, if you go to video, what you would have to do is predict a distribution over all possible frames in a video, and we don't really know how to do that properly. We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful. And therein lies the main issue. And the reason we can't do this is because the world is incredibly more complicated and richer in terms of information than text. Text is discrete; video is high-dimensional and continuous. There's a lot of details in this. So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict that this is a room where there's a light and there is a wall and things like that. It can't predict what the painting on the wall looks like, or what the texture of the couch looks like, and certainly not the texture of the carpet. So there's no way I can predict all those details.

[19:05] So one way, possibly, to handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable. And the latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and that you need to augment the system with for the prediction to do a good job at predicting pixels, including the fine texture of the carpet and the couch and the painting on the wall. That has been a complete failure, essentially. And we've tried lots of things: we tried just straight neural nets, we tried GANs, we tried VAEs, all kinds of regularized autoencoders, we tried many things. We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system. And that also has basically failed. All the systems that attempt to predict missing parts of an image or video from a corrupted version of it, basically: take an image or a video, corrupt it or transform it in some way, then try to reconstruct the complete video or image from the corrupted version, and then hope that internally the system will develop good representations of images that you can use for object recognition, segmentation, whatever. That has been essentially a complete failure. And it works really well for text; that's the principle that is used for LLMs, right?

[20:53] So where is the failure, exactly? Is it that it's very difficult to form a good representation of an image, like a good embedding of all the important information in the image? Is it in terms of the consistency from image to image to image that forms the video? If we do a highlight reel of all the ways you failed, what does that look like?

[21:15] Okay, so the reason this doesn't work is, first of all, I have to tell you exactly what doesn't work, because there is something else that does work. The thing that does not work is training a system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. Okay, that's what doesn't work. And we have a whole slew of techniques for this that are variants of denoising autoencoders, something called MAE, developed by some of my colleagues at FAIR, masked autoencoder. So it's basically like the LLMs, or things like this, where you train the system by corrupting text, except you corrupt images, you remove patches from it, and you train a gigantic neural net to reconstruct. The features you get are not good, and you know they're not good because if you now train the same architecture, but you train it supervised, with labeled data, with textual descriptions of images, etc., you do get good representations, and the performance on recognition tasks is much better than if you do this self-supervised pretraining. So the architecture is good, the architecture of the encoder is good, but the fact that you train the system to reconstruct images does not lead it to learn good generic features of images.

[22:41] When you train in a self-supervised way...

[22:44] Self-supervised by reconstruction.

[22:46] Yeah, by reconstruction. Okay, so what's the alternative?

[22:48] The alternative is joint embedding.

[22:51] What is joint embedding? What are these architectures that you're so excited about?

[22:56] Okay, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image, you take the corrupted or transformed version, you run them both through encoders, which in general are identical but not necessarily, and then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one. Okay? So: joint embedding, because you're taking the full input and the corrupted or transformed version, running them both through encoders, so you get a joint embedding. And then you're saying, can I predict the representation of the full one from the representation of the corrupted one? And I call this a JEPA, so that means joint embedding predictive architecture, because there's this joint embedding and there is this predictor that predicts the representation of the good guy from the bad guy.

[24:00] And the big question is, how do you train something like this? Until five or six years ago, we didn't have particularly good answers for how you train those things, except for one, called contrastive learning. And the idea of contrastive learning is: you take a pair of images that are, again, an image and a corrupted or degraded version somehow, or a transformed version of the original one, and you train their representations to be the same. As I said, if you only do this, the system collapses: it basically completely ignores the input and produces representations that are constant. So the contrastive methods avoid this, and those things have been around since the early '90s, I had a paper on this in 1993: you also show pairs of images that you know are different, and then you push away the representations from each other. So you say, not only should representations of things that we know are the same be the same, or be similar, but representations of things that we know are different should be different. And that prevents the collapse, but it has some limitations. And there's a whole bunch of techniques that have appeared over the last six or seven years that can revive this type of method, some of them from FAIR, some of them from Google and other places. But there are limitations to those contrastive methods.

[25:31] What has changed in the last three or four years is that now we have methods that are non-contrastive, so they don't require those negative contrastive samples of images that we know are different. You train them only with images that are different versions, or different views, of the same thing, and you rely on some other tricks to prevent the system from collapsing. And we have half a dozen different methods for this now.


Related Tags
AI_Limitations, LLMs, Superhuman_Intellect, Physical_Reality, Cognitive_Science, Embodied_AI, Language_Models, Predictive_Modeling, Contrastive_Learning, AI_Ethics, Tech_Innovation