Can LLMs reason? | Yann LeCun and Lex Fridman

Lex Clips
13 Mar 2024 · 17:54

Summary

TL;DR: The transcript discusses the limitations of large language models (LLMs) in reasoning, stemming from the constant amount of computation they spend per token produced. It suggests that future dialogue systems will require a more sophisticated approach, involving planning and optimization before generating a response. The conversation touches on the potential for systems to build upon a foundational world model, using processes akin to probabilistic models to infer latent variables. This could lead to more efficient and deeper reasoning capabilities, moving beyond the current auto-regressive prediction of tokens.

Takeaways

  • 🧠 The reasoning in large language models (LLMs) is considered primitive due to the constant amount of computation spent per token produced.
  • 🔄 The computation does not adjust based on the complexity of the question, whether it's simple, complicated, or impossible to answer.
  • 🚀 Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, unlike the constant computation model of LLMs.
  • 🌐 The future of dialogue systems may involve building upon a well-constructed world model with mechanisms like persistent long-term memory and reasoning.
  • 🛠️ There's a need for systems that can plan and reason, devoting more resources to complex problems, moving beyond auto-regressive prediction of tokens.
  • 🎯 The concept of an energy-based model is introduced, where the model output is a scalar number representing the 'goodness' of an answer for a given prompt.
  • 📈 Optimization processes are key in future dialog systems, with the system planning and optimizing the answer before converting it into text.
  • 🌟 The optimization process involves abstract representation and is more efficient than generating numerous sequences and selecting the best ones.
  • 🔄 The training of an energy-based model involves showing it compatible pairs of inputs and outputs, using methods like contrastive training and regularizers.
  • 🔒 The energy function is trained to have low energy for compatible XY pairs and higher energy elsewhere, ensuring the model can distinguish between good and bad answers.
  • 📚 The transcript discusses the indirect nature of training LLMs, where high probability for one word results in low probability for others, and how this could be adapted for more complex reasoning tasks.

Q & A

  • What is the main limitation of the reasoning process in large language models (LLMs)?

    -The main limitation is that the amount of computation spent per token produced is constant, meaning that the system does not adjust the computational resources based on the complexity of the question or problem at hand.

  • How does human reasoning differ from the reasoning process in LLMs?

    -Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, while LLMs allocate a fixed amount of computation regardless of the question's complexity.

  • What is the significance of a persistent long-term memory in dialogue systems?

    -A persistent long-term memory allows dialogue systems to build upon previous information and context, leading to more coherent and informed responses in a conversation.

  • How does the concept of 'system one' and 'system two' in psychology relate to LLMs?

    -System one corresponds to tasks that can be done without conscious thought, similar to how LLMs operate on instinctive language patterns. System two involves deliberate planning and thinking, which is something LLMs currently lack but could potentially develop.

  • What is the proposed blueprint for future dialogue systems?

    -The proposed blueprint involves a system that thinks about and plans its answer through optimization before converting it into text, moving away from the auto-regressive prediction of tokens.

  • How does the energy-based model work in the context of dialogue systems?

    -The energy-based model is a function that outputs a scalar number indicating how good an answer is for a given prompt. The system searches for an answer that minimizes this number, representing a good response.

  • What is the difference between contrastive and non-contrastive methods in training an energy-based model?

    -Contrastive methods train the model by showing it pairs of compatible and incompatible inputs and outputs, adjusting the weights to increase the energy for incompatible pairs. Non-contrastive methods, on the other hand, use a regularizer to ensure that the energy is higher for incompatible pairs by minimizing the volume of space that can take low energy.

  • How does the concept of latent variables play a role in the optimization process of dialogue systems?

    -Latent variables, or Z in the context of the script, represent an abstract form of a good answer that the system can manipulate to minimize the output energy. This allows for optimization in an abstract representation space rather than directly in text.

  • What is the main inefficiency in how current auto-regressive language models search for an answer?

    -The main inefficiency is that they effectively generate a large number of hypothesis sequences and then select the best ones, which is computationally wasteful compared to optimizing in continuous, differentiable spaces (a small sketch contrasting the two approaches follows this Q&A list).

  • How does the energy function ensure that a good answer has low energy and a bad answer has high energy?

    -The energy function is trained to produce low energy for pairs of inputs and outputs (X and Y) that are compatible, based on the training set. A regularizer in the cost function ensures that the energy is higher for incompatible pairs, effectively pushing the energy function down in regions of compatible XY pairs and up elsewhere.

  • How is the concept of energy-based models applied in visual data processing?

    -In visual data processing, the energy of the system is represented by the prediction error of the representation when comparing a corrupted version of an image or video to the actual, uncorrupted version. A low energy indicates a good match, while a high energy indicates significant differences.
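Following up on the inefficiency answer above, here is a toy sketch, assuming PyTorch; the energy network, dimensions, and step counts are placeholders rather than anything from the interview. It contrasts the wasteful "generate many candidates and pick the best" search with gradient-based refinement of a single candidate in a continuous, differentiable space.

```python
import torch
import torch.nn as nn

D = 32
energy = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, 1))  # toy scorer over answer representations

def best_of_n(n: int = 512) -> float:
    # Wasteful route: sample many candidate answers, score them all, keep the best.
    candidates = torch.randn(n, D)
    return energy(candidates).min().item()

def gradient_refined(steps: int = 50, lr: float = 0.1) -> float:
    # Efficient route: start from one candidate and refine it by gradient descent
    # on the (differentiable) energy instead of enumerating candidates.
    z = torch.randn(1, D, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        e = energy(z).sum()
        e.backward()
        opt.step()
    return energy(z).item()

print("best-of-N energy:", best_of_n())
print("gradient-refined energy:", gradient_refined())
```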

Outlines

00:00

🤖 Primitive Reasoning in LLMs

The paragraph discusses the limitations of reasoning in large language models (LLMs) due to the constant amount of computation spent per token produced. It highlights that the system does not adjust the computational effort based on the complexity of the question, leading to a fundamental flaw in the way LLMs approach problem-solving. The speaker suggests that future improvements could involve building upon a well-constructed world model and incorporating mechanisms like persistent long-term memory and hierarchical reasoning, which are more akin to human thought processes.
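To make the "constant computation per token" point concrete, here is a back-of-the-envelope sketch in Python. The model shape is hypothetical and the FLOPs-per-parameter rule of thumb is an approximation I am assuming, not an exact accounting from the conversation.

```python
# Rough sketch: compute spent on an answer scales with the number of tokens
# produced, not with how hard the question is.

def forward_flops_per_token(n_layers: int, d_model: int) -> float:
    # Assumed rule of thumb: ~12 * n_layers * d_model^2 parameters in the blocks,
    # and ~2 FLOPs per parameter per token.
    params = 12 * n_layers * d_model ** 2
    return 2.0 * params

def answer_flops(n_layers: int, d_model: int, answer_tokens: int) -> float:
    # The difficulty of the question never enters this formula.
    return forward_flops_per_token(n_layers, d_model) * answer_tokens

easy = answer_flops(n_layers=36, d_model=4096, answer_tokens=50)
hard = answer_flops(n_layers=36, d_model=4096, answer_tokens=50)
print(easy == hard)  # True: same token budget, same compute, regardless of difficulty
```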

05:00

🌟 The Future of Dialogue Systems

This section envisions the future of dialogue systems, emphasizing the need for systems that can plan and optimize their answers before producing them. The speaker introduces the concept of an energy-based model that evaluates the quality of an answer to a prompt, suggesting that future systems will operate in an abstract representation space rather than just generating text. The goal is to create a system that can perform iterative optimization and hierarchical reasoning, which is currently beyond the capabilities of auto-regressive LLMs.
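A minimal sketch of the kind of pipeline described here, assuming PyTorch and toy module sizes: the prompt is encoded, an initial abstract answer representation is predicted, that representation is refined by gradient steps against a scalar energy, and only then is it decoded into tokens. All module names and dimensions are illustrative assumptions, not the speaker's actual architecture.

```python
import torch
import torch.nn as nn

D = 64  # toy representation width (assumption)

encoder   = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))      # prompt -> representation
predictor = nn.Linear(D, D)                                                 # initial guess of the answer representation
energy    = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, 1))  # scalar "goodness" score
decoder   = nn.Linear(D, 1000)                                              # answer representation -> token logits (stub)

def respond(prompt_features: torch.Tensor, steps: int = 20, lr: float = 0.1) -> torch.Tensor:
    x = encoder(prompt_features)                  # abstract representation of the prompt
    z = predictor(x).detach().requires_grad_()    # initial abstract answer, to be refined
    for _ in range(steps):
        e = energy(torch.cat([x, z], dim=-1)).sum()    # low energy = good answer for this prompt
        (grad,) = torch.autograd.grad(e, z)
        z = (z - lr * grad).detach().requires_grad_()  # gradient-based inference, not training
    return decoder(z)                             # only now turn the refined "thought" into token logits

logits = respond(torch.randn(1, D))
print(logits.shape)
```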

10:03

🧠 Training Energy-Based Models

The paragraph delves into the conceptual framework of training energy-based models, which are designed to output a scalar value indicating the compatibility of a proposed answer with a given prompt. The speaker explains that these models are trained by showing them pairs of compatible inputs and outputs, and the system learns to minimize the output value. The process involves ensuring that the energy is higher for incompatible pairs, which can be achieved through contrastive methods or non-contrastive regularization techniques. The discussion also touches on the importance of abstract representations and the potential for these models to perform reasoning tasks more efficiently than current LLMs.
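A hedged sketch of the contrastive training route mentioned here, assuming PyTorch: pull the energy of compatible pairs toward zero and push mismatched pairs up with a hinge-style term. The specific loss form and data are illustrative choices, not the exact method discussed; the non-contrastive alternative is only noted in a comment because its regularizer depends on the architecture.

```python
import torch
import torch.nn as nn

D = 32
energy_net = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, 1))

def energy(x, y):
    # Scalar compatibility score: ~0 for compatible (x, y), larger otherwise.
    return energy_net(torch.cat([x, y], dim=-1))

def contrastive_loss(x, y_good, y_bad, margin: float = 1.0):
    # Pull compatible pairs toward zero energy and push mismatched pairs
    # to be at least `margin` higher (a hinge-style contrastive objective).
    pull_down = energy(x, y_good).pow(2).mean()
    push_up = torch.relu(margin - energy(x, y_bad)).pow(2).mean()
    return pull_down + push_up

# The non-contrastive route discussed in the clip would drop the y_bad term and
# instead add a regularizer limiting how much of (x, y) space can take low energy;
# that choice is architecture-dependent and is not shown here.

x = torch.randn(16, D)
y_good = x + 0.1 * torch.randn(16, D)           # pretend compatible pairs
y_bad = torch.randn(16, D)                      # pretend incompatible pairs
opt = torch.optim.Adam(energy_net.parameters(), lr=1e-3)
loss = contrastive_loss(x, y_good, y_bad)
opt.zero_grad(); loss.backward(); opt.step()    # one training step on the energy function
print(loss.item())
```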

15:06

📈 Visual Data and Energy Functions

This paragraph explores the application of energy functions in the context of visual data, contrasting it with language-based systems. The speaker describes how energy-based models can be used to assess the quality of visual representations by comparing a corrupted image with its uncorrupted version, using the prediction error as the energy measure. The process is highlighted as a way to achieve a compressed and efficient representation of visual reality, which has been successfully applied in classification systems.
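A minimal, assumption-laden sketch of the prediction-error energy described for visual data, in PyTorch: encode the corrupted view and the clean view, predict the clean representation from the corrupted one, and use the prediction error as the energy. The encoders, the masking scheme, and the dimensions are placeholders, not the actual JEPA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 128
context_encoder = nn.Sequential(nn.Linear(784, D), nn.ReLU(), nn.Linear(D, D))  # sees the corrupted view
target_encoder  = nn.Sequential(nn.Linear(784, D), nn.ReLU(), nn.Linear(D, D))  # sees the clean view
predictor       = nn.Linear(D, D)

def visual_energy(clean: torch.Tensor, corrupted: torch.Tensor) -> torch.Tensor:
    # Energy = error of predicting the clean representation from the corrupted one.
    predicted = predictor(context_encoder(corrupted))
    with torch.no_grad():                       # targets are typically not back-propagated through
        target = target_encoder(clean)
    return F.mse_loss(predicted, target)        # low if `corrupted` is a version of `clean`

img = torch.rand(4, 784)                                  # toy flattened images (assumption)
masked = img * (torch.rand_like(img) > 0.5).float()       # crude masking as the corruption
print(visual_energy(img, masked).item())                  # trained to be low for matched pairs
print(visual_energy(img, torch.rand(4, 784)).item())      # unrelated input: ideally higher after training
```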

Keywords

💡Computation

In the context of the video, computation refers to the process of performing mathematical calculations or reasoning tasks by a machine, specifically an AI language model (LLM). The amount of computation spent per token produced by the LLM is constant, which means that regardless of the complexity of the question, the system allocates the same amount of computational resources to generate an answer. This is contrasted with human reasoning, where more complex problems typically receive more computational effort.

💡Token

A token in this context is a basic unit of text, such as a word or a character, that the AI language model processes when generating responses. The number of tokens produced in an answer determines the amount of computation the system will devote to that answer. However, the script critiques this approach, suggesting that human reasoning does not allocate resources in such a uniform manner but instead varies effort based on the complexity of the problem.

💡Reasoning

Reasoning in the video script refers to the cognitive process of forming conclusions, making judgments, or solving problems based on available information. The speaker contrasts the 'primitive' reasoning of LLMs, which is based on a fixed computation per token, with human reasoning, which is adaptive and iterative, involving more time and effort for complex problems. The script suggests that future AI systems should incorporate more sophisticated reasoning abilities, akin to human thought processes.

💡Auto-regressive LLMs

Auto-regressive LLMs (large language models) predict the next item in a sequence based on the previous items. In the context of the video, this refers to the way current LLMs operate, generating text one token at a time based on the probability of each token following the previous ones. The speaker suggests that future dialogue systems will differ significantly from this auto-regressive approach, instead incorporating more complex reasoning and planning mechanisms.
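For reference, a toy sketch of the auto-regressive loop being critiqued, assuming PyTorch and a hypothetical `next_token_logits` network: each emitted token costs one fixed forward pass, which is exactly why the compute per token is constant.

```python
import torch
import torch.nn as nn

VOCAB, D = 1000, 64
embed = nn.Embedding(VOCAB, D)
next_token_logits = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, VOCAB))  # stand-in predictor

def generate(prompt_ids: list[int], max_new_tokens: int = 20) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # One fixed-cost forward pass per emitted token, whatever the question is.
        context = embed(torch.tensor(ids)).mean(dim=0)           # crude context summary (assumption)
        probs = torch.softmax(next_token_logits(context), dim=-1)
        ids.append(int(torch.multinomial(probs, 1)))
    return ids

print(generate([1, 2, 3]))
```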

💡World Model

A world model, as used in the video, refers to an internal representation or understanding that an AI system has of the world, which it uses to make predictions and decisions. The speaker suggests that building on top of a well-constructed world model could allow for the development of AI systems with improved reasoning and planning capabilities, moving beyond the limitations of current LLMs.

💡Latent Variables

Latent variables are factors or variables that are not directly observed or measured but are inferred from other variables. In the context of the video, latent variables could represent abstract concepts or ideas that the AI system uses to form an internal representation or thought process before generating a response. The speaker suggests using inference of latent variables as a method for AI to engage in more complex reasoning and planning.

💡Optimization

Optimization in the video refers to the process of finding the best possible solution or outcome from a set of potential options. The speaker describes an AI system that uses optimization to refine its abstract representations of thoughts before generating a response, akin to how humans might plan an answer before speaking. This process involves minimizing an objective function, which measures the quality of the answer.

💡Energy-based Model

An energy-based model, as described in the video, is a type of machine learning model that assigns a scalar value (energy) to input data, indicating the compatibility or goodness of fit between the observed data and the proposed answer or continuation. The model is trained to produce low energy values for correct or compatible inputs and high energy values for incorrect or incompatible ones. This concept is used to illustrate a potential future direction for AI systems in terms of reasoning and planning.

💡Gradient Descent

Gradient descent is an optimization algorithm used in machine learning to minimize a function by iteratively adjusting the parameters of the model in the direction that minimizes the function's value. In the context of the video, gradient descent would be used to refine the abstract representation of an answer, moving towards a state that minimizes the output of the energy-based model, thus optimizing the quality of the AI's response.

💡Inference

Inference in the video refers to the process of drawing conclusions or making predictions based on available data or evidence. The speaker contrasts inference with training, where training involves adjusting the model's parameters to fit the training data, while inference involves using the trained model to generate new outputs or predictions. The video suggests that future AI systems will use a form of inference that involves optimizing abstract representations before generating text.
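The training/inference split described here can be made concrete with a small, assumption-heavy PyTorch sketch: a training step adjusts the energy network's parameters on known-good pairs, while an inference step leaves the parameters untouched and adjusts only the answer representation.

```python
import torch
import torch.nn as nn

D = 32
energy = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, 1))
param_opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

def training_step(x, y_good):
    # Training: move the *parameters* so compatible pairs get energy near zero.
    loss = energy(torch.cat([x, y_good], dim=-1)).pow(2).mean()
    param_opt.zero_grad(); loss.backward(); param_opt.step()
    return loss.item()

def inference(x, steps: int = 30, lr: float = 0.1):
    # Inference: parameters stay fixed; move the *answer representation* instead.
    z = torch.zeros(x.shape[0], D, requires_grad=True)
    for _ in range(steps):
        e = energy(torch.cat([x, z], dim=-1)).sum()
        (grad,) = torch.autograd.grad(e, z)     # gradient w.r.t. z only
        z = (z - lr * grad).detach().requires_grad_()
    return z.detach()

x, y = torch.randn(8, D), torch.randn(8, D)
print(training_step(x, y))
print(inference(x).shape)
```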

💡Conceptual Training

Conceptual training, as mentioned in the video, involves teaching an AI system to understand and work with abstract concepts rather than just concrete sensory information or specific instances. This type of training is crucial for developing AI systems that can reason and plan like humans, by working with representations of ideas and optimizing these representations to produce high-quality responses.

Highlights

The reasoning in large language models (LLMs) is considered primitive due to the constant amount of computation spent per token produced.

The computation devoted to computing an answer is proportional to the number of tokens produced in the answer, regardless of the question's complexity.

Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, unlike the constant computation model of LLMs.

The future of dialogue systems may involve planning and optimizing answers before expressing them in text, moving away from auto-regressive LLMs.

The concept of system one and system two in humans is introduced as an analogy for the different levels of cognitive tasks and reasoning.

Experienced individuals can perform system one tasks subconsciously, while system two tasks require deliberate planning and thought.

LLMs currently lack the ability to perform system two tasks, which involve internal world modeling and deliberate planning.

The future blueprint of dialogue systems may involve persistent long-term memory and reasoning mechanisms built on top of a well-constructed world model.

The idea of a mental model that allows planning of responses before expressing them is crucial for advanced dialogue systems.

The optimization process for dialogue systems involves abstract representation and searching for an answer that minimizes a cost function.

The concept of an energy-based model is introduced, where the model outputs a scalar number to measure the quality of an answer.

The future of dialogue systems may involve differentiable systems that allow for gradient-based inference and optimization in continuous spaces.

The training of an energy-based model involves showing it pairs of compatible inputs and outputs, and adjusting the neural network to produce low energy for correct answers.

Contrastive methods are used to train energy-based models by presenting both good and bad examples and adjusting the system to produce higher energy for incorrect answers.

Non-contrastive methods ensure higher energy for incompatible pairs by minimizing the volume of space that can take low energy.

The concept of latent variables and abstract representations is crucial for optimizing and planning complex answers in future dialogue systems.

The indirect method of training LLMs through probability distribution over tokens results in a basic level of reasoning but lacks the depth of system two tasks.

The potential for visual data applications of energy-based models is discussed, where the energy represents the prediction error of a representation.

The energy-based model approach aims to provide a compressed representation of visual reality, which has proven effective in classification tasks.

Transcripts

00:03

The type of reasoning that takes place in LLMs is very, very primitive, and the reason you can tell it's primitive is that the amount of computation spent per token produced is constant. So if you ask a question and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it. So essentially it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something: the amount of computation the system will be able to devote to the answer is constant, or rather proportional to the number of tokens produced in the answer. This is not the way we work. The way we reason is that when we're faced with a complex problem or a complex question, we spend more time trying to solve it and answer it, because it's more difficult.

01:15

There's a prediction element, there's an iterative element where you're adjusting your understanding of a thing by going over it over and over, there's a hierarchical element, and so on. Does this mean it's a fundamental flaw of LLMs, or does it mean there's more to that question? Now you're just behaving like an LLM, answering immediately. No, it's just the low-level world model on top of which we can then build some of these kinds of mechanisms, like you said, persistent long-term memory, or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model.

02:09

Okay, whether it's difficult or not, the near future will tell, because a lot of people are working on reasoning and planning abilities for dialogue systems. Even if we restrict ourselves to language, just having the ability to plan your answer before you answer, in terms that are not necessarily linked to the language you're going to use to produce the answer, this idea of a mental model that allows you to plan what you're going to say before you say it, that is very important. I think there are going to be a lot of systems over the next few years that have this capability, but the blueprint of those systems will be extremely different from auto-regressive LLMs.

02:57

It's the same difference as the difference between what psychologists call System 1 and System 2 in humans. System 1 is the type of task you can accomplish without deliberately, consciously thinking about how you do it; you just do it. You've done it enough that you can do it subconsciously, without thinking about it. If you're an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio. If you're a very experienced chess player, you can play against an inexperienced chess player without really thinking either; you just recognize the pattern and you play. That's System 1. It covers all the things you do instinctively without really having to deliberately plan and think about them. And then there are all the tasks where you need to plan. If you are a not-too-experienced chess player, or you are experienced but playing against another experienced chess player, you think about all kinds of options. You think about it for a while, and you're much better if you have time to think about it than if you play blitz with limited time. So this type of deliberate planning, which uses your internal world model, that's System 2. This is what LLMs currently cannot do.

04:18

So how do we get them to do this? How do we build a system that can do this kind of planning, or reasoning, that devotes more resources to complex problems than to simple problems? It's not going to be auto-regressive prediction of tokens; it's going to be something more akin to inference of latent variables in what used to be called probabilistic models or graphical models. Basically, the principle is this: the prompt is like observed variables, and what the model does is measure to what extent an answer is a good answer for a prompt. So think of it as some gigantic neural net, but with only one output, and that output is a scalar number which is, let's say, zero if the answer is a good answer for the question and a large number if the answer is not a good answer for the question. Imagine you had this model. If you had such a model, you could use it to produce good answers. The way you would do it is: produce the prompt, and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model.

05:42

But that energy-based model would need the model constructed by the LLM. Well, really what you would need to do is not search over possible strings of text that minimize that energy, but do this in abstract representation space. So in the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model, which is just a scalar. It's an optimization process. So now the way the system produces its answer is through optimization, by minimizing an objective function, basically. And we're talking about inference here, not training; the system has already been trained. So now we have an abstract representation of the thought, of the answer. We feed that to what is basically an auto-regressive decoder, which can be very simple, that turns this into text expressing the thought. That, in my opinion, is the blueprint of future dialogue systems: they will think about their answer and plan their answer by optimization before turning it into text. And that is Turing complete.

07:03

Can you explain exactly what the optimization problem is there? What's the objective function? Just linger on it. You kind of briefly described it, but over what space are you optimizing? The space of representations? Yes, abstract representations. So you have an abstract representation inside the system. You have a prompt. The prompt goes through an encoder and produces a representation, which perhaps goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a good answer, because there might be some complicated reasoning you need to do. So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. Let's ignore for a moment the issue of how you train that system to measure whether an answer is a good answer, and suppose such a system could be created.

08:07

But what's the process, this kind of search-like process? It's an optimization process. You can do this if the entire system is differentiable. The scalar output is the result of running the representation of the answer through some neural net, so by back-propagating gradients you can figure out how to modify the representation of the answer so as to minimize that output. That's still gradient based; it's gradient-based inference. So now you have a representation of the answer in abstract space, and now you can turn it into text. And the cool thing about this is that the representation can be optimized through gradient descent, but it is also independent of the language in which you're going to express the answer. You're operating in the abstract representation. This goes back to the joint embedding idea that it is better to work in, to romanticize the notion, the space of concepts, versus the space of concrete sensory information.

09:18

Okay, but can this do something like reasoning, which is what we're talking about? Well, not really, only in a very simple way. Basically, you can think of LLMs as doing the kind of optimization I was talking about, except they optimize in discrete space, which is the space of possible sequences of tokens, and they do this optimization in a horribly inefficient way: generate a lot of hypotheses and then select the best ones. That's incredibly wasteful in terms of computation, because you basically have to run your LLM for every possible generated sequence. So it's much better to do an optimization in continuous space, where you can do gradient descent: instead of generating tons of things and then selecting the best, you just iteratively refine your answer to go towards the best one. That's much more efficient, but you can only do this in continuous spaces with differentiable functions.

10:19

You're talking about the ability to think deeply, to reason deeply. How do you know what is a better or worse answer based on deep reasoning? Right, so then we're asking, conceptually, how do you train an energy-based model. An energy-based model is a function with a scalar output, just a number. You give it two inputs, X and Y, and it tells you whether Y is compatible with X or not. X is what you observe: let's say a prompt, an image, a video, whatever. And Y is a proposal for an answer, a continuation of the video, whatever. And the way it tells you that Y is compatible with X is that the output of the function will be zero if Y is compatible with X, and a positive, non-zero number if Y is not compatible with X.

11:19

How do you train a system like this? At a completely general level, you show it pairs of X and Y that are compatible, for example a question and the corresponding answer, and you train the parameters of the big neural net inside to produce zero. Now, that doesn't completely work, because the system might decide, well, I'm just going to say zero for everything. So you have to have a process to make sure that for a wrong Y, the energy is larger than zero, and there you have two options. One is contrastive methods: you show an X and a bad Y, and you tell the system to give a high energy to this, to push up the energy, to change the weights in the neural net that computes the energy so that it goes up. The problem with this is that if the space of Y is large, the number of such contrastive samples you're going to have to show is gigantic. But people do this. They do it when you train a system with RLHF: basically what you're training is what's called a reward model, which is an objective function that tells you whether an answer is good or bad, and that's basically exactly what this is. So we already do this to some extent; we're just not using it for inference, only for training.

12:45

There is another set of methods which are non-contrastive, and I prefer those. Those non-contrastive methods basically say: okay, the energy function needs to have low energy on pairs of X and Y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? The way you do this is by having a regularizer, a criterion, a term in your cost function that minimizes the volume of space that can take low energy. There are all kinds of specific ways to do this, depending on the architecture, but that's the basic principle: if you push down the energy function for particular regions in the X-Y space, it will automatically go up in other places, because there is only a limited volume of space that can take low energy, by the construction of the system or by the regularizing function.

13:48

We've been talking very generally, but what is a good X and a good Y? What is a good representation of X and Y? Because we've been talking about language, and if you just take language directly, that presumably is not good, so there has to be some kind of abstract representation of ideas. Yeah, you can do this with language directly, by saying X is a text and Y is the continuation of that text, or X is a question and Y is the answer. But you're saying that's not going to cut it; that's going to do what LLMs are doing. Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside the system there is a latent variable, call it Z, that you can manipulate so as to minimize the output energy, then that Z can be viewed as a representation of a good answer that you can translate into a Y that is a good answer.

14:54

So this kind of system could be trained in a very similar way, but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. Currently this is very implicit in LLMs. It's done in a way that people don't realize is being done, but it is being done. It is due to the fact that when you give a high probability to a word, you automatically give low probability to other words, because you only have a finite amount of probability to go around; it has to sum to one. So when you minimize the cross-entropy, when you train your LLM to predict the next word, you're increasing the probability the system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Indirectly, that gives high probability to sequences of words that are good and low probability to sequences of words that are bad, but it's very indirect, and it's not obvious why this actually works at all, because you're not doing it on the joint probability of all the symbols in a sequence; you're factorizing that probability in terms of conditional probabilities over successive tokens.

16:13

So how do you do this for visual data? We've been doing this with JEPA architectures, basically joint embedding predictive architectures. There, the compatibility between two things is: here's an image or a video, and here's a corrupted, shifted, or transformed version of that image or video, or a masked version. The energy of the system is the prediction error of the representation: the predicted representation of the good thing versus the actual representation of the good thing. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. This system will tell you, if this is a good image and this is a corrupted version, it will give you zero energy if one of them is effectively a corrupted version of the other, and a high energy if the two images are completely different. And hopefully that whole process gives you a really nice compressed representation of visual reality, and we know it does, because we then use those representations as input to a classification system, and that classification system works really nicely.


Related Tags
Artificial IntelligenceReasoning SystemsComputational ModelsDeep LearningLanguage ModelsOptimization TechniquesAbstract RepresentationDialog SystemsNeural NetworksInference Processes