Can LLMs reason? | Yann LeCun and Lex Fridman
Summary
TL;DR: The transcript discusses the limitations of large language models (LLMs) in reasoning, given the constant computation spent per token produced. It suggests that future dialogue systems will require a more sophisticated approach, involving planning and optimization before generating a response. The conversation touches on the potential for systems to build upon a foundational world model, using processes akin to probabilistic models to infer latent variables. This could lead to more efficient and deeper reasoning capabilities, moving beyond current auto-regressive prediction methods.
Takeaways
- The reasoning in large language models (LLMs) is considered primitive because a constant amount of computation is spent per token produced.
- The computation does not adjust based on the complexity of the question, whether it's simple, complicated, or impossible to answer.
- Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, unlike the constant-computation model of LLMs.
- The future of dialogue systems may involve building upon a well-constructed world model with mechanisms like persistent long-term memory and reasoning.
- There's a need for systems that can plan and reason, devoting more resources to complex problems, moving beyond auto-regressive prediction of tokens.
- The concept of an energy-based model is introduced, where the model's output is a scalar number representing the "goodness" of an answer for a given prompt.
- Optimization processes are key in future dialogue systems, with the system planning and optimizing the answer before converting it into text.
- The optimization happens over an abstract representation and is more efficient than generating numerous sequences and selecting the best ones.
- Training an energy-based model involves showing it compatible pairs of inputs and outputs, using methods like contrastive training and regularizers.
- The energy function is trained to have low energy for compatible X-Y pairs and higher energy elsewhere, ensuring the model can distinguish good from bad answers.
- The training of LLMs is indirect: assigning high probability to one word assigns low probability to the others, and this mechanism could be adapted for more complex reasoning tasks.
Q & A
What is the main limitation of the reasoning process in large language models (LLMs)?
-The main limitation is that the amount of computation spent per token produced is constant, meaning that the system does not adjust the computational resources based on the complexity of the question or problem at hand.
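This fixed-cost property can be made concrete with a back-of-envelope compute estimate. The per-layer flop formula and the layer/width numbers below are rough illustrative assumptions, not figures from the conversation:

```python
# Rough sketch: per-token forward compute in a decoder-only transformer
# is approximately fixed, so total compute scales only with the number
# of tokens in the answer, not with the question's difficulty.
# "12 * d_model^2 per layer" is a common rough approximation.
def flops_per_token(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2

def answer_flops(n_layers: int, d_model: int, n_tokens: int) -> int:
    # Total compute is proportional to answer length, whatever the question.
    return flops_per_token(n_layers, d_model) * n_tokens

# A 36-layer model spends exactly twice the compute on a 20-token
# answer as on a 10-token answer, regardless of difficulty.
short = answer_flops(36, 4096, 10)
long = answer_flops(36, 4096, 20)
```

Under this model, a trivially simple question and an undecidable one with equal-length answers receive identical compute budgets, which is the flaw being described.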
How does human reasoning differ from the reasoning process in LLMs?
-Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, while LLMs allocate a fixed amount of computation regardless of the question's complexity.
What is the significance of a persistent long-term memory in dialogue systems?
-A persistent long-term memory allows dialogue systems to build upon previous information and context, leading to more coherent and informed responses in a conversation.
How does the concept of 'system one' and 'system two' in psychology relate to LLMs?
-System one corresponds to tasks that can be done without conscious thought, similar to how LLMs operate on instinctive language patterns. System two involves deliberate planning and thinking, which is something LLMs currently lack but could potentially develop.
What is the proposed blueprint for future dialogue systems?
-The proposed blueprint involves a system that thinks about and plans its answer through optimization before converting it into text, moving away from the auto-regressive prediction of tokens.
How does the energy-based model work in the context of dialogue systems?
-The energy-based model is a function that outputs a scalar number indicating how good an answer is for a given prompt. The system searches for an answer that minimizes this number, representing a good response.
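A minimal sketch of inference with such a model, assuming a toy scalar energy (the `tanh` "predictor" and the tiny candidate set are stand-ins for illustration, not the actual architecture discussed):

```python
import math

# Toy scalar energy: near zero when the answer y matches a target
# derived from the prompt x, larger otherwise. A real system would
# use a large trained neural net here.
def energy(x, y):
    target = [math.tanh(v) for v in x]   # stand-in for a learned predictor
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target))

def best_answer(x, candidates):
    # Inference = search the space of possible answers for the one
    # that minimizes the scalar energy.
    return min(candidates, key=lambda y: energy(x, y))

x = [0.5, -1.0]
candidates = [[0.0, 0.0], [math.tanh(0.5), math.tanh(-1.0)], [1.0, 1.0]]
best = best_answer(x, candidates)        # the zero-energy candidate wins
```

In the full proposal the search would run over abstract representations rather than an explicit candidate list, as the following questions elaborate.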
What is the difference between contrastive and non-contrastive methods in training an energy-based model?
-Contrastive methods train the model by showing it pairs of compatible and incompatible inputs and outputs, adjusting the weights to increase the energy for incompatible pairs. Non-contrastive methods, on the other hand, use a regularizer to ensure that the energy is higher for incompatible pairs by minimizing the volume of space that can take low energy.
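The contrastive side of this can be sketched with a one-parameter toy energy; the margin value, learning rate, and quadratic form are illustrative assumptions:

```python
# Contrastive update for a toy energy E(x, y) = (w*x - y)^2: push the
# energy of a compatible pair down, and push the energy of an
# incompatible pair up while it is still inside a margin.
def energy(w, x, y):
    return (w * x - y) ** 2

def contrastive_step(w, x, y_good, y_bad, lr=0.05, margin=1.0):
    grad = 2 * (w * x - y_good) * x          # descend on the good pair
    if energy(w, x, y_bad) < margin:
        grad -= 2 * (w * x - y_bad) * x      # ascend on the bad pair
    return w - lr * grad

w = 0.0
for _ in range(200):
    w = contrastive_step(w, x=1.0, y_good=2.0, y_bad=-2.0)
# After training: low energy on the good pair, high on the bad one.
```

The scaling problem mentioned in the answer shows up even here: each bad `y` needs its own push-up, so a large answer space needs a gigantic number of contrastive samples.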
How does the concept of latent variables play a role in the optimization process of dialogue systems?
-Latent variables, or Z in the context of the script, represent an abstract form of a good answer that the system can manipulate to minimize the output energy. This allows for optimization in an abstract representation space rather than directly in text.
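The "manipulate Z to minimize the energy" idea is gradient-based inference. A minimal sketch, assuming a quadratic toy energy with an analytic gradient (a real system would backpropagate through a neural net):

```python
# Gradient-based inference: treat the latent z as free variables and
# descend the energy gradient, instead of sampling discrete text.
def energy(z, target):
    return sum((zi - ti) ** 2 for zi, ti in zip(z, target))

def infer_latent(target, steps=100, lr=0.1):
    z = [0.0] * len(target)                  # initial guess for z
    for _ in range(steps):
        grad = [2 * (zi - ti) for zi, ti in zip(z, target)]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]   # gradient step
    return z

# "target" plays the role of the abstract representation of a good
# answer; z converges to it by iterative refinement.
z = infer_latent([1.0, -0.5])
```

Once `z` has converged, a simple decoder would turn it into text, which is why the representation is independent of the output language.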
What is the main inefficiency in how current auto-regressive language models search for good answers?
-The main inefficiency is that it involves generating a large number of hypothesis sequences and then selecting the best ones, which is computationally wasteful compared to optimizing in continuous, differentiable spaces.
How does the energy function ensure that a good answer has low energy and a bad answer has high energy?
-The energy function is trained to produce low energy for pairs of inputs and outputs (X and Y) that are compatible, based on the training set. A regularizer in the cost function ensures that the energy is higher for incompatible pairs, effectively pushing the energy function down in regions of compatible XY pairs and up elsewhere.
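One way to picture the non-contrastive route is a loss that fits the good pairs plus a penalty against the collapsed "zero everywhere" solution. The variance-style penalty below is an illustrative assumption in the spirit of methods like VICReg, not the exact regularizer discussed:

```python
# Non-contrastive sketch: fit compatible pairs, and regularize so the
# system cannot collapse to giving low energy to everything.
def mean(vals):
    return sum(vals) / len(vals)

def loss(w, xs, ys, lam=1.0):
    embeds = [w * x for x in xs]
    fit = mean([(e - y) ** 2 for e, y in zip(embeds, ys)])  # energy on good pairs
    m = mean(embeds)
    var = mean([(e - m) ** 2 for e in embeds])              # embedding spread
    anti_collapse = max(0.0, 1.0 - var)                     # punish low variance
    return fit + lam * anti_collapse

xs = [1.0, 2.0, 3.0, 4.0]
ys = xs                        # identity pairs for the toy example
collapsed = loss(0.0, xs, ys)  # w = 0 maps everything to one point
healthy = loss(1.0, xs, ys)    # w = 1 fits the pairs and keeps spread
```

Because the penalty limits how much of the space can sit at low energy, pushing energy down on the training pairs forces it up elsewhere.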
How is the concept of energy-based models applied in visual data processing?
-In visual data processing, the energy of the system is represented by the prediction error of the representation when comparing a corrupted version of an image or video to the actual, uncorrupted version. A low energy indicates a good match, while a high energy indicates significant differences.
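The visual case can be sketched in a few lines. The mean-pooling "encoder" and identity "predictor" below are toy stand-ins, not the real JEPA networks:

```python
# JEPA-style energy sketch for images: encode a clean input and a
# corrupted view, predict the clean representation from the corrupted
# one, and use the prediction error as the energy.
def encoder(img):
    # img: list of rows; pool each column into a crude representation
    n_rows = len(img)
    return [sum(row[c] for row in img) / n_rows for c in range(len(img[0]))]

def predictor(rep):
    return rep                                   # identity stand-in

def jepa_energy(clean, corrupted):
    pred = predictor(encoder(corrupted))
    actual = encoder(clean)
    return sum((p - a) ** 2 for p, a in zip(pred, actual))

img = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked = [[0.0, 0.0], [3.0, 4.0], [5.0, 6.0]]    # first row masked out
unrelated = [[-5.0, -5.0], [-5.0, -5.0], [-5.0, -5.0]]
```

A masked version of the same image lands at low energy, while an unrelated image lands at high energy, matching the description of prediction error as the energy measure.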
Outlines
Primitive Reasoning in LLMs
The paragraph discusses the limitations of reasoning in large language models (LLMs) due to the constant amount of computation spent per token produced. It highlights that the system does not adjust the computational effort based on the complexity of the question, leading to a fundamental flaw in the way LLMs approach problem-solving. The speaker suggests that future improvements could involve building upon a well-constructed world model and incorporating mechanisms like persistent long-term memory and hierarchical reasoning, which are more akin to human thought processes.
The Future of Dialogue Systems
This section envisions the future of dialogue systems, emphasizing the need for systems that can plan and optimize their answers before producing them. The speaker introduces the concept of an energy-based model that evaluates the quality of an answer to a prompt, suggesting that future systems will operate in an abstract representation space rather than just generating text. The goal is to create a system that can perform iterative optimization and hierarchical reasoning, which is currently beyond the capabilities of auto-regressive LLMs.
Training Energy-Based Models
The paragraph delves into the conceptual framework of training energy-based models, which are designed to output a scalar value indicating the compatibility of a proposed answer with a given prompt. The speaker explains that these models are trained by showing them pairs of compatible inputs and outputs, and the system learns to minimize the output value. The process involves ensuring that the energy is higher for incompatible pairs, which can be achieved through contrastive methods or non-contrastive regularization techniques. The discussion also touches on the importance of abstract representations and the potential for these models to perform reasoning tasks more efficiently than current LLMs.
Visual Data and Energy Functions
This paragraph explores the application of energy functions in the context of visual data, contrasting it with language-based systems. The speaker describes how energy-based models can be used to assess the quality of visual representations by comparing a corrupted image with its uncorrupted version, using the prediction error as the energy measure. The process is highlighted as a way to achieve a compressed and efficient representation of visual reality, which has been successfully applied in classification systems.
Keywords
Computation
Token
Reasoning
Auto-regressive LLMs
World Model
Latent Variables
Optimization
Energy-based Model
Gradient Descent
Inference
Conceptual Training
Highlights
The reasoning in large language models (LLMs) is considered primitive due to the constant amount of computation spent per token produced.
The computation devoted to computing an answer is proportional to the number of tokens produced in the answer, regardless of the question's complexity.
Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, unlike the constant computation model of LLMs.
The future of dialogue systems may involve planning and optimizing answers before expressing them in text, moving away from auto-regressive LLMs.
The concept of system one and system two in humans is introduced as an analogy for the different levels of cognitive tasks and reasoning.
Experienced individuals can perform system one tasks subconsciously, while system two tasks require deliberate planning and thought.
LLMs currently lack the ability to perform system two tasks, which involve internal world modeling and deliberate planning.
The future blueprint of dialogue systems may involve persistent long-term memory and reasoning mechanisms built on top of a well-constructed world model.
The idea of a mental model that allows planning of responses before expressing them is crucial for advanced dialogue systems.
The optimization process for dialogue systems involves abstract representation and searching for an answer that minimizes a cost function.
The concept of an energy-based model is introduced, where the model outputs a scalar number to measure the quality of an answer.
The future of dialogue systems may involve differentiable systems that allow for gradient-based inference and optimization in continuous spaces.
The training of an energy-based model involves showing it pairs of compatible inputs and outputs, and adjusting the neural network to produce low energy for correct answers.
Contrastive methods are used to train energy-based models by presenting both good and bad examples and adjusting the system to produce higher energy for incorrect answers.
Non-contrastive methods ensure higher energy for incompatible pairs by minimizing the volume of space that can take low energy.
The concept of latent variables and abstract representations is crucial for optimizing and planning complex answers in future dialogue systems.
The indirect method of training LLMs through probability distribution over tokens results in a basic level of reasoning but lacks the depth of system two tasks.
The potential for visual data applications of energy-based models is discussed, where the energy represents the prediction error of a representation.
The energy-based model approach aims to provide a compressed representation of visual reality, which has proven effective in classification tasks.
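The "finite amount of probability to go around" point from the highlights can be checked directly: softmax normalization means boosting one token's logit necessarily lowers every other token's probability, and the negative log-probability of a sequence then behaves like an energy that factorizes over successive conditionals. A small sketch:

```python
import math

# Softmax normalization: probabilities sum to one, so raising one
# token's probability necessarily lowers the others'.
def softmax(logits):
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sequence_energy(step_probs):
    # E(sequence) = -sum_t log p(token_t | previous tokens)
    return -sum(math.log(p) for p in step_probs)

p = softmax([2.0, 0.5, -1.0])
# Boost the first token's logit: its probability rises, the rest fall.
q = softmax([3.0, 0.5, -1.0])
```

This is the indirect mechanism by which next-token training pushes energy up on bad sequences without ever touching the joint distribution explicitly.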
Transcripts
The type of reasoning that takes place in an LLM is very, very primitive, and the reason you can tell it's primitive is that the amount of computation spent per token produced is constant. So if you ask a question, and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it. So essentially, it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something: the amount of computation the system will be able to devote to the answer is constant, or rather proportional to the number of tokens produced in the answer. This is not the way we work. The way we reason is that when we're faced with a complex problem or complex question, we spend more time trying to solve it and answer it, because it's more difficult.

There's a prediction element, there's an iterative element where you're adjusting your understanding of a thing by going over it over and over, there's a hierarchical element, and so on. Does this mean it's a fundamental flaw of LLMs, or does it mean that there's more to that question?

Now you're just behaving like an LLM, immediately answering. No, it's just the low-level world model on top of which we can then build some of these kinds of mechanisms, like you said: persistent long-term memory, or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model.

Okay, whether it's difficult or not, the near future will say, because a lot of people are working on reasoning and planning abilities for dialogue systems. Even if we restrict ourselves to language, just having the ability to plan your answer before you answer, in terms that are not necessarily linked with the language you're going to use to produce the answer, so this idea of a mental model that allows you to plan what you're going to say before you say it, that is very important. There are going to be a lot of systems over the next few years that have this capability, but the blueprint of those systems will be extremely different from auto-regressive LLMs.

It's the same difference as the difference between what psychologists call System 1 and System 2 in humans. System 1 is the type of task that you can accomplish without deliberately, consciously thinking about how you do it; you've done it enough that you can just do it subconsciously, without thinking about it. If you're an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio. If you are a very experienced chess player, you can play against a non-experienced chess player without really thinking either; you just recognize the pattern and you play. That's System 1: all the things that you do instinctively without really having to deliberately plan and think about them. And then there are all the tasks where you need to plan. If you are a not-too-experienced chess player, or you are experienced but you play against another experienced chess player, you think about all kinds of options. You think about it for a while, and you're much better if you have time to think about it than if you play blitz with limited time.

So this type of deliberate planning, which uses your internal world model, that's System 2, and this is what LLMs currently cannot do. So how do we get them to do this? How do we build a system that can do this kind of planning, or reasoning, that devotes more resources to complex problems than to simple problems? It's not going to be auto-regressive prediction of tokens; it's going to be something more akin to inference of latent variables in what used to be called probabilistic models or graphical models and things of that type.

Basically, the principle is this: the prompt is like observed variables, and what the model does is basically provide a measure; it can measure to what extent an answer is a good answer for a prompt. Think of it as some gigantic neural net, but it's got only one output, and that output is a scalar number, which is, let's say, zero if the answer is a good answer for the question, and a large number if the answer is not a good answer for the question. Imagine you had this model. If you had such a model, you could use it to produce good answers. The way you would do it is: produce the prompt, and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model.

But that energy-based model would need the model constructed by the LLM?

Well, really what you would need to do is not search over possible strings of text that minimize that energy. Instead, you would do this in abstract representation space. In the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model, which is just a scalar. It's an optimization process. So now the way the system produces its answer is through optimization, by minimizing an objective function, basically. And we're talking about inference here, not about training; the system has been trained already. So now we have an abstract representation of the thought of the answer, a representation of the answer. We feed that to basically an auto-regressive decoder, which can be very simple, that turns this into a text that expresses this thought. That, in my opinion, is the blueprint of future dialogue systems: they will think about their answer, plan their answer by optimization, before turning it into text. And that is Turing-complete.

Can you explain exactly what the optimization problem there is? What's the objective function? Just linger on it. You kind of briefly described it, but over what space are you optimizing?

The space of representations, abstract representations. So you have an abstract representation inside the system. You have a prompt; the prompt goes through an encoder, produces a representation, perhaps goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a good answer, because there might be some complicated reasoning you need to do. So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. Now, let's set aside for a moment the issue of how you train that system to measure whether an answer is a good answer, and suppose such a system could be created.

But what's the process, this kind of search-like process?

It's an optimization process. You can do this if the entire system is differentiable. The scalar output is the result of running the representation of the answer through some neural net; then, by back-propagating gradients, you can figure out how to modify the representation of the answer so as to minimize that output. So that's still gradient-based; it's gradient-based inference. Now you have a representation of the answer in abstract space, and you can turn it into text. And the cool thing about this is that the representation can now be optimized through gradient descent, but it's also independent of the language in which you're going to express the answer.

Right, so you're operating in the abstract representation. This goes back to the joint embedding, the idea that it is better to work in, to romanticize the notion, the space of concepts versus the space of concrete sensory information.

Right.

Okay, but can this do something like reasoning, which is what we're talking about?

Well, not really, only in a very simple way. Basically, you can think of those things as doing the kind of optimization I was talking about, except they optimize in a discrete space, which is the space of possible sequences of tokens, and they do this optimization in a horribly inefficient way, which is: generate a lot of hypotheses and then select the best ones. That's incredibly wasteful in terms of computation, because you basically have to run your LLM for every generated sequence. It's much better to do an optimization in continuous space, where you can do gradient descent: instead of generating tons of things and then selecting the best, you just iteratively refine your answer to go towards the best one. That's much more efficient, but you can only do this in continuous spaces with differentiable functions.

You're talking about the reasoning, the ability to think deeply or to reason deeply. How do you know what is an answer that's better or worse, based on deep reasoning?

Right, so then we're asking the question of, conceptually, how you train an energy-based model. An energy-based model is a function with a scalar output, just a number. You give it two inputs, X and Y, and it tells you whether Y is compatible with X or not. X you observe; let's say it's a prompt, an image, a video, whatever. And Y is a proposal for an answer, a continuation of the video, whatever. And it tells you whether Y is compatible with X. The way it tells you that Y is compatible with X is that the output of that function will be zero if Y is compatible with X, and a positive, non-zero number if Y is not compatible with X.

How do you train a system like this? At a completely general level, you show it pairs of X and Y that are compatible, a question and the corresponding answer, and you train the parameters of the big neural net inside to produce zero. Now, that doesn't completely work, because the system might decide, "well, I'm just going to say zero for everything." So you have to have a process to make sure that, for a wrong Y, the energy would be larger than zero. And there you have two options. One is contrastive methods: you show the system an X and a bad Y, and you tell it to give a high energy to this, to push up the energy, to change the weights in the neural net that compute the energy so that it goes up. The problem with this is that if the space of Y is large, the number of such contrastive samples you're going to have to show is gigantic. But people do this. When you train a system with RLHF, basically what you're training is what's called a reward model, which is basically an objective function that tells you whether an answer is good or bad, and that's basically exactly what this is. So we already do this to some extent; we're just not using it for inference, we're just using it for training.

There is another set of methods which are non-contrastive, and I prefer those. Those non-contrastive methods basically say: okay, the energy function needs to have low energy on pairs of X and Y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? The way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. There are all kinds of different specific ways to do this, depending on the architecture, but that's the basic principle: if you push down the energy function for particular regions in the XY space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy, by the construction of the system or by the regularizing function.

We've been talking very generally, but what is a good X and a good Y? What is a good representation of X and Y? Because we've been talking about language, and if you just take language directly, that presumably is not good, so there has to be some kind of abstract representation of ideas.

Yeah, you can do this with language directly, by just making X a text and Y the continuation of that text, or X a question and Y the answer.

But you're saying that's not going to take it; I mean, that's what LLMs are doing.

Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside the system there is a latent variable, call it Z, that you can manipulate so as to minimize the output energy, then that Z can be viewed as a representation of a good answer that you can translate into a Y that is a good answer. So this kind of system could be trained in a very similar way, but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. Currently, in LLMs, this is very implicit; it's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give a high probability to a word, you automatically give low probability to other words, because you only have a finite amount of probability to go around; it has to sum to one. So when you minimize the cross-entropy, or whatever, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Indirectly, that gives high probability to sequences of words that are good, and low probability to sequences of words that are bad, but it's very indirect. It's not obvious why this actually works at all, because you're not doing it on the joint probability of all the symbols in a sequence; you sort of factorize that probability in terms of conditional probabilities over successive tokens.

So how do you do this for visual data?

We've been doing this with the JEPA architectures, basically the joint embedding ones. There, the compatibility between two things is: here's an image or a video, and here's a corrupted, shifted, or transformed version of that image or video, or a masked one. The energy of the system is the prediction error of the representation: the predicted representation of the good thing versus the actual representation of the good thing. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. This system will tell you, if this is a good image and this is a corrupted version, it will give you zero energy if one of them is effectively a corrupted version of the other, and a high energy if the two images are completely different. And hopefully that whole process gives you a really nice compressed representation of reality, of visual reality. And we know it does, because we then use those representations as inputs to a classification system, and that classification system works really nicely.