LLMs are not superintelligent | Yann LeCun and Lex Fridman
Summary
TLDR: The transcript discusses the limitations of large language models (LLMs) in achieving superhuman intelligence. It highlights that while LLMs can process vast amounts of text, they lack the ability to understand the physical world, possess persistent memory, reason, and plan effectively. The speaker argues that intelligence requires grounding in reality and that most knowledge is acquired through sensory input and interaction with the world, not just language. They also touch on the challenges of creating AI systems that can build a comprehensive world model and the current methods being explored to improve AI's understanding and interaction with the physical environment.
Takeaways
- 🤖 Large language models (LLMs) like GPT-4 and LLaMa 2/3 are not sufficient for achieving superhuman intelligence due to their limitations in understanding, memory, reasoning, and planning.
- 🧠 Human and animal intelligence involves understanding the physical world, persistent memory, reasoning, and planning, which current LLMs lack.
- 📚 LLMs are trained on vast amounts of text data, but this is still less than the sensory data a four-year-old processes, highlighting the importance of non-linguistic learning.
- 📈 Language is a compressed form of information, but it is an approximate representation of our mental models and percepts, suggesting that more than language is needed for true intelligence.
- 🚀 There is a debate among philosophers and cognitive scientists about whether intelligence needs to be grounded in reality, with the speaker advocating for a connection to physical or simulated reality.
- 🤔 The complexity of the world is difficult to represent, and current LLMs are not trained to handle the intricacies of intuitive physics or common-sense reasoning about the physical space.
- 🛠️ LLMs are trained using an autoregressive prediction method, which is different from human thought processes that are not strictly tied to language.
- 🌐 Building a complete world model requires more than just predicting words; it involves observing and understanding the world's evolution and predicting the consequences of actions.
- 🔍 Current methods for training systems to learn representations of images by reconstruction from corrupted versions have largely failed, indicating a need for alternative approaches.
- 🔗 Joint embedding predictive architecture (JEPA) is a promising alternative to traditional reconstruction-based training; it trains a predictor to predict the representation of the full input from the representation of a corrupted version.
Q & A
What are the key characteristics of intelligent behavior mentioned in the transcript?
-The key characteristics of intelligent behavior mentioned are the capacity to understand the world, the ability to remember and retrieve things (persistent memory), the ability to reason, and the ability to plan.
Why are Large Language Models (LLMs) considered insufficient for achieving superhuman intelligence?
-LLMs are considered insufficient because they do not possess or can only perform in a primitive way the essential characteristics of intelligence such as understanding the physical world, persistent memory, reasoning, and planning.
How does the amount of data a four-year-old processes visually compare to the data used to train LLMs?
-A four-year-old processes approximately 10^15 bytes of visual data, which is significantly more than the 2 * 10^13 bytes of text LLMs are trained on, an amount that would take a person about 170,000 years to read.
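A quick back-of-the-envelope check of these figures, using only the numbers quoted in the conversation (a rough sketch, not an exact measurement):

```python
# Back-of-the-envelope check of the figures quoted in the transcript.
optic_nerve_bytes_per_s = 20e6      # ~20 MB/s through the optic nerve (speaker's estimate)
waking_hours = 16_000               # hours a four-year-old has been awake (speaker's estimate)
visual_bytes = optic_nerve_bytes_per_s * waking_hours * 3600
print(f"visual input by age four: ~{visual_bytes:.1e} bytes")   # ~1.2e+15, i.e. on the order of 10^15

llm_tokens = 1e13                   # order of magnitude of tokens in LLM training corpora
llm_bytes = llm_tokens * 2          # ~2 bytes per token
print(f"LLM training text: ~{llm_bytes:.1e} bytes")             # ~2.0e+13
print(f"ratio: ~{visual_bytes / llm_bytes:.0f}x more sensory data")
```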
What is the argument against the idea that language alone contains enough wisdom and knowledge to construct a world model?
-The argument is that language is a compressed and approximate representation of our percepts and mental models. It lacks the richness of the environment and most of our knowledge comes from observation and interaction with the real world, not just language.
What is the debate among philosophers and cognitive scientists regarding the grounding of intelligence?
-The debate is whether intelligence needs to be grounded in reality, with some arguing that intelligence cannot appear without some grounding, whether physical or simulated, while others may not necessarily agree with this.
Why are tasks like driving a car or clearing a dishwasher more challenging for AI compared to passing a bar exam?
-These tasks are more challenging because they require intuitive physics and common-sense reasoning about the physical world, which LLMs currently lack. They are trained on text and do not understand intuitive physics as well as humans do.
How do LLMs generate text?
-LLMs generate text through an autoregressive prediction process where they predict the next word based on the previous words in a text, using a probability distribution over possible words.
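A minimal sketch of that loop, assuming a hypothetical `model` that maps a batch of token-id sequences to next-token logits of shape (batch, length, vocab); no particular LLM or library API is implied:

```python
import torch
import torch.nn.functional as F

def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0):
    """Autoregressive decoding: predict a distribution over the vocabulary,
    sample one token, shift it into the input, and repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]                # logits for the next token
        probs = F.softmax(logits / temperature, dim=-1)           # distribution over the vocabulary
        next_id = torch.multinomial(probs, num_samples=1).item()  # sample rather than take the argmax
        ids.append(next_id)                                       # the sampled token becomes part of the input
    return ids
```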
What is the difference between the autoregressive prediction of LLMs and human speech planning?
-Human speech planning involves thinking about what to say independent of the language used, while LLMs generate text one word at a time based on the previous words without an overarching plan.
What is the fundamental limitation of generative models in video prediction?
-The fundamental limitation is that the world is incredibly complex and rich in information compared to text. Video is high-dimensional and continuous, making it difficult to represent distributions over all possible frames in a video.
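The contrast drawn here can be made concrete: for text, the model only needs a categorical distribution over a finite vocabulary, whereas a video frame is a point in an enormous continuous space. The sizes below are illustrative, not from the conversation:

```python
import torch
import torch.nn.functional as F

# Over a finite vocabulary, the next-token distribution is just a softmax over logits.
vocab_size = 32_000                       # illustrative vocabulary size
logits = torch.randn(vocab_size)
p_next_token = F.softmax(logits, dim=-1)  # exact, normalized distribution over every possible token

# A single 256x256 RGB frame lives in a ~197,000-dimensional continuous space;
# there is no comparably simple way to write down a normalized distribution over all possible frames.
frame_dims = 256 * 256 * 3
print(f"{vocab_size} discrete outcomes vs a {frame_dims}-dimensional continuous frame")
```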
What is the concept of joint embedding and how does it differ from traditional image reconstruction methods?
-Joint embedding involves encoding both the full and corrupted versions of an image and training a predictor to predict the representation of the full image from the corrupted one. This differs from traditional methods that focus on reconstructing a good image from a corrupted version, which has proven to be ineffective in learning good generic features of images.
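A minimal sketch of that objective (the encoder, predictor, and the stop-gradient on the target branch are illustrative choices, not the specific recipe used at Meta):

```python
import torch
import torch.nn as nn

class JointEmbeddingPredictor(nn.Module):
    """Encode both views, then predict the full view's representation from the corrupted one."""
    def __init__(self, encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.encoder = encoder      # applied to both views (identical or near-identical branches)
        self.predictor = predictor  # maps the corrupted view's representation to the full view's

    def loss(self, x_full, x_corrupted):
        with torch.no_grad():                       # target branch kept fixed here, one common anti-collapse trick
            target = self.encoder(x_full)
        pred = self.predictor(self.encoder(x_corrupted))
        return ((pred - target) ** 2).mean()        # error measured in representation space, not pixel space
```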
Outlines
🤖 Limitations of Large Language Models (LLMs)
The speaker discusses the limitations of autoregressive LLMs like GPT-4 and Llama 2/3 in achieving superhuman intelligence. They lack essential characteristics of intelligence such as understanding the physical world, persistent memory, reasoning, and planning. Despite their inability to fully understand or interact with the world, LLMs are useful and can support an ecosystem of applications. The speaker also compares the amount of data LLMs are trained on to the sensory input a four-year-old receives, highlighting that most knowledge comes from observation and interaction with the real world, not language.
🌐 The Debate on Grounding Intelligence in Reality
The speaker explores the debate on whether intelligence needs to be grounded in reality, arguing that it does. They point out that language is an approximate representation of our mental models and that much of our knowledge comes from physical interaction with the world. The speaker also touches on the challenges of representing the complexities of the real world in AI and the limitations of current LLMs in understanding intuitive physics and common sense reasoning.
📈 The Training Process of LLMs
The speaker explains the training process of LLMs, which involves predicting missing words in a text. This autoregressive prediction method allows the model to generate text one word at a time. The speaker contrasts this with human thought processes, which are not tied to language and involve planning and mental models. They argue that LLMs lack this higher level of abstraction and planning, which is crucial for true intelligence.
🚀 Building World Models and Predicting Actions
The speaker discusses the concept of building world models for AI, which involves understanding and predicting the evolution of the world based on actions. They argue that while it's possible to build a world model by predicting words, it's not feasible with the current LLMs due to the limitations of language as a low-bandwidth medium. The speaker also mentions the challenges of representing high-dimensional continuous spaces, which are necessary for video and image understanding.
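In code terms, the world model described here is a learned transition function: given a representation of the state of the world at time t and a candidate action, predict the representation at time t+1. A hypothetical sketch, with placeholder architecture and dimensions:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predicts an abstract representation of the next state, not every pixel-level detail."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.transition = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state_t: torch.Tensor, action_t: torch.Tensor) -> torch.Tensor:
        # s_{t+1} ≈ f(s_t, a_t): state of the world at time t plus an action -> predicted state at t+1
        return self.transition(torch.cat([state_t, action_t], dim=-1))
```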
🔍 The Failure of Self-Supervised Image Reconstruction
The speaker addresses the failure of self-supervised methods in learning good image representations by reconstructing corrupted images. They compare this to the success of LLMs in text prediction and argue that the same approach does not work for images due to the high dimensionality and complexity of visual data. The speaker then introduces the concept of joint embedding, which involves training a system to predict the representation of a full image from a corrupted version, as a potential solution to this problem.
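For contrast, the reconstruction-based objective the speaker says has failed for images looks roughly like this (a generic masked-reconstruction sketch, not the exact recipe of any published method):

```python
def masked_reconstruction_loss(encoder, decoder, image, mask):
    """Corrupt the input by masking patches, then train to reconstruct the original pixels."""
    corrupted = image * mask                      # zero out the masked patches
    reconstruction = decoder(encoder(corrupted))  # encode the corrupted view, decode back to pixel space
    return ((reconstruction - image) ** 2 * (1 - mask)).mean()  # reconstruction error on the hidden region
```

The joint-embedding alternative sketched earlier replaces the pixel-space decoder with a predictor that works purely in representation space.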
🔄 Contrastive and Non-Contrastive Learning Methods
The speaker discusses the limitations of contrastive learning methods, which involve training representations to be similar for similar images and dissimilar for different images. They mention the emergence of non-contrastive methods that do not require negative samples and rely on other techniques to prevent system collapse. The speaker highlights the development of several new methods over the past few years that can improve the training of such systems.
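A schematic version of the contrastive idea described here (the distance metric and margin are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, same_image: bool, margin: float = 1.0):
    """Pull representations of two views of the same image together;
    push representations of different images apart, up to a margin."""
    d = F.pairwise_distance(z_a, z_b)
    if same_image:
        return (d ** 2).mean()                           # positive pair: make representations similar
    return (torch.clamp(margin - d, min=0) ** 2).mean()  # negative pair: push them at least `margin` apart
```

Non-contrastive methods drop the negative-pair branch entirely and, as the paragraph above notes, rely on other tricks to keep the representations from collapsing to a constant.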
Keywords
💡Artificial Intelligence (AI)
💡Large Language Models (LLMs)
💡Autoregressive Learning
💡Persistent Memory
💡Reasoning
💡Planning
💡Sensory Input
💡Embodied AI
💡Intuitive Physics
💡Joint Embedding Predictive Architecture (JEPA)
Highlights
Large language models (LLMs) like GPT-4 and LLaMa 2/3 are not the path to superhuman intelligence due to their limitations in understanding the physical world, memory, reasoning, and planning.
LLMs are trained on vast amounts of text data, but this is not as rich as the sensory input a human experiences, especially during early childhood.
A four-year-old's visual cortex receives more information than 170,000 years of reading text, indicating that most learning comes from observation and interaction with the real world.
Language is a compressed form of information, but it is an approximate representation of our percepts and mental models.
Intelligence needs to be grounded in reality, whether physical or simulated, to truly understand and interact with the world.
The complexity of the world is difficult to represent, and current LLMs are not trained to handle the intuitive physics or common sense reasoning required for such understanding.
LLMs are trained using an autoregressive prediction method, which is different from human thought processes that are not strictly tied to language.
There is a debate among philosophers and cognitive scientists about whether intelligence can exist without grounding in reality.
Current LLMs lack the ability to construct a world model and understand the physical world, which is a significant limitation for achieving human-level intelligence.
The training process of LLMs involves predicting missing words in a text, which is a simplistic approach compared to the complexity of the world and its representation.
Attempts to train models to predict video frames have been unsuccessful, highlighting the difficulty of representing high-dimensional continuous spaces.
Joint embedding predictive architecture (JEPA) is a promising approach that trains a predictor to predict the representation of the full input from the representation of a corrupted version.
Contrastive learning methods have been developed to improve the training of image representations, but they have limitations.
Non-contrastive methods have emerged in recent years, allowing for training without negative samples, which could potentially improve the quality of learned representations.
The failure of self-supervised reconstruction methods for images suggests that simply reconstructing from corrupted data does not lead to good generic features for image recognition tasks.
Supervised learning with labeled data produces better image representations and recognition performance compared to self-supervised reconstruction methods.
The transcript discusses the limitations of current AI models and the potential of new methods like JEPA and non-contrastive learning to advance the field of artificial intelligence.
Transcripts
you've had some strong
statements technical statements about
the future of artificial intelligence
recently throughout your career actually
but recently as well uh you've said that
autoregressive LLMs are uh not the way
we're going to make progress towards
superhuman intelligence these are the
large language models like GPT 4 like
llama 2 and three soon and so on how do
they work why are they not going to take
us all the way for a number of reasons
the first is that there is a number of
characteristics of intelligent
behavior for example the capacity to
understand the world understand the
physical
world the ability to remember and
retrieve
things um persistent memory the ability
to reason and the ability to plan those
are four essential characteristics of
intelligent
um systems or entities humans
animals LLMs can do none of those or they
can only do them in a very primitive way
and uh they don't really understand the
physical world they don't really have
persistent memory they can't really
reason and they certainly can't plan and
so you know if if if you expect the
system to become intelligent just you
know without having the possibility of
doing those things uh you're making a
mistake
that is not to say that autoregressive LLMs are not
useful they're certainly
useful um that they're not interesting
that we can't build a whole ecosystem of
applications around them of course we
can but as a path towards human level
intelligence they're missing essential
components and then there is another
tidbit or or fact that I think is very
interesting those LLMs are trained on
enormous amounts of text basically the
entirety of all publicly available text
on the internet right that's
typically on the order of 10 to the 13
tokens each token is typically two bytes
so that's two times 10 to the 13 bytes as
training data it would take you or me
170,000 years to just read through this
at eight hours a day uh so it seems like
an enormous amount of knowledge right
that those systems can
accumulate um
but then you realize it's really not
that much data if you you talk to
developmental psychologists and they
tell you a four-year-old has been awake
for 16,000 hours in his or her
life um and the amount of information
that has uh reached the visual cortex of
that child in four years um is about 10
to the 15 bytes and you can compute this
by estimating that the optic nerve
carries about 20 megabytes per
second roughly and so 10^15 bytes
for a four-year-old versus 2 * 10^13
bytes for 170,000 years worth of
reading what it tells you is that uh
through sensory input we see a lot more
information than we than we do through
language and that despite our
intuition most of what we learn and most
of our knowledge is through our
observation and interaction with the
real world not through language
everything that we learn in the first
few years of life and uh certainly
everything that animals learn has
nothing to do with language so it would
be good to uh maybe push against some of
the intuition behind what you're saying
so it is true there's several orders of
magnitude more data coming into the
human
mind much faster and the human mind is
able to learn very quickly from that
filter the data very quickly you know
somebody might argue your comparison
between sensory data versus language
that language is already very compressed
it already contains a lot more
information than the bytes it takes to
store them if you compare it to visual
data so there's a lot of wisdom and
language there's words and the way we
stitch them together it already contains
a lot of information so is it possible
that language alone already has
enough wisdom and knowledge in there to
be able to from that language construct
a a world model and understanding of the
world an understanding of the physical
world that you're saying LLMs lack so
it's a big debate among uh philosophers
and also cognitive scientists like
whether intelligence needs to be
grounded in
reality uh I'm clearly in the camp that
uh yes uh intelligence cannot appear
without some grounding in uh some
reality doesn't need to could be you
know physical reality it could be
simulated but um but the environment is
just much richer than what you can
express in language language is a very
approximate
representation of our percepts and our
mental models right I mean there there's
a lot of tasks that we accomplish where we
manipulate uh a mental model of uh of
the situation at hand and that has
nothing to do with language everything
that's physical mechanical whatever when
we build something when we accomplish a
task a manual task of you know grabbing
something etc we plan our action
sequences and we do this by essentially
Imagining the result of the outcome of
sequence of actions that we might
imagine and that requires mental models
that don't have much to do with language
and that's I would argue most of our
knowledge is derived from that
interaction with the physical world so a
lot of a lot of my my colleagues who are
more uh interested in things like
computer vision are really on that camp
that uh AI needs to be embodied
essentially and then other people coming
from the NLP side or maybe you know some
some other U motivation don't
necessarily agree with that um and
philosophers are split as well uh and
the um the complexity of the world is
hard to um it's hard to imagine
you know it's hard to represent uh all
the complexities that we take completely
for granted in the real world that we
don't even imagine require intelligence
right this is the old Moravec paradox from
the pioneer of robotics Hans Moravec who
said you know how is it that with
computers it seems to be easy to do high
level complex tasks like playing chess
and solving integrals and doing things
like that whereas the thing we take for
granted that we do every day um like I
don't know learning to drive a car or
you know grabbing an object we can't do with
computers um
and you know we have llms that can pass
pass the bar exam so they must be smart
but then they can't learn to drive in 20
hours like any 17-year-old they can't learn
to clear up the dinner table and fill up
the dishwasher like any 10-year-old can
learn in one shot um why is that like
you what what are we missing what what
type of learning or or reasoning
architecture or whatever are we missing
that um um basically prevent us from
from you know having level five self-driving
cars and domestic robots can a large
language model construct a world model
that does know how to drive and does
know how to fill a dishwasher but just
doesn't know how to deal with visual
data at this time so it it can operate in
a space of Concepts so yeah that's what
a lot of people are working on so the
answer the short answer is no and the
more complex answer is you can use all
kind of tricks to get uh uh an llm to
basically digest um visual
representations of representations of
images uh or video or audio for that
matter um and uh a classical way of
doing this
is uh you train a vision system in some
way and we have a number of ways to
train vision systems either supervised
semi-supervised self-supervised all
kinds of different
ways uh that will turn any image into
high level
representation basically a list of
tokens that are really similar to the
kind of tokens that
uh typical llm takes as an input and
then you just feed that to the llm
in addition to the text and you just
expect the LLM to kind of uh you know during
training to kind of be able to uh use
those representations to help make
decisions I mean there's been work
along those lines for quite a long
time um and now you see those systems
right I mean there are llms that can
that have some Vision extension but
they're basically hacks in the sense
that um those things are not like trained
end to end to to handle to really
understand the world they're not trained
with video for example uh they don't
really understand intuitive physics at
least not at the moment so you don't
think there's something special to you
about intuitive physics about sort of
Common Sense reasoning about the
physical space about physical reality
that's that to you is a giant leap that
llms are just not able to do we're not
going to be able to do this with the
type of llms that we are uh working with
today and there's a number of reasons
for this but uh the main reason
is the way LLMs are trained is that
you you take a piece of text you remove
some of the words in that text you mask
them you replace them by
blank markers and you train a gigantic
neural net to predict the words that are
missing uh and if you build this neural
net in a particular way so that it can
only look at uh words that are to the
left of the one it's trying to predict
then what you have is a system that
basically is trying to predict the next
word in a text right so then you can
feed it um a text a prompt and you can
ask it to predict the next word it can
never predict the next word exactly and
so what it's going to do is uh produce a
probability distribution over all the
possible words in your dictionary in
fact it doesn't predict words it
predicts tokens that are kind of subword
units and so it's easy to handle the
uncertainty in the prediction there
because there's only a finite number of
possible words in the dictionary and you
can just compute a distribution over
them um then what you what the system
does is that it it picks word from that
distribution of course there's a higher
chance of picking words that have a
higher probability within that
distribution so you sample from that
distribution to actually produce a word
and then you shift that word into the
input and so that allows the system now
to predict the second word right and
once you do this you shift it into the
input etc that's called autoregressive
prediction which is why those LLMs
should be called autoregressive LLMs uh
but we just call
them LLMs
and there is a difference between this
kind of process and a process by which
before producing a word when you talk
when you and I talk you and I are
bilinguals we think about what we're
going to say and it's relatively
independent of the language in which
we're going to say it when we we talk
about like a I don't know let's say
mathematical concept or something the
kind of thinking that we're doing and
the answer that we're planning to
produce is not linked to whether we're
going to say it in French or Russian or
English Chomsky just rolled his eyes but
I understand so you're saying that
there's a a bigger abstraction that
repres that's uh that goes before
language that maps onto language right
it's certainly true for a lot of
thinking that we that we do is that
obvious that we don't like you're saying
your thinking is same in French as it is
in English yeah pretty much yeah pretty
much or is this like how how flexible
are you like if if there's a probability
distribution well it it depends what
kind of thinking right if it's just uh
if it's like producing puns I get much
better in French than English about that
no but so is there an abstract
representation of puns like is your
humor an abstract like when you tweet
and your tweets are sometimes a little
bit spicy uh what's is there an abstract
representation in your brain of a tweet
before it maps onto English there is an
abstract representation of uh imagining the
reaction of a reader to that uh text
well you start with laughter and then
figure out how to make that happen or so
figure out like a reaction you want to
cause and and then figure out how to say
it right so that it causes that reaction
but that's like really close to language
but think about like a m mathematical
concept uh or um you know imagining you
know something you want to build out of
wood or something like this right the
kind of thinking you're doing has
absolutely nothing to do with language
really like it's not you have
necessarily like an internal monologue
in any particular language you're you're
you know imagining mental models of of
the thing right I mean if I if I ask you
to like imagine what this uh water
bottle will look like if I rotate it 90
degrees um that has nothing to do with
language and so uh so clearly there is
you know a more abstract level of
representation uh in which we we do most
of our thinking and we plan what we're
going to say if the output is
is you know uttered words as opposed to
an output being uh you know muscle
actions right um we we plan our answer
before we produce it and LLMs don't do
that they just produce one word after
the other instinctively if you want it's
like it's a bit like the you know
subconscious uh actions where you don't
like you're distracted you're doing
something you completely concentrated
and someone comes to you and you know
asks you a question and you kind of
answer the question you don't have time
to think about the answer but the answer
is easy so you don't need to pay
attention you sort of respond
automatically that's kind of what an llm
does right it doesn't think about its
answer really uh it retrieves it because
it's accumulated a lot of knowledge so
it can retrieve some some things but
it's going
to just spit out one token after the
other without planning the answer but
you're making it sound just one token
after the other one token at a time
generation is uh bound to be
simplistic but if the world model is
sufficiently sophisticated that one
token at a
time the the most likely thing it
generates is a sequence of tokens is
going to be a deeply profound thing okay
but then that assumes that those systems
actually possess
World model so really goes to the I I
think the fundamental question is can
you build a a
really complete World model not complete
but a one that has a deep understanding
of the world yeah so can you build this
first of all by prediction right and the
answer is probably yes can you predict
can you build it by predicting words and
the answer is most probably no
because language is very poor in terms
or weak or low bandwidth if you want
there's just not enough information
there so building World models means
observing the
world and uh understanding why the world
is evolving the way the way it is and
then uh the the extra component of a
world model is something that can
predict how the world is going to evolve
as a consequence of action you might
take right so what a world model really is here
is my idea of the state of the world at
time t here is an action I might take
what is the predicted state of the world
at time
t+1 now that state of the world
does not need to represent everything
about the world it just needs to
represent enough that's relevant for
this planning of of the action but not
necessarily all the details now here is
the problem um you're not going to be
able to do this with generative models
so a generative model trained on video
and we've tried to do this for 10 years
you take a video show a system a piece
of video and then ask it to predict the
remainder of the video basically predict
what's going to happen one frame at a
time it's the same thing as sort of uh
the autoregressive LLMs do but for video
right either one frame at a time or a group
of frames at a time um but yeah uh a
large video model if you want uh the
idea of of doing this has been floating
around for a long time and at FAIR uh
some colleagues and I have been trying
to do this for about 10
years um and you can't you can't really
do the same trick as with llms because
uh you know LLMs as I said you can't
predict exactly which word is going to
follow a sequence of words but you can
predict a distribution over our words
now if you go to video what you would
have to do is predict the distribution
over all possible frames in a video and
we don't really know how to do that
properly uh we we do not know how to
represent distributions over high
dimensional continuous spaces in ways
that are
useful uh and and therein lies
the main issue and the reason we can't do
this is because the world is incredibly
more complicated and richer in terms of
information than text text is
discrete video is high dimensional and
continuous a lot of details in this um
so if I take a a video of this room uh
and the video is you know a camera
panning
around um there is no way I can predict
everything that's going to be in the
room as I pan around the system cannot
predict what's going to be in the room
as the camera is panning maybe it's
going to predict this is this is a room
where there's a light and there is a
wall and things like that it can't
predict what the painting on the wall
looks like or what the texture of the
couch looks like like certainly not the
texture of the carpet so there's no way
I can predict all those details so the
the way to handle this is one way
possibly to handle this which we've been
working for a long time is to have a
model that has what's called a latent
variable and the latent variable is fed
to a neural net and it's supposed to
represent all the information about the
world that you don't perceive yet and uh
that you need to
augment uh the the system for the
prediction to do a good job at
predicting pixels including the you know
fine texture of the of the carpet and
the and the couch and and the painting
on the wall
um uh that has been a complete failure
essentially and we've tried lots of
things we tried uh just straight neural
nets we tried GANs we tried uh you know
VAEs all kinds of regularized auto
encoders we tried um many things
we also tried those kind of methods to
learn uh good representations of images
or video um that could then be used as
input to for example an image
classification
system and that also has basically
failed like all the systems that attempt
to predict missing parts of an image or
video um you know from a corrupted
version of it basically so right take an
image or a video corrupt it or transform
it in some way
and then try to reconstruct the complete
video or image from the corrupted
version and then hope that internally
the system will develop a good
representations of images that you can
use for object recognition segmentation
whatever it
is that has been essentially a complete
failure and it works really well for
text that's the principle that is used
for llms right so where is the failure
exactly is that that it's very difficult
to form a good representation of an
image a good in like a good embedding of
all all the important information in the
image is it in terms of the consistency
of image to image to image to image that
forms the video like where what are the
if we do a highlight reel of all the
ways you
failed what what's that look like okay
so the reason this doesn't work uh is
first of all I have to tell you exactly
what doesn't work because there is
something else that does work uh so the
thing that does not work is training a
system to learn representations of
images by training it to reconstruct uh
a good image from a corrupted version of
it okay that's what doesn't work and we
have a whole slew of techniques for this
uh that are you know variants of denoising
auto encoders something called MAE
developed by some of my colleagues at
FAIR masked autoencoder so it's basically
like
the you know LLMs or things
like this where you train the system by
corrupting text except you corrupt
images you remove Patches from it and
you train a gigantic neural net to reconstruct
the features you get are not good and
you know they're not good because if you
now train the same architecture but you
train it
supervised with uh labeled data with
textual descriptions of images etc you
do get good representations and the
performance on recognition task is much
better than if you do this
self-supervised pre-training so the
architecture is good the
architecture of the encoder is good okay
but the fact that you train the system
to reconstruct images does not lead it
to produce to learn good generic
features of images when you train in a
self-supervised way self-supervised by
reconstruction Yeah by reconstruction
okay so what's the
alternative the alternative is joint
embedding what is joint embedding what
are what are these architectures that
you're so excited about okay so now
instead of training system to encode the
image and then training it to
reconstruct the the full image from a
corrupted version you take the full
image you take
the corrupted or transformed version you
run them both through
encoders which in general are identical
but not
necessarily and then you you train a
predictor on top of those
encoders um to predict the
representation of the full input from
the representation of the corrupted
one okay so joint embedding because
you're you're taking the the full input
and the corrupted version or transform
version run them both through encoders
so you get a joint embedding and then
you and then you're you're saying can I
predict the representation of the full
one from the representation of the
corrupted one okay um and I call this
JEPA so that means joint embedding
predictive architecture because this
joint embedding and there is this
predictor that predicts the
representation of the good guy from
the bad
guy um and the big question is how do
you train something like this uh and
until five years ago or six years ago we
didn't have particularly good answers
for how you train those things except
for one um called
contrastive
learning
where uh and the idea of contrastive
learning is you you take a pair of
images that are again an image and a
corrupted version or degraded version
somehow or transformed version of the
original one and you train the predicted
representation to be the same as I said
if you only do this the system collapses
it basically completely ignores the
input and produces representations that
are
constant so the contrastive methods
avoid this and and those things have
been around since the early 90s I had a
paper on this in
1993 um is you also show pairs of images
that you know are
different and then you push away the
representations from each other so you
say not only do representations of
things that we know are the same should
be the same or should be similar but
representation of things that we know
are different should be
different and that prevents the collapse
but it has some limitation and there's a
whole bunch of uh techniques that have
appeared over the last six seven years
um that can revive this this type of
method um some of them from Fair some of
them from from Google and other places
um but there are limitations to those
contrastive methods what has changed in the
last
uh you know three four years is now now
we have methods that are non-contrastive
so they don't require those negative
contrastive samples of images that are
that we know are different you can
train them only with images that are
you know different versions or different
views of the same thing uh and you rely
on some other tricks to prevent the
system from collapsing and we have half
a dozen different methods for this
now