Yann LeCun | Objective-Driven AI: Towards AI systems that can learn, remember, reason, and plan
Summary
TL;DR: In this Harvard lecture, Yann LeCun explored the future of artificial intelligence, in particular objective-driven AI systems. He highlighted how current AI systems fall short of humans and animals in learning and understanding the world, and proposed a new AI architecture built around self-supervised learning and energy-based models. LeCun argued that open-source AI platforms are essential for democracy and technological innovation, and expressed optimism about future intelligent assistants amplifying humanity's collective intelligence.
Takeaways
- 🌟 The future of AI should not rest solely on today's large language models (LLMs); it should evolve toward objective-driven AI architectures.
- 🚀 Current AI systems fall far short of humans and animals in learning and understanding the world, for example lacking common sense and objective-driven behavior.
- 📈 The success of self-supervised learning in AI, especially for text, image recognition, and speech translation, opens new possibilities for future development.
- 🧠 Humans and animals can learn new tasks quickly and understand how the world works, capabilities that existing AI systems are still far from matching.
- 🔍 Building smarter AI systems requires systems that learn world models from sensory input, have persistent memory, and can plan hierarchically.
- 🛠️ Current AI systems are limited in logical understanding, common-sense reasoning, and real-world knowledge; overcoming this requires new learning paradigms.
- 🔗 Future AI systems will mediate our interactions with the digital world, so they need human-level intelligence to provide a good user experience.
- 🌐 Open-source AI platforms are essential for diversity and broad access to AI, helping prevent the technology from being monopolized by a few companies.
- 📚 Education and research institutions should study energy-based models and energy-based learning in depth to develop more efficient and safer AI systems.
- 📈 Through self-supervised learning and joint-embedding predictive architectures (JEPA), AI systems can learn useful feature representations without large amounts of labeled data.
- 🔑 Ensuring the safety and controllability of AI systems requires embedding guardrail objectives and safety controls directly into their architecture.
Q & A
What were Yann LeCun's main points about the future of machine learning in this lecture?
-LeCun focused on the future of machine learning, in particular Objective-Driven AI. He stressed the large gap between current AI systems and humans or animals in learning efficiency and in understanding the world. He argued for the necessity of building systems with human-level intelligence and discussed the challenges and possible paths to that goal, including self-supervised learning, energy-based models, and predictive architectures.
In what ways does LeCun think existing AI systems fall short in learning efficiency and world understanding?
-LeCun holds that existing AI systems are far less sample-efficient than humans and animals. For example, humans and animals can learn new tasks from a handful of examples or trials, while AI systems need far more data and compute. Moreover, humans and animals understand how the world works and have the ability to reason and plan, as well as common sense — all of which current AI systems lack.
What are the key components of the Objective-Driven AI architecture LeCun proposes?
-The key components include a perception module, an optional persistent memory, a world model, an actor, cost modules, and objective functions. The architecture plans a sequence of actions by optimizing the objective functions, so that the predicted outcome satisfies specific goals.
Why does LeCun consider self-supervised learning critical to the development of AI?
-Because it is the main driver of the recent major advances in AI. Self-supervised learning trains models by extracting the intrinsic structure of the data, without explicit labels. This lets models understand and process data better, yielding strong performance across many tasks such as language modeling, image recognition, and speech recognition.
What is LeCun's view of current reinforcement learning?
-LeCun believes that although reinforcement learning once raised great hopes, it is in fact so inefficient as to be nearly impractical in the real world, unless it relies much more on self-supervised learning. He proposes a new learning paradigm, the objective-driven AI architecture, to achieve smarter and safer AI systems.
What predictions and recommendations does LeCun offer about future AI systems?
-LeCun predicts that future AI systems will mediate our everyday digital interactions and that we will need to build systems with human-level intelligence. He recommends open-source AI platforms, so that anyone can fine-tune AI systems to their own language, culture, and values. He also stresses ensuring safety and controllability as AI develops, to avoid potential risks.
What is LeCun's view of current generative AI models?
-LeCun finds generative models effective for text but ineffective for images and other high-dimensional continuous data. He argues for abandoning generative models in favor of joint-embedding predictive architectures (JEPA), which provide better internal representations and are more effective for tasks such as prediction and planning.
Why does LeCun think energy-based models suit today's AI learning better than probabilistic models?
-Energy-based models offer a more direct way to express the compatibility or incompatibility of data, without worrying about the normalization problem of probabilistic models. Operating directly on the energy function avoids the denominator (the partition function), which is intractable in many statistical-physics problems. Energy-based models are also more flexible in principle, making regularization — and thus the prevention of model collapse — easier.
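A minimal sketch of this idea (my own toy example, not from the talk): training an energy-based model only requires lowering the energy of observed, compatible pairs by gradient descent — the partition function is never computed.

```python
# Toy energy-based model: E(w, x, y) scores the incompatibility of (x, y).
# Training pushes the energy of an observed pair down by gradient descent,
# without ever computing the normalizer Z = sum over y of exp(-E).

def energy(w, x, y):
    # low energy when y is close to w * x
    return 0.5 * (y - w * x) ** 2

def grad_w(w, x, y):
    # dE/dw for the quadratic energy above
    return -(y - w * x) * x

w = 0.0
for _ in range(200):
    w -= 0.1 * grad_w(w, 2.0, 6.0)  # one observed compatible pair (2.0, 6.0)

print(round(w, 3))  # ≈ 3.0: the observed pair now has near-zero energy
```

In a real energy-based model one also has to keep the energy from collapsing to a constant, which is where the regularization LeCun mentions comes in.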
What does LeCun think of current deep learning models' grasp of logic?
-LeCun argues that current deep learning models, especially large language models, excel at generating text but do not truly understand logic. They may need to be taught arithmetic explicitly, and they lack a basic understanding of the real world. Fluent text generation therefore does not imply advanced intelligence or understanding.
Which self-supervised learning method did LeCun discuss, and how effective does he consider it for image recognition?
-LeCun discussed MAE (Masked Auto-Encoder), which masks part of an image and trains a model to reconstruct the missing portion. He noted that while this method has had some success at reconstructing images, it is less effective at producing good internal representations than joint-embedding methods.
What does LeCun recommend for the safety and controllability of future AI systems?
-LeCun recommends objective-driven AI architectures: using a world model to predict the consequences of actions and planning action sequences by optimizing objective functions. He also proposes guardrail objectives to keep systems safe and controllable.
What is LeCun's view of current AI systems' handling of non-linguistic knowledge?
-LeCun observes that current AI systems rely mainly on text data, while most human knowledge is in fact non-linguistic. The background knowledge we acquire by observing the world far exceeds all publicly available text, so AI systems trained on language alone cannot reach human-level intelligence.
Outlines
🎓 Introduction by Dan Freed, Director of Harvard's Center of Mathematical Sciences and Applications
Dan Freed is the Director of the Center of Mathematical Sciences and Applications at Harvard. Founded by S.T. Yau, the center is devoted to two-way interaction between mathematics and science. It hosts many postdoctoral researchers across mathematics, physics, economics, computer science, and biology, and runs programs, workshops, and conferences, inviting experts for special lectures. This lecture features Yann LeCun, Chief AI Scientist at Meta and professor at New York University, speaking on objective-driven AI.
🤖 Yann LeCun's outlook on the future of AI
LeCun discussed the future of AI rather than its present. He offered proposals rather than results, sharing preliminary findings from the past two years. He emphasized the difference between machine learning and human learning, noting that AI systems lack objective-driven behavior and common sense. He traced the shift in learning paradigms from supervised learning to reinforcement learning to self-supervised learning, and stressed the importance of building systems with intelligence comparable to humans.
🧠 Critique of existing AI systems
LeCun criticized the limitations of existing AI systems: inability to learn new tasks quickly, lack of reasoning and planning abilities, and lack of common-sense understanding of the world. Although AI surpasses humans on some tasks, those abilities are not general. He also touched on AI safety and controllability, and how these properties can be achieved by specifying objectives.
🚀 The rise and applications of self-supervised learning
LeCun explained the concept of self-supervised learning and its successful applications in AI, including language models, image recognition, and speech recognition. He described its different forms, such as masked-word filling in text, and how neural networks trained this way learn to predict missing information.
🧠 The intelligence gap between humans and AI
LeCun discussed the differences between human intelligence and AI, stressing that humans and animals can learn new tasks quickly while AI systems require large amounts of data and time. He noted that most human knowledge is non-verbal, whereas current AI systems are trained mainly on text, limiting what they can learn.
🛠️ Objective-driven AI architecture
LeCun presented the objective-driven AI architecture, highlighting the importance of the perception module, memory, world model, actor, and cost modules. He explained how action sequences are planned by optimizing objectives, and discussed how this architecture differs from existing AI systems.
🧠 Building and learning world models
LeCun explored how to build world models and how to learn the state of the world through observation and prediction. He proposed self-supervised learning from video data and discussed the concept of predictive coding. He also introduced a new approach, the joint-embedding predictive architecture (JEPA), to improve world-model learning.
🤔 Reflections and challenges for future AI
LeCun shared his thinking on building more advanced AI systems, including learning world models from video and applying those models in planning systems. He emphasized the challenge of planning under uncertainty and how exploration mechanisms can refine world models. He also discussed the importance of open-source AI platforms and avoiding innovation-stifling over-regulation.
🌐 Discussion and Q&A
In the Q&A session, LeCun answered questions about the JEPA architecture, world models, control policies, and his views on the EU AI Act. He emphasized the importance of open-source AI and discussed reducing bias through a diversity of AI systems. He also spoke about future trends in AI development and reaching safer, smarter AI systems through gradual progress.
Keywords
💡Artificial Intelligence
💡Objective-Driven AI
💡Self-Supervised Learning
💡Energy-Based Models
💡Hierarchical Planning
💡World Models
💡Knowledge Bias
💡Open-Source AI Platforms
💡Regulation
💡AI Safety
Highlights
Yann LeCun introduced the concept of objective-driven AI, aimed at building AI systems that can understand and pursue goals.
Existing AI systems such as large language models (LLMs) lack understanding of the world and the ability to reason, falling clearly short of how humans and animals learn.
LeCun highlighted the success of self-supervised learning in AI, particularly in natural language processing and image recognition.
He proposed a new AI architecture, the joint-embedding predictive architecture (JEPA), to replace current generative models and address the model-collapse problem.
LeCun discussed self-supervised learning from video, which is essential for building AI systems that can understand and predict complex world states.
He presented a training method called VICReg, which prevents model collapse by maximizing information content and improves generalization.
LeCun stressed the importance of open-source AI platforms, seeing them as key to democracy and to avoiding over-regulation.
He predicted that human-level AI will not arrive as a sudden event but through a gradual process of steadily increasing AI systems' knowledge, capabilities, and safety.
LeCun discussed using AI systems to amplify humanity's collective intelligence, envisioning a future in which everyone interacts with the digital world through AI systems.
He described an energy-based approach to learning that manipulates the energy function directly rather than relying on probability distributions, avoiding the intractable normalizing denominator.
LeCun emphasized the importance of planning algorithms in AI systems, particularly effective planning under uncertainty.
He presented a model called I-JEPA, which excels at tasks such as image recognition and outperforms other self-supervised learning methods.
LeCun discussed training AI systems to understand partial differential equations (PDEs), which matters for AI applications in science and engineering.
He stressed the importance of avoiding bias in AI systems, to be addressed by offering a diversity of AI systems.
LeCun offered a vision of future AI in which AI systems become a repository of human knowledge that everyone can use and customize to their needs.
He warned that excessive AI regulation could stifle open-source AI platforms, which are vital to technological progress and democracy.
He presented a video model called V-JEPA, which learns a world model by predicting missing parts of videos, with potential applications for understanding actions and events in video.
Transcripts
- I'm Dan Freed,
Director of the Center of Mathematical Sciences
and Applications here at Harvard.
This is a center that was founded 10 years ago by S.T. Yau.
It's a mathematics center.
We engage in mathematics,
and in two-way interaction
between mathematics and science.
We have quite a crew of postdocs
doing research in mathematics,
in physics, in economics,
in computer science and biology.
We run some programs, workshops, conferences,
and a few times a year we have special lectures,
and today is one of them.
This is the fifth annual Ding-Shum lecture.
And we're very pleased today to have Yann LeCun,
who's the chief AI scientist at Meta,
and a professor at New York University,
an expert on machine learning in many, many forms.
And today, he'll talk to us about Objective-Driven AI.
- Thank you very much.
Thank you for inviting me, for hosting me.
It seems to me like I give a talk at Harvard
every six months or so, at least for the last few years,
but to different crowds, physics department,
Center for Mathematics,
psychology, everything.
So I'm going to talk obviously about AI,
but more about the future than about the present.
And a lot of it is going to be
basically, proposals rather than results,
but preliminary results on the way to go.
I wrote a paper that I put online about two years ago
on what this program is about.
And you're basically going to hear a little bit of
what we have accomplished in the last two years
towards that program.
If you're wondering about the picture here on the right,
this is my amateurish connection with physics.
I take also photography pictures.
This is taken from my backyard in New Jersey.
It's Messier 51, beautiful galaxy.
Okay, machine learning sucks.
At least compared to what we observe in humans and animals.
It really isn't that good.
Animals and humans can learn new tasks extremely quickly
with very few samples or trials.
They understand how the world works,
which is not the case for AI systems today.
They can reason and plan, which is not the case
for AI systems today.
They have common sense, which is not the case
for AI systems today.
And the behavior is driven by objective,
which is also not the case for most AI systems today.
Objectives means, you set an objective
that you try to accomplish
and you kind of plan a sequence of action
to accomplish this goal.
And AI systems like LLMs don't do this at all.
So the paradigms of learning, supervised learning
has been very popular.
A lot of the success of machine learning
at least until fairly recently
was mostly with supervised learning.
Reinforcement learning gave some people a lot of hope,
but turned out to be so inefficient
as to be almost impractical in the real world,
at least in isolation,
unless you rely much more on something
called self-supervised learning,
which is really what has brought about
the big revolution that we've seen in AI
over the last few years.
So the goal of AI really is,
to build systems that are smart as humans, if not more.
And we have systems that are better than humans
at various tasks today.
They're just not very general.
Hence people call human-level intelligence
artificial general intelligence, AGI.
I hate that term,
because human intelligence is actually not general at all,
it's very specialized.
So I think talking about general intelligence
when we really mean human-level intelligence
is complete nonsense,
but that ship has sailed unfortunately.
But we do need systems that have human-level intelligence,
because in a very near future, or not so near future,
but in the near future, every single one of our interactions
with the digital world will be mediated by an AI system.
We'll have AI systems that are with us at all times.
I'm actually wearing smart glasses right now.
I can take a picture of you guys.
Okay, I can click a button or I can say,
"Hey, Meta, take a picture,"
and it takes a picture.
Or I can ask it a question,
and there is an LLM that will answer that question.
You're not going to hear it, because it's bone conduction,
but it's pretty cool.
So pretty soon we'll have those things
and it will be basically the main way
that we interact with the digital world.
Eventually, those systems will have displays
which this pair of glasses doesn't have,
and we'll use those AI systems all the time.
The way for them to be non-frustrating
is for them to be as smart as human assistants, right?
So we need human-level intelligence
just for reasons of basically product design, okay?
But of course, there's a more kind of interesting
scientific question of really what is human intelligence
and how can we reproduce it in machines
and things like that.
So it's one of those kind of small number of areas
where there are people who want a product
and are ready to pay for the development of it,
but at the same time,
it's a really great scientific question to work on.
And there's not a lot of domains
where that's the case, right?
So, but once we have smart assistants
that have human-level intelligence,
this will amplify humanity's global intelligence,
if you want.
I'll come back on this later.
We're very far from that, unfortunately, okay?
Despite all the hype you hear from Silicon Valley mostly,
the people who tell you AGI is just around the corner.
We're not actually that close.
And it's because the systems
that we have at the moment
are extremely limited in some of the capabilities
that we have.
If we had systems that approached human intelligence,
we would have systems that can learn
to drive a car in 20 hours of practice,
like any 17-year-old.
And we do have self-driving cars,
but they are heavily engineered, they cheat by using maps,
using all kinds of expensive sensors, active sensors,
and they certainly use a lot more than
20 hours of training data.
So obviously, we're missing something big.
If we had human-level intelligence,
we would have domestic robots that could do simple tasks
that a 10-year-old can learn in one shot,
like clearing up the dinner table
and clearing out the dishwasher.
And unlike 10-year-olds,
it wouldn't be difficult to convince them to do it, right?
But in fact, it's not even humans, just what a cat can do.
No AI system at the moment can do in terms of
planning complex sequences of actions
to jump on a piece of furniture or catch a small animal.
So we're missing something big.
And basically, what we're missing is systems
that are able to learn how the world works,
not just from text, but also from let's say video
or other sensory inputs.
Systems that have internal world models,
systems that have memory, they can reason,
they can plan hierarchically like every human and animal.
So that's the list of requirements,
systems that learn world models from sensory inputs,
learning intuitive physics, for example,
which babies learn in the first few months of life.
Systems that have persistent memory,
which current AI systems don't have.
Systems that can plan actions,
so as to fulfill objectives.
And systems that are controllable and safe,
perhaps through the specification of Guardrail objectives.
So this is the idea of objective-driven AI architectures.
But before I talk about this, I'm going to lay the groundwork
for how we can go about that.
So the first thing is that self-supervised learning
has taken over the world.
And I first need to explain
what self-supervised learning is,
or perhaps in a special case.
But really the success of LLMs and all that stuff,
and even image recognition these days,
and speech recognition translation,
all the cool stuff in AI,
it's really due to self-supervised learning,
the generalization of the use of self-supervised learning.
So a particular way of doing it is you take a piece of data,
let's say a text, you transform it or you corrupt it
in some way.
For a piece of text, that would be
replacing some of the words by blank markers, for example.
And then you train some gigantic neural net
to predict the words that are missing,
basically, to reconstruct the original input, okay?
This is how an LLM is trained.
It's got a particular architecture,
but that only lets the system look at words on the left
of the word to be predicted.
But it's pretty much what it is.
And this is a generative architecture,
because it produces parts of the input, okay?
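The corruption step just described can be sketched in a few lines (toy code, not Meta's training pipeline; the tokenizer and the network itself are elided):

```python
import random

def corrupt(tokens, mask_rate=0.3, mask="[MASK]"):
    """Blank out words at random; return the corrupted input and the
    targets a network would be trained to reconstruct."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            corrupted.append(mask)
            targets[i] = tok  # the missing word the net must predict
        else:
            corrupted.append(tok)
    return corrupted, targets

random.seed(0)
x, y = corrupt("the cat sat on the mat".split())
print(x)  # ['the', 'cat', 'sat', '[MASK]', 'the', 'mat']
print(y)  # {3: 'on'}
```

The training loss is then computed only at the masked positions, which is exactly the "predict the words that are missing" objective above.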
There are systems of this type
that have been trained to produce images
and they use other techniques like diffusion models,
which I'm not going to go into.
I played with one, so Meta has one of course.
So you can talk to through WhatsApp and Messenger,
and there's a paper that describes
the system that Meta has built.
And I typed the prompt here, up there in that system,
a photo of a Harvard mathematician
proving the Riemann hypothesis on the blackboard
with the help of an intelligent robot,
and that's what it produces.
I check the proof, it's not correct,
actually, there's symbols here
that I have no idea what they are.
Okay, so, everybody is excited about generative AI
and a particular type of it called auto-regressive LLMs,
and really they're trained very much like I described.
But as I said, the system can only use words
that are on the left of it
to predict a particular word when you train it.
So the result is that once the system is trained,
you can show it a sequence of words
and then ask it to produce the next word.
Okay, then you can inject that next word into the input.
You shift the input by one, okay?
So the stuff that was produced by the system
now becomes part of the input
and you ask it to produce the second word, shift that in,
produce the next, next word,
shift that in, et cetera, right?
So that's called auto-regressive prediction.
It's not a new concept, it's very, very old
in statistics and signal processing,
but in economics actually.
But that's the way an LLM works.
It's auto-regressive.
It uses its own prediction as inputs.
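The shift-in loop just described can be sketched as follows (`next_token` is a stand-in for a trained model's forward pass, invented for illustration):

```python
def next_token(context):
    # toy stand-in for an LLM: deterministically continues a counting pattern
    return str(int(context[-1]) + 1)

def generate(prompt, n_tokens):
    seq = list(prompt)
    for _ in range(n_tokens):
        tok = next_token(seq)  # predict from everything to the left
        seq.append(tok)        # shift the prediction back into the input
    return seq

print(generate(["1", "2", "3"], 4))  # ['1', '2', '3', '4', '5', '6', '7']
```

Note that each generated token is fed back as input, so errors compound — which is exactly the divergence problem discussed later in the talk.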
So those things work amazingly well
for the simplicity conceptually of how they're trained,
which is just predict missing words.
It's amazing how well they work.
Modern ones are trained typically on a few trillion tokens.
This slide is too old now, I should put a zero.
It's not 1 to 2 trillion, it's more like 20 trillion.
So a token is a sub-word unit, really,
it's on average 3/4 of a word.
And there is a bunch of those models
that have appeared in the last few years.
It's not just in the last year and a half
since ChatGPT came out.
That's what made it known to the wider public.
But those things have been around for quite a while.
Things like BlenderBot, Galactica, LlaMA, Llama-2,
Code Llama, which are produced by FAIR,
Mistral and Mixtral from a small French company
formed by former FAIR people,
and then various others, Gemma more recently by Google.
And then proprietary models, Meta AI,
which is built on top of Llama-2,
and then Gemini from Google, ChatGPT, GPT-4, et cetera.
And those things make stupid mistakes.
They don't really understand logic very well,
but if you tell them that A is the same thing as B,
they don't necessarily know that B
is the same as A, for example.
They don't really understand transitivity
of ordering relationships and things like this.
They don't do logic.
You have to sort of explicitly teach them to do arithmetics
or have them to call tools to do arithmetics.
And they don't have any knowledge of the underlying reality.
They've only been trained on text.
Some of them have been trained also on images,
but it's basically by treating images like text.
So it's very limited,
but it's very useful to have those things open sourced
and available to everyone,
because everyone can sort of experiment with them
and do all kinds of stuff.
And there's literally millions of people using Llama
as a basic platform.
So self-supervising is not just used to produce text,
but also to do things like translation.
So there's a system produced by my colleagues
a few months ago called SeamlessM4T.
It can translate 100 languages into a 100 languages.
And it can do text to text, text to speech,
speech to text, and speech to speech.
And for speech to speech,
it can actually translate languages that are not written,
which is pretty cool.
It's also available, you can play with it.
It's pretty amazing.
I mean, that's kind of superhuman in some way, right?
I mean, there's few humans that can translate 100 languages
into 100 languages in any direction,
We actually had a previous system
that could do 200 languages, but only from text,
not from speech.
But there are dire limitations to these systems.
The first thing is that auto-regressive prediction
is basically an exponentially divergent process.
Every time the system produces a word,
there is some chance that this word is
outside of the set of proper answers.
And there is no way to come back to correct mistakes, right?
So the probability that a sequence of words
will be kind of a correct answer to the question
decreases exponentially with the length of the answer,
which is not a good thing.
And there's various kind of technical papers on this,
not by me, that tend to show this.
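The divergence argument can be made concrete under a simplifying independence assumption (the per-token error rate used here is an illustrative number, not a measured one): if each token has probability e of leaving the set of acceptable answers, an n-token answer stays correct with probability (1 - e)^n.

```python
def p_correct(err_per_token, n_tokens):
    # probability that all n_tokens stay in the acceptable set,
    # assuming independent per-token errors
    return (1.0 - err_per_token) ** n_tokens

print(round(p_correct(0.02, 10), 3))   # 0.817
print(round(p_correct(0.02, 100), 3))  # 0.133 -- decays exponentially in n
```

Even a 2% per-token error rate makes long answers mostly wrong, which is the point being made above.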
A lot of criticism also on the fact
that those systems can't really plan.
So the amount of computation that an LLM devotes
to producing a token is fixed, right?
You give it a prompt, it runs through
however many layers it has in the architecture
and then produces a token.
So per token, the amount of computation is fixed.
The only way to get a system
to think more about something
is to trick it into producing more tokens,
which is kind of a very circuitous way
of getting it to do work.
And so there's been a quite a bit of research
on the question of whether those systems
are actually capable of planning,
and the answer is no, they really can't plan.
Whenever they can produce a plan,
it's basically because they've been trained
on a very similar situation and they already saw a plan,
and they basically regurgitate a very similar plan,
but they can't really use tools in new ways, right?
And then there is the last limitation,
which is that they're trained on language.
And so they only know whatever knowledge
is contained in language.
And this may sound surprising,
but most of human knowledge
actually has nothing to do with language.
So they can be used as writing assistants,
giving you ideas if you have white-page anxiety
or something like this.
They're not good so far for producing factual content
and consistent answers,
although they're kind of being modified for that.
And we are easily fooled into thinking
that they're intelligent, because they're fluent,
but really they're not that smart.
And they really don't understand how the world works.
So we're still far from human-level AI.
As I said, most of human and animal knowledge
certainly is non-verbal.
So what are we missing?
Again, I'm reusing those examples of learning to drive
or learning to clear the dinner table.
We are going to have human-level AI,
not before we have domestic robots that can do those things.
And this is called Moravec's paradox,
the fact that there are things that appear complex
for humans like playing chess
or planning a complex trajectory,
and they're fairly simple for computers.
But then things that we take for granted
that we think don't require intelligence,
like what a cat can do,
it's actually fiendishly complicated.
And the reason might be this,
so it might be the fact that
the data bandwidth of text
is actually very low, right?
So a 10 trillion token dataset
is basically, the totality of the publicly available text
on the internet, that's about 10 to the 13 bytes,
or 10 to the 13 tokens, I should say.
A token is typically two bytes.
There's about 30,000 possible tokens in a typical language.
So that's 2 times 10 to the 13 bytes for training an LLM.
It would take 170,000 years for a human to read
at eight hours a day, 250 words per minute
or 100,000 years, if you read fast
and you read 12 hours a day.
Now consider a human child, a 4-year-old child,
a 4-year-old child has been awake 16,000 hours at least,
that's what psychologists are telling us,
which by the way is only 30 minutes of YouTube uploads.
We have 2 million optical nerve fibers going into
our visual cortex, about a million from each eye.
Each fiber maybe carries about 10 bytes per second.
Jaim is going, "What?"
This is an upper bound.
And so the data volume that a 4-year-old has seen
through vision is probably on the order of 10 to the 15 bytes.
That's way more than the totality
of all the texts publicly available on the internet.
50 times more, 50 times more data seen through vision
by the time you're four.
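The back-of-envelope arithmetic above can be checked directly, using the figures as stated in the talk:

```python
seconds_awake = 16_000 * 3600       # 4-year-old, 16,000 waking hours
fibers = 2_000_000                  # optic nerve fibers, both eyes
bytes_per_fiber_per_s = 10          # the stated upper bound

vision_bytes = seconds_awake * fibers * bytes_per_fiber_per_s
text_bytes = 2e13                   # ~10^13 tokens at ~2 bytes each

print(f"{vision_bytes:.2e}")             # 1.15e+15 -- order 10^15 bytes
print(round(vision_bytes / text_bytes))  # ~58, roughly the "50 times" figure
```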
So that tells you a number of things,
but the first thing it tells you is that
we're never going to get to human-level AI
by just training on language, it's just not happening.
There's just too much background knowledge
about the world that we get from observing the world
that current AI systems don't get.
So that leads me to this idea of objective-driven AI system.
What is it that sort of makes humans, for example,
capable of, or animals for that matter,
capable of kind of using tools and objects and situations
in new ways and sort of invent new ways of behaving?
So I wrote a fairly readable,
fairly long paper on this.
You see the URL here, it's not on archive,
because it's on this open review site,
which you can comment,
tell me how wrong this is and everything.
And the basic architecture
is kind of shown here.
So every time you have an arrow,
that means there is signals going through,
but also means there might be gradients going backwards.
So I'm assuming everything in there is differentiable.
And there is a perception module
that observes the world,
turn it into representations of the world,
a memory that might be sort of persistent memory,
factual memory, things like that.
A world model, which is really the centerpiece
of this system,
an actor, and cost modules, objective functions.
The configurator, I'm not going to talk about,
at least not for now.
So here is how this system works.
A typical episode is that the system observes the world,
feed this through this perception system.
Perception system produces some idea
of the current state of the world,
or at least the part of the world
that is observable currently.
Maybe it can combine this with the content of a memory
that contains the rest of the state of the world
that has been previously observed.
Okay, so you get some pretty good idea
where the current state of the world is.
And then the world model, the role of the world model
is to take into account the current state of the world
and hypothesized sequence of actions
and to produce a prediction
as to what is going to be the future state of the world
resulting from taking those actions, okay?
So: state of the world at time t, sequence of actions,
state of the world at time t plus whatever.
Now that outcome, that predicted state of the world
goes into a number of modules,
whose role is to compute basically a scalar objective.
So each of those square boxes here,
the red square boxes or pink ones,
they're basically scalar-valued function
that take representation of the state of the world
and tell you how far the state of the world
is from a particular goal,
objective target, whatever it is.
Or it takes a sequence of predicted states
and it tells you to what extent that sequence of state
is dangerous, toxic, whatever it is, right?
So those are the guardrail objectives.
Okay, so an episode now consists in what the system will do.
The way it operates, the way it produces its output,
which is going to be an action sequence,
is by optimizing the objectives,
the red boxes,
whatever comes out of the red boxes,
with respect to the action sequence, right?
So there's going to be an optimization process
that is going to search for
an action sequence in such a way
that the predicted outcome end state of the world
satisfies the objectives, okay?
So this is intrinsically very different principle
from just running through a bunch of layers
in the neural net.
This is intrinsically more powerful, right?
You can express pretty much any algorithmic problem
in terms of an optimization problem.
And this is basically an optimization problem.
And I'm not specifying here exactly
what optimization algorithm to use.
If the action space, the space
in which we do this inference, is continuous,
we can use gradient-based methods,
because all of those modules are differentiable.
So we can backpropagate gradients
backwards through those arrows
and then update the action sequence
to minimize the objectives and then converge to
an optimal action sequence
for the objective we're looking for,
according to the world model.
If the world model is something like a
discrete-time differential equation or something like this,
we might have to run it for multiple steps.
Okay, so the initial world state
is fed to the world model together with an initial action
that predicts the next state.
From that next state, we feed another action
that predicts the next, next state.
The entire sequence can be fed to the guardrail objectives,
and then the end result
is fed to the task objective, essentially.
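A minimal sketch of this planning-by-optimization loop (my toy example with a trivially differentiable world model, not the implementation from the paper):

```python
import numpy as np

def rollout(s0, actions):
    # toy deterministic world model: each action displaces the state
    s = s0
    for a in actions:
        s = s + a
    return s

def plan(s0, goal, horizon=4, lr=0.1, steps=100):
    # gradient descent on the action sequence so that the predicted
    # end state minimizes the task objective C(s_T) = ||s_T - goal||^2
    actions = np.zeros((horizon, s0.size))
    for _ in range(steps):
        s_T = rollout(s0, actions)
        grad = 2 * (s_T - goal)  # dC/ds_T; backpropagates unchanged here
        actions -= lr * grad     # update every action in the sequence
    return actions

s0, goal = np.zeros(2), np.array([1.0, -2.0])
acts = plan(s0, goal)
print(np.round(rollout(s0, acts), 3))  # [ 1. -2.] -- objective satisfied
```

With a learned, nonlinear world model the same loop applies, except the gradient is obtained by backpropagating through the model's rollout, and guardrail costs would be added to the objective.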
So this is sort of an ideal situation
where the world model is deterministic,
'cause the world might be deterministic:
there's very little uncertainty
about what's going to happen
if I do a sequence of actions to grab this bottle,
I'm in control.
But most of the world is not completely predictable.
So you probably need some sort of latent variable
that you feed to your world model
that would account for all the things
you don't know about the world.
You might have to sample those latent variables
within a distribution to make multiple predictions
about what might happen in the future,
because of uncertainties in the world.
Really, what you want to do ultimately,
is not this type of kind of one level planning,
but you want to do hierarchical planning.
So basically, have a system that can produce multiple
representations of the state of the world,
have multiple level of abstraction,
so that you can make predictions
more or less longterm in the future.
So here's an example.
Let's say I'm sitting in my office at NYU in New York
and I want to go to Paris.
I'm not going to plan my entire trip from New York to Paris
in terms of millisecond by millisecond muscle control.
It's impossible.
It would be intractable in terms of optimization, obviously,
but also it's impossible,
because I don't know the condition that will occur.
Do I have to avoid a particular obstacle
that I haven't seen yet?
Is a street light going to be red or green?
How long am I going to wait to grab a taxi?
Whatever.
So I can't plan everything from the start,
but what I can do is I can do high level planning,
so high level planning at a very abstract level,
I know that I need to get to the airport and catch a plane.
Those are two macro actions, right?
So that determines a sub-goal for the lower level.
How do I get to the airport?
Well, I'm in New York, so I need to go down in the street
and have the taxi.
That sets a goal for the level below.
How do I get to the street? Well, I have to
take the elevator down and then walk out onto the street.
How do I get to the elevator?
I need to stand up from my chair, open the door of my office,
walk to the elevator, push the button.
How do I get up from my chair?
And that I can't describe,
because it's like muscle control and everything, right?
So you can imagine that there is
this hierarchical planning thing going on.
We do this completely effortlessly,
absolutely all the time animals do this very well.
No AI system today is capable of doing this.
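The trip example can be caricatured in a few lines (the plan library below is invented for illustration; the hard, unsolved part LeCun describes is learning such a hierarchy rather than hand-coding it):

```python
# Each abstract action expands into sub-goals for the level below,
# down to primitive actions that need no further planning.
PLANS = {
    "go to Paris": ["go to airport", "catch plane"],
    "go to airport": ["get to street", "hail taxi"],
    "get to street": ["take elevator down", "walk out"],
}

def expand(goal):
    subs = PLANS.get(goal)
    if not subs:
        return [goal]  # primitive action: stop recursing
    plan = []
    for sub in subs:
        plan += expand(sub)
    return plan

print(expand("go to Paris"))
# ['take elevator down', 'walk out', 'hail taxi', 'catch plane']
```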
Some robotic system do hierarchical planning,
but it's hardwired, it's handcrafted, right?
So if you want to have a walking robot
walk from here to the door, down the stairs,
you first have a high level planning of the trajectory,
you're not going to walk directly through here,
you're going to have to go through the stairs, et cetera.
And then at the lower level, you're going to plan the motion
of the legs to kind of follow that trajectory.
But that's kind of handcrafted.
It's not like the system has learned to do this.
It was kind of built by hand.
So how do we get systems to spontaneously learn
the appropriate levels of abstractions
to represent action plans?
And we really don't know how to do this,
or at least we don't have any demonstration of any system
that does this, that actually works.
Okay, so next question is going to be,
if we're going to build a system of this type,
is how are we going to build a world model?
Again, the world model is: state of the world at time t, plus an action,
gives the predicted state of the world at time t plus 1,
whatever the unit of time is.
And the question is, how do humans do this or animals?
So you look at the ages at which babies learn basic concepts.
I stole this chart from Emmanuel Dupoux,
who's a psychologist in Paris.
And basic things like object categories
and things like this are learned pretty early on
without language, right?
Babies don't really understand language at the age
of four months, but they develop
the notion of object categories spontaneously,
things like solidity, rigidity of object,
a difference between animate and inanimate objects.
And then intuitive physics pops up around nine months.
So it takes about nine months for babies
to learn that objects that are not supported,
fall because of gravity,
and more concepts in intuitive physics.
It is not fast, right?
I mean, we take a long time to learn this.
Most of this, at least in the first few months of life,
is learned mostly by observation,
with very little interaction with the world,
'cause a baby until three, four months
can't really manipulate anything or affect the world
beyond their limbs.
So most of what they learn about the world
is mostly observation.
And the question is, what type of learning is taking place
when babies do this?
This is what we need to reproduce.
So there is a natural idea
which is to just transpose the idea
of self-supervised training for text
and use it for video, let's say, right?
So, take a video, call this y, full video
and then corrupt it by masking a piece of it,
let's say the second half of the video.
So call this masked video x,
and then train some gigantic neural net
to predict the part of the video that is missing.
And hoping that if the system predicts
what's going to happen in the video,
probably has good idea of what the underlying nature
of the physical world is.
A very natural concept.
In fact, neuroscientists have been thinking about
this kind of stuff for a very long time.
It's called predictive coding.
And I mean this idea that you learn by prediction
is really very standard.
You do this and it doesn't work.
We've tried; my colleagues and I
have been trying to do this for 10 years,
and you don't get good representations of the world,
you don't get good predictions.
The kind of prediction you get are very blurry,
kind of like the video at the top here
where the first four frames of that video are observed,
the last two are predicted by the neural net
and it predicts very blurry images.
The reason being that it can't really predict
what's going to happen,
so it predicts the average of all the plausible things
that may happen.
And that's a very blurry video.
So it doesn't work.
The solution to this is to basically abandon the idea
of generative models.
That might seem shocking given that this is
the most popular thing in machine learning at the moment.
But we're going to have to do that.
And the solution I'm proposing, at least,
is to replace this by something I call
joint embedding predictive architectures, JEPA.
This is what a JEPA is.
So you take y, you corrupt it, same story
or you transform it in some way.
But instead of reconstructing y from x,
you run both x and y through encoders.
And what you reconstruct
is you reconstruct the representation of y
from the representation of x.
So you're not trying to predict every pixel,
you're only trying to predict a representation
of the input which may not contain all the information
about the input,
may contain only partial information.
So that's the difference between those two architectures.
On the left, generative architectures that reproduce y,
on the right, joint embedding architectures
that embed x and y into a representation space.
And you do the prediction in representation space.
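As an illustrative aside, not from the talk: the difference between the two architectures can be sketched in a few lines of numpy. The encoders, predictor, and decoder here are made-up random linear maps, purely to show where each piece sits; they are not any released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "encoders", "predictor", and "decoder": random linear maps,
# hypothetical and purely for illustration.
d_in, d_rep = 8, 3
Wx = rng.normal(size=(d_rep, d_in))   # encoder for the corrupted input x
Wy = rng.normal(size=(d_rep, d_in))   # encoder for the full input y
P = rng.normal(size=(d_rep, d_rep))   # predictor in representation space
D = rng.normal(size=(d_in, d_in))     # stand-in decoder for the generative side

def generative_loss(x, y):
    # Generative architecture: reconstruct every component of y from x.
    return np.mean((D @ x - y) ** 2)

def jepa_loss(x, y):
    # JEPA: encode both sides and predict the *representation* of y
    # from the representation of x; details that the encoder discards
    # never have to be predicted.
    sx, sy = Wx @ x, Wy @ y
    return np.mean((P @ sx - sy) ** 2)

y = rng.normal(size=d_in)
x = y.copy()
x[4:] = 0.0                           # "corrupt" y by masking half of it
```

The key design choice: the JEPA loss lives in the 3-dimensional representation space, so unpredictable detail in y can simply be dropped by the encoder rather than reconstructed pixel by pixel.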
And there's various flavors
of this joint embedding architecture.
The one on the left is an old idea called Siamese networks,
which goes back to the early nineties; I worked on it then.
And then there is
deterministic and non-deterministic versions
of those JEPA architectures.
I'm not going to go into the details.
The reason why you might need latent variables
in the predictor,
is because it could be that
the world is intrinsically unpredictable
or not fully observable or stochastic.
And so you need some sort of way
of making multiple predictions
for a single observation, right?
So the z variable here basically parametrizes
the set of things you don't know about the world,
that you have not observed in the state of the world.
And that will parametrize the set of potential predictions.
Now there's another variable here called a,
and that's what turns the joint embedding architecture
into a world model.
This is a world model, okay?
x is an observation,
sx is the representation of that observation.
a would be an action that you take.
And then sy is a prediction of
the representation of the state of the world
after you've taken the action, okay?
And the way you train the system
is by minimizing the prediction error.
So y would be the future observation
of the world, right?
x is the past and the present,
y is the future.
You just have to wait a little bit before you observe it.
You make a prediction, you take an action
or you observe someone taking an action,
you make a prediction about what the state,
the future state of the world is going to be.
And then you can compare the actual state of the world
that you observe with the predicted state
and then train the system to minimize the prediction error.
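As a toy illustration, not from the talk: training an action-conditioned world model by minimizing prediction error. The linear "world", the dimensions, and the learning rate are all made up; a real system would predict in representation space, and this toy only avoids the collapse issue discussed next because its target is an observed state, not a learned representation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy world: the true next state is a linear function
# of the current state and the action.
d_s, d_a = 4, 2
A_true = 0.3 * rng.normal(size=(d_s, d_s))
B_true = 0.3 * rng.normal(size=(d_s, d_a))

def world_step(s, a):
    # Stands in for "waiting a little bit and observing the future y".
    return A_true @ s + B_true @ a

# Learned world model, trained by gradient descent on the
# squared prediction error || predict(s, a) - s_next ||^2.
A = np.zeros((d_s, d_s))
B = np.zeros((d_s, d_a))
lr = 0.05
for _ in range(2000):
    s = rng.normal(size=d_s)
    a = rng.normal(size=d_a)
    s_next = world_step(s, a)          # the observed future state
    err = (A @ s + B @ a) - s_next     # prediction error
    A -= lr * np.outer(err, s)         # gradient step on 0.5*||err||^2
    B -= lr * np.outer(err, a)
```

After training, the learned model's one-step predictions match the toy world closely, which is the property a planner would rely on.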
But there's an issue with this,
which is that that system can collapse.
If you only minimize the prediction error,
what it can do is ignore x and y completely,
produce sx and sy that are constant
and then the prediction problem becomes trivial.
So you cannot train a system of this type
by just minimizing the prediction error.
You have to be a little smarter about how you do it.
And to understand how this works,
you have to basically use a concept
called energy-based models,
which you can think of as a weakened version
of probabilistic modeling.
And for the physicists in the room,
the way to go from energies to probabilities
is you take the exponential of minus the energy and normalize.
But if you manipulate the energy function directly,
you don't need this normalization.
So that's the advantage.
So what is an energy-based model?
It's basically an implicit function F of x,y
that measures the degree of incompatibility between x and y.
Whether y is a good continuation for x in the case of video,
whether y is a good set of missing words from x,
things like that, right?
But basically, that function takes the two arguments x and y
and gives you a scalar value that indicates
to what extent x and y are compatible or incompatible.
It gives you zero if x and y are compatible or a small value
and it gives you a larger value if they're not.
Okay, so imagine those two variables as scalars,
and the observations are the black dots.
That's your training data, essentially.
You want to train this energy function
in such a way that it takes low values
on the training data and around,
and then higher value everywhere else.
And what I've represented here are the lines of equal energy,
if you want, the contours of equal energy.
So how are we going to do this?
So, okay, the energy function is not a function
you minimize by training,
it's a function you minimize by inference, right?
If I want to find a y that is compatible with an x,
I search over the space of ys for a value of y
that minimizes F of x,y, okay?
So the inference process does not consist
in running feed-forward through a neural net.
It consists in minimizing an energy function
with respect to y.
And computationally,
this is intrinsically more powerful
than running through a fixed number of layers
in a neural net.
So that gets around the limitation of auto-regressive LLMs
that spend a fixed amount of computation per token.
This way of doing inference
can spend an unlimited amount of resources
figuring out a good y
that minimizes F of x,y,
depending on the nature of F and the nature of y.
So if y is a continuous variable
and your function hopefully is differentiable,
you can minimize it using gradient-based methods.
If it's not, if it's discrete,
then you'll have to do some sort of combinatorial search,
but that would be way less efficient.
So if you can make everything continuous and differentiable,
you're much better off.
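As an illustrative aside, not from the talk: here is inference-as-optimization on a made-up differentiable energy. The energy function is a hypothetical toy, chosen so its minimizer is known; the point is only the mechanism of descending on y.

```python
import numpy as np

def F(x, y):
    # Hypothetical toy energy: low when y is compatible with x
    # (here, "compatible" means close to a fixed nonlinear map of x).
    return np.sum((y - np.tanh(x)) ** 2)

def dF_dy(x, y):
    # Gradient of the toy energy with respect to y.
    return 2.0 * (y - np.tanh(x))

def infer_y(x, steps=200, lr=0.1):
    # Inference is an optimization: gradient descent on F with
    # respect to y, not a single feed-forward pass through a network.
    y = np.zeros_like(x)
    for _ in range(steps):
        y -= lr * dF_dy(x, y)
    return y

x = np.array([0.5, -1.0, 2.0])
y_hat = infer_y(x)
```

You can spend more or fewer descent steps per query, which is exactly the variable-computation property a fixed-depth forward pass lacks.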
And by the way, I forgot to mention something
when I talked about world models,
this idea that you have a world model
that can predict what's going to happen as a consequence
of a sequence of actions,
and then you have an objective you want to minimize
and you plan a sequence of actions
that minimizes the objective.
This is completely classical optimal control.
It's called model predictive control.
It's been around since the early sixties
if not the late fifties.
And so it's completely standard.
The main difference with what we want to do here
is that the world model is going to be learned
from sensory data as opposed to kind of a bunch of equations
you're going to write down for the dynamics
of a rocket or something.
Here we're just going to learn it from sensory data, right?
Okay, so there are really two methods
to train those energy functions
so that they take the right shape.
Okay, so now we're going to talk about learning:
how do you shape the energy surface in such a way
that it gives you low energy on the data points
and high energy outside?
And there are two classes of methods
to prevent this collapse I was telling you about.
So collapse is the situation
where you just minimize the energy
for whatever training samples you have.
And what you get in the end is an energy function
that is zero everywhere.
That's not a good model.
You want an energy function that
takes low energy on the data points
and high energy outside.
So two methods.
Contrastive methods consist in generating
those green flashing points, contrastive samples
and pushing their energy up, okay?
So you back-propagate gradients through the entire system
and tweak the parameters,
so that the output energy goes up for a green point
and goes down for a blue point,
a data point.
But those tend to be inefficient in high dimensions.
So I'm more in favor of another set of methods
called regularized methods,
that basically work by minimizing the volume of space
that can take low energy,
so that when you push down the energy
of a particular region, it has to go up in other places,
because there is only a limited amount
of low energy stuff to go around.
So those are the two classes of methods,
and I'm going to argue for the regularized methods.
But really you should think about two classes of method
to train energy-based models.
And when I say energy-based models,
this also applies to probabilistic models,
which are essentially a special case
of energy-based models.
Okay, there's a particular type of energy-based model
called latent variable models.
These are models
that have a latent variable z that is not given to you
during training or during test,
whose value you have to infer.
And you can do this by minimizing the energy
with respect to z.
So if you have an energy function E of x,y,z,
you minimize it with respect to z,
and then you put that z into the energy function
and the resulting function does not depend on z anymore.
And I call this F of x,y, right?
So having latent variable models
is really kind of a very simple thing in many ways.
If you are a Bayesian or probabilist,
instead of inferring a single value for z,
you infer a distribution.
But I might talk about this later a little bit.
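As a toy illustration, not from the talk: eliminating the latent variable by minimizing over it. The energy E and the discrete set of "modes" for z are made up; z selecting among several plausible predictions mirrors the multiple-predictions idea described earlier.

```python
import numpy as np

# Hypothetical latent-variable energy E(x, y, z): z indexes one of a
# few plausible "modes" of prediction; E is low when y matches the
# prediction selected by z.
modes = np.array([-1.0, 0.0, 1.0])

def E(x, y, z):
    return (y - (x + modes[z])) ** 2

def F(x, y):
    # Eliminate z by minimizing the energy over it; the resulting
    # function F(x, y) no longer depends on z.
    return min(E(x, y, z) for z in range(len(modes)))
```

So a single x gets three candidate predictions, and F(x, y) is low whenever y matches any one of them, which is how the latent variable parametrizes what you didn't observe.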
So depending on which architecture you're going to use
for your system, it may or may not collapse.
And so, if it can collapse,
then you have to use one of those objective functions
that prevent collapse either through contrastive training
or through regularization.
If you're a physicist,
you probably already know that it's very easy
to turn energies into probability distributions.
You compute P of y given x,
if you know the energy of x and y,
you take the exponential of minus some constant times F of x,y
and then you normalize by the integral
over all the space of y, of the numerator.
So you get a normalized distribution over y
and that's a perfectly fine way
of parameterizing a distribution if you really want.
The problem of course, in a lot of statistical physics
is that the denominator
called the partition function is intractable.
And so here I'm basically just circumventing the problem
by directly manipulating the energy function
and not worrying about the normalization.
But basically, this idea of pushing down,
pushing up the energy, minimizing the volume of stuff
that can take low energy,
that plays the same role of what would be normalization
in a probabilistic model.
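As an illustrative aside, not from the talk: the energy-to-probability recipe on a toy, discretized y, where the partition function is actually tractable. The energy, the grid, and beta are all made up.

```python
import numpy as np

def F(x, y):
    # Hypothetical toy energy, low when y is close to x.
    return (y - x) ** 2

# Discretize y so the partition function is tractable in this sketch.
ys = np.linspace(-3.0, 3.0, 601)
beta = 2.0
x = 0.7

unnorm = np.exp(-beta * F(x, ys))    # exponential of minus the energy
Z = unnorm.sum()                     # the partition function (tractable here)
p = unnorm / Z                       # normalized distribution P(y|x)
```

In one dimension summing over a grid is easy; in high dimensions Z becomes the intractable integral mentioned above, which is the reason for working with the energy directly.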
I'm not going to go through this, it's in our chart,
you can take a picture if you want.
This is basically a list of all kinds of classical methods
as to whether they're contrastive or regularized.
All of them can be interpreted
as some sort of energy-based model
that is either one or the other.
And the idea that is used in LLMs,
which is basically a particular version
of something called a denoising auto-encoder,
is a contrastive method.
So the way we train LLMs today
is contrastive, okay?
We take a piece of data, we corrupt it
and we train the system to reconstruct
the missing information.
That's actually a special case
of something called a denoising auto-encoder,
which is a very old idea
that's been revived multiple times since then.
And this framework can allow us to interpret
a lot of classical models like K-means, sparse coding,
things like that.
But I don't want to spend too much time on this.
You can do probabilistic inference,
but I want to skip this.
This is for these free energies
and variational free energies and stuff like that.
But here are the recommendations I'm making:
abandon generative models
in favor of those joint embedding architectures,
abandon probabilistic modeling
in favor of these energy-based models,
abandon contrastive methods
in favor of those regularized methods.
And I'm going to describe one in a minute
and also abandon reinforcement learning,
but I've been saying this for 10 years.
So those are the four most popular things
in machine learning today,
which doesn't make me very popular.
So how do you train a JEPA with regularized methods?
So there's a number of different methods,
I'm going to describe two classes.
One for which we really understand why it works,
and another that works really well,
but we don't understand why.
So the first class of method
consists in basically preventing this collapse
I was telling you about
where the output of the encoder is constant
or carries very little information about the input.
So what we're going to do is have a criterion during training
that tries to maximize the amount of information
coming out of the encoders to prevent this collapse.
And the difficulty with this is that
to maximize the information content
coming out of a neural net,
we would need some sort of lower bound
on information content of the output
and then push up on it, right?
The bad news is that we don't have lower bounds
on information content, we only have upper bounds.
So we're going to need to cross our fingers,
take an upper bound on information content, push it up,
and hope that the actual information content follows.
And it kind of works, it actually works really well,
but it's not well-justified theoretically for that reason.
How do we do this?
So first thing we can do is make sure that
the variables that come out of the encoders
are not constant.
So over a batch of samples, you want each variable
of the output vector of the encoder
to have some non-zero variance, let's say one, okay?
So you have a cost function
that says: I really want the variance,
or the standard deviation, to be larger than one.
Okay, still the system can produce a non-informative output
by making all the outputs equal or highly correlated.
Okay, so you have a second criterion that says,
in addition to this, I want the different components
of the output vector to be uncorrelated.
So basically, I want a criterion
that says I want to bring the covariance matrix
of the vectors coming out of the encoder
as close to the identity matrix as possible,
but that's still not enough,
because you can get uncorrelated variables
that are still very dependent.
So there's another trick which consists in
taking the representation vector sx
and running it through a neural net
that expands the dimension in a nonlinear way
and then decorrelating those variables,
and we can show that under certain conditions
this actually has the effect of
making pairs of variables independent.
Okay, not just uncorrelated.
There's a paper on this on arXiv.
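As an illustrative aside, not from the talk: the two criteria just described, the hinge on the per-dimension standard deviation and the push toward an identity covariance matrix, written out directly. This is a sketch in the spirit of those criteria; the exact forms, weightings, and the third (invariance) term are in the paper.

```python
import numpy as np

def variance_loss(Z, eps=1e-4):
    # Hinge on the per-dimension standard deviation over the batch:
    # push each std up toward 1 so the output can't collapse
    # to a constant.
    std = np.sqrt(Z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, 1.0 - std)))

def covariance_loss(Z):
    # Penalize the off-diagonal entries of the batch covariance
    # matrix, pushing it toward the identity (decorrelated dimensions).
    n, d = Z.shape
    Zc = Z - Z.mean(axis=0)
    C = (Zc.T @ Zc) / (n - 1)
    off_diag = C - np.diag(np.diag(C))
    return float(np.sum(off_diag ** 2) / d)

rng = np.random.default_rng(0)
Z_good = rng.normal(size=(256, 16))   # spread-out, roughly decorrelated batch
Z_collapsed = np.ones((256, 16))      # constant output: a collapsed encoder
```

A collapsed batch pays the full variance penalty while a healthy batch pays almost none, which is exactly the anti-collapse pressure described above.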
Okay, so now we have a way of training one of those
joint embedding architectures to prevent collapse.
And it's really a regularized method.
We don't need to have contrastive samples,
we don't need to kind of pull things away from each other
or anything like that.
We just train it on training samples.
And we have this criterion.
Once we've trained that system,
we can take the representation sx learned by the system,
and then feed this to a subsequent classifier
that we can train supervised for a particular task.
For example, object recognition, right?
So we can train a linear classifier
or something more sophisticated
and I'm not going to bore you with the results,
but every row here is a different way
of doing self-supervised learning.
Some of them are generative,
some of them are joint embedding.
They use different types of criteria,
different types of distortions and corruption
for the images.
And the top systems give you 70% correct on ImageNet,
when you train only the head on ImageNet,
you don't fine-tune the entire network,
you just use the features.
And what's interesting about self-supervised learning
is that those systems work really well.
They don't require a lot of data
to basically learn a new task.
So it's really good for transfer learning
or multitask learning or whatever it is.
You learn generic features
and then you use them as input to kind of a subsequent task,
with sort of variations of this idea.
So this method is called VICReg,
and that means
Variance, Invariance, Covariance Regularization.
Variance and covariance,
because of this covariance matrix criterion.
Invariance, because we want the representations
of the corrupted and uncorrupted inputs to be identical.
There are versions of this that work for object detection
and localization and stuff like that.
But there is another set of methods
and those, I have to admit that
I don't completely understand why they work.
There are people like Yonglong Tian at FAIR
and Surya Ganguli at Stanford
who claim they understand;
they'll have to explain this to me,
because I'm not entirely convinced.
And those are distillation methods.
So you have two encoders,
they have to be more or less identical
in terms of architectures.
Actually exactly identical,
they need to have the same parameters.
And you share the parameters between them.
So there is something called weight EMA.
EMA means exponential moving average.
So the encoder on the right
gets weights that are basically a running average
with exponential decaying coefficient
of the weight vectors produced by the encoder on the left
as learning takes place.
So it's kind of a smoothed-out version of the weights.
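As an illustrative aside, not from the talk: the weight EMA just described is a one-line update. The scalar training trajectory below is made up, purely to show the target weights tracking the online weights with a lag.

```python
import numpy as np

def ema_update(target_w, online_w, decay=0.99):
    # The target encoder's weights are an exponential moving average
    # of the online encoder's weights: a smoothed-out, lagging copy.
    return decay * target_w + (1.0 - decay) * online_w

online = np.zeros(4)   # stand-in weight vector of the "left" encoder
target = np.zeros(4)   # weight vector of the "right" (EMA) encoder
for step in range(1000):
    online = online + 0.01          # stand-in for one gradient update
    target = ema_update(target, online)
```

With decay 0.99, the target trails the online weights by roughly a hundred updates' worth of motion, which is the smoothing that these distillation methods rely on.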
And Surya and Yonglong
have explanations of why this
prevents the system from collapsing.
I encourage you to read that paper if you can figure it out.
And there are a number of different methods
using this self-supervised pre-training
that work really well.
Older methods like Bootstrap Your Own Latents from DeepMind,
SimSiam by FAIR,
and then DINOv2, which is a one-year-old method
by colleagues at FAIR in Paris,
which is probably the best system
that produces generic features for images.
If you have a vision problem, you need some generic features
to be fed to some classifiers.
So you can train it with a small amount of data,
using DINOv2.
Today, that's the best thing we have.
And it produces really nice features,
really good performance
with very small amounts of data for all kinds of things.
You can train it to do segmentation,
to do depth estimation, to do object recognition,
to estimate the height of the tree canopy
over the entire Earth,
to detect tumors in chest x-rays,
all kinds of stuff.
That is open source,
so a lot of people have been using it
for all kinds of stuff.
It's really cool.
A particular instantiation
of those distillation methods
is something called I-JEPA.
So this is a JEPA architecture
that has been trained using this distillation method,
but it's different from DINOv2.
And this works extremely well,
in fact, better than DINOv2 for the same amount of training,
and it's very fast to train as well.
So this is the best method we have
and it compares very favorably to competing methods that use
generative models that are trained by reconstruction.
So there's something called MAE, masked auto-encoder,
which is the hollow squares here on this graph.
Maybe I should show this one.
So this is a method also developed at Meta at FAIR,
but it works by reconstructing a photo, right?
So you take a photo, you mask some parts of it
and you train what amounts to auto-encoder
to reconstruct the parts that are missing.
And it's very difficult to predict
what's missing in an image,
because you can have complicated textures
and stuff like that.
And in fact, this system is much more expensive to train
and it doesn't work as well as
these joint embedding methods, right?
So the one lesson from this talk is:
generative methods are bad for images; they're good for text
but not so good for images.
Whereas joint embedding methods are good for images,
not yet good for text.
And the reason is images
are high-dimensional and continuous.
So generating them is actually hard.
It's possible to produce image generation systems
that produce nice images,
but they don't produce good
internal representations of images.
On the other hand, generative models for text work,
because text is discrete.
So language is simple, because it's discrete, essentially.
We have this idea that language
is kind of the most sophisticated stuff,
because only humans can do it.
In fact, it's simple.
The real world is really what's hard.
So I-JEPA works really well for all kinds of tasks
and people have used this for all kinds of stuff.
There's some mathematics to do here,
which I'm going to have to skip,
to get to V-JEPA.
So this is a version of I-JEPA but for video
that was put online fairly recently.
And there the idea is you take a piece of video,
you mask part of it
and again you train one of those
joint embedding architectures
to basically predict the representation
of the full video from the representation
of the partially masked or corrupted video.
And this works really well in the sense that
when you take the representation learned by that system,
you feed it to a classifier
to basically classify the action
that is taking place in the video.
You get really good performance
and you get better performance than any other
self-supervised learning technique.
When you have a lot of training data,
it doesn't work as well as purely supervised
with all kinds of tricks and data augmentation,
but it comes really close
and it doesn't require labeled data, or not much of it.
So that's kind of a big breakthrough.
The fact that we can train a system to learn from video
in a self-supervised manner matters,
because now we might be able to use this
to learn world models, right?
Where the masking of the video is:
we take a video, mask the second half of it,
and ask the system to predict what's going to happen,
feeding it an action that is being taken in the video.
If you have that, you have a world model.
If you have a world model,
you can put it in a planning system.
If you have a system that can plan,
then you might have systems that are a lot smarter
than current systems and they might be able to plan actions,
not just words.
They're not going to predict auto-regressively anymore.
They're going to plan their answer,
kind of like what we do when we speak:
we don't produce one word after the other without thinking.
We usually kind of plan what we're going to say in advance,
at least some of us do.
So this works really well in the sense that
we get really good performance
on lots of different types of video
for classifying the action and various other tasks,
better than basically anything else
that people have tried before.
Certainly better than any system
that has been trained on video.
And the pre-training here
is on a relatively small amount of video actually;
it's not a huge dataset.
So this is reconstructions of missing parts of a video
by that system
and it's done by training a separate decoder, right?
So it's not part of the initial training,
but in the end we can use the representation
as input to a decoder
that we trained to reconstruct
the part of the image that's missing.
And these are the results of completion:
basically the entire middle of the image is missing,
and the system is filling in things
that are reasonable.
It's a cooking video, and there's a hand
and a knife and some ingredients.
Okay, there is another topic I want to talk about,
because I know there are mathematicians and physicists
in the room.
This is a recent paper, a collaboration between
some of us at FAIR
and Bobak Kiani,
who is a student at MIT with Seth Lloyd
and a bunch of people from MIT.
So this system is basically using this idea
of joint embedding to learn something about
partial differential equations
that we observe through a solution.
So look at the thing at the bottom.
We have a PDE, Burgers' equation.
What you see are space-time diagrams, basically,
of a solution of that PDE.
And what we're going to do is we're going to take two windows,
separate windows on the solution of that PDE, okay?
And of course, the solution depends on
the initial condition.
You're going to get different solutions
for different initial conditions, right?
So we're going to take two windows over two different solutions
to that PDE, and we're going to do a joint embedding.
So we're going to train an encoder to produce representations,
so that the representation for one piece of the solution
can be predicted from the representation of the other piece.
And what the system ends up doing in that case
is basically predicting, or representing,
the coefficients of the equation that is being solved, right?
The only thing that's common between one region
of the space-time solution of the PDE
and another region,
is that it's the same equation with the same coefficient.
What's different is the initial condition.
But the equation itself is the same, right?
So the system basically discovers some representation
and when we train now a supervised system
to predict the coefficient of the equation,
it actually does a really good job.
In fact it does a better job than if we train it
completely supervised from scratch.
So that's really interesting.
There are various tricks in this,
for transformations of the solution
according to invariance properties of the equation,
which I'm not going to go into,
but it's using the VICReg procedure I described earlier.
So we applied this to a bunch of
different PDEs: Kuramoto-Sivashinsky,
where we try to identify some of the coefficients
in the equation.
Navier-Stokes, we try to identify the buoyancy parameter
in Navier-Stokes, which is a constant term at the end.
And this works better again
than just training a supervised system
to predict what the buoyancy is from observing the behavior.
So this is pretty cool.
I mean, there are already papers that have
recycled this idea in other contexts.
Okay, so that's end of the technical part.
For the conclusion, we have a lot of problems to solve,
some of which are mathematical,
like the mathematical foundations of energy-based learning
I think are not completely worked out.
The idea that the dependency between sets of variables
is represented by an energy function
that takes low energy on the data manifold
and high energy outside, it's a very general idea.
It breaks the whole kind of hypothesis
of probabilistic modeling.
And I think we need to understand better,
what are the properties of such things?
We need to work on JEPA architectures
that have regularized latent variables.
I didn't talk much about this,
but that's kind of a necessity.
Planning algorithms in the presence of uncertainty,
hopefully using gradient-based methods,
learning cost modules to guarantee safety, for example,
planning in the presence of inaccuracies
of the world model.
If your world model is wrong,
you're going to plan wrong sequences of actions,
because you're not going to predict the right outcomes.
So how do you deal with that?
And then exploration mechanisms
to adjust the world model for regions of the space
where the system is not very good.
So we're working on self-supervised learning from video,
as I told you,
and on agents that can reason and plan, driven by objectives,
according to the objective-driven architecture I showed,
but for text as well as for robotic control.
And then trying to figure out if we can do this
sort of hierarchical planning idea
I was telling you about earlier.
Let's see.
So in this future
where every one of our interactions is mediated
by AI systems, what that means is that
AI systems will essentially constitute a repository
of all human knowledge that everyone will use,
sort of like a Wikipedia you can talk to,
one that possibly knows more than Wikipedia.
Every one of those systems is necessarily biased, okay?
They're trained on data
that is available on the internet.
There's more data in English than in any other language.
For a lot of languages there is very little data.
So those systems are going to be biased, necessarily.
And we've seen pretty dramatic examples recently
with the Gemini system from Google,
where they spent so much effort to make sure
the system was not biased
that it ended up biased in another obnoxious way.
And so bias is inevitable.
And it's the same as in the media and the press.
Every journal, every news magazine, every newspaper is biased.
The way we fix this is we have a high diversity
of very different magazines and newspapers.
We don't get our information from a single system.
We have a choice between various biased systems, basically.
This is what is going to have to happen for AI as well.
We're not going to have unbiased AI systems.
So the solution is to have lots and lots of biased systems,
biased for your language, your culture, your value system,
your centers of interest, whatever it is.
So what we need is a very simple platform
that allows basically anyone to fine-tune
an open source AI system,
an open source LLM, for their own language, culture,
value system, centers of interest.
Basically, a wiki, but not a wiki
where you write articles:
a wiki where you fine-tune an LLM.
That's the future of AI that I see, that I want to see.
A future in which all of our interactions are mediated
by AI systems produced by three companies
on the west coast of the U.S.
is not a good future,
and I work for one of those companies,
but I'm happy to say that Meta
has completely bought this idea that AI platforms
need to be open and is committed to open sourcing
the various incarnations of Llama.
The next one being Llama-3 coming soon.
So open source AI platforms are necessary.
They're necessary for even the preservation of democracy
for the same reason that diversity of the press
is necessary for democracy.
So one big danger is that open source AI platforms
will be regulated out of existence,
because of the fact that some people think AI is dangerous.
And so they say you can't put AI in the hands of everyone.
It's too dangerous.
You need to regulate it.
And that will kill open source AI platforms.
I think that's much more dangerous.
The dangers of this are much, much higher
than the dangers of putting AI in the hands of everybody.
And how long is it going to take for us
to reach human-level AI?
It's not going to be next year, like some people say,
or before the end of the year; that's BS.
It's not going to be next year,
despite what you might hear from OpenAI.
It's probably not going to be in the next five years.
It's going to take a while before the program I described here
works to the level that we want.
And it's not going to be an event.
It's not going to be "AGI achieved internally" or anything.
It's not going to be like an event where all of a sudden
we discover the secret to AGI
and all of a sudden we have a super-intelligent system.
It's not going to happen that way.
We're going to build systems of the type I describe,
make them bigger and bigger,
teach them more and more stuff,
put in more and more guardrails and objectives
and stuff like that,
and work our way up, so that
as they become smarter and smarter,
they also become more secure and safe and well-behaved,
right?
So it's not going to be an event, it's going to be progressive
motion towards more and more powerful
and safer AI systems.
And we need contributions from everyone,
which is why we need open source models.
And I'll stop here.
Thank you very much.
- Thank you for a wonderful thought-provoking talk.
We have time for a few questions.
- [Audience Member] Hello, yeah,
I've been trying to figure out
why you put an encoder in front of y,
because you're getting the representation
of the output image, so you're losing information,
and does that mean your architecture is
only as good as your encoder?
So I couldn't figure out why you put it that way.
So can you help me to understand?
- Sure, I have two answers to this.
Are you a physicist by any chance?
- Computer scientist. - Computer scientist, okay?
But there are physicists in the room, okay?
But this is very basic physics.
If you want to predict the trajectory of planets,
most of the information about any planet
is completely irrelevant to the prediction, right?
The shape, the size, the density, the composition,
all of that is completely irrelevant.
The only things that matter are six variables,
which are position and velocity, right?
And you can predict the trajectory.
So the big question in making predictions
and planning and stuff like that
is what is the appropriate information
and the appropriate abstraction level
to make the prediction you want to make?
And then you eliminate everything else,
because if you spend all of your resources
trying to predict those things that are irrelevant,
you are completely wasting your time, right?
So that's the first answer.
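The planets example can be put in a few lines of code. This is a toy sketch, not anything from the talk's slides: a simplified orbit simulation (with G times the central mass set to 1, and naive Euler integration) in which the six state variables, position and velocity, are the only inputs the predictor needs.

```python
import numpy as np

# Toy illustration of abstraction for prediction: to predict a planet's
# trajectory, only position and velocity (6 numbers) matter. Shape, size,
# density, composition are irrelevant to the dynamics.
# Units are simplified (G*M = 1) purely for the demo.

def step(pos, vel, dt=1e-3, gm=1.0):
    """One Euler step of a body orbiting a central mass at the origin."""
    r = np.linalg.norm(pos)
    acc = -gm * pos / r**3          # inverse-square gravity
    return pos + dt * vel, vel + dt * acc

# Start on a circular orbit: radius 1, speed 1 (so v^2/r = GM/r^2).
pos, vel = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
for _ in range(1000):
    pos, vel = step(pos, vel)

# The orbital radius stays close to 1: six abstract variables fully
# determine the motion, and nothing else about the planet is needed.
print(np.linalg.norm(pos))
```

Everything the predictor does not need, the texture, the composition, the appearance, simply never enters the state.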
The second answer is imagine that the video
I'm training the system on,
is a video of this room where I point the camera this way
and I pan slowly and I stop right before you.
And I ask the system to predict what's going to happen next
in the video.
The system will probably predict that
the panning is going to continue.
There's going to be people sitting,
and at some point there's going to be a wall.
There's absolutely no way it can predict what we look like
or what anybody will look like.
No way it's going to predict how many steps
there are in the stairs.
No way it's going to predict the precise texture
of the wall or the carpet, right?
So there's all kinds of details here
that are completely unpredictable,
yet if you train a generative system to predict y,
it's going to have to actually devote a lot of resources
to predict those details, right?
So the whole question of machine learning,
and to some extent science
is what is the appropriate representation
that allows you to make predictions that are useful, right?
So JEPA gives you that, generative models don't.
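The contrast can be illustrated with a toy experiment. This is a hedged sketch, not Meta's actual JEPA implementation: the future frame y is built as a predictable part plus pure unpredictable detail, and we compare predicting y itself (the generative route) against predicting a representation Enc(y) that has discarded the detail. The dimensions, data, and linear "modules" are all invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the future frame y is part predictable from x, part
# pure unpredictable detail (wall texture, faces in the room, ...).
N, K, D = 512, 4, 64
shared = rng.normal(size=(N, K))
x = shared + 0.01 * rng.normal(size=(N, K))            # observation of the past
y = np.hstack([shared, rng.normal(size=(N, D - K))])   # predictable + noise

def fit_linear(a, b):
    """Least-squares linear map a -> b (a stand-in for a trained predictor)."""
    return np.linalg.lstsq(a, b, rcond=None)[0]

# Generative route: predict y itself. The predictor must spend its capacity
# on the unpredictable coordinates, and its error is dominated by them.
pixel_err = np.mean((x @ fit_linear(x, y) - y) ** 2)

# JEPA route: predict Enc(y). Here the encoder stands in for a learned
# representation that has discarded the unpredictable detail.
def enc(v):
    return v[:, :K]                  # keep only the predictable features

latent_err = np.mean((x @ fit_linear(x, enc(y)) - enc(y)) ** 2)

print(pixel_err, latent_err)         # latent error is far smaller
```

In a real JEPA the encoder is learned jointly with the predictor, which is why a regularizer (for instance variance/covariance terms, as in VICReg-style training) is needed to keep the representation from collapsing to a constant.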
- [Morris] Hello, my name is Morris
and I'm a PhD student at MIT and I noticed that
your JEPA architecture looks a lot like
the Kalman filter: you have a sequence of measurements,
and even with a Kalman filter,
there is often a problem,
which is that you need a condition called observability.
And you have a very clever way
of getting around this condition of observability,
because in your latent space,
you can come up with a clever regularizer
for the things that you cannot see.
Does the world model help in coming up
with these regularizers?
And secondly, your control would probably come in
on the latent state.
Is that how you think it would work out in the end?
Or, I mean, yeah, that's my question.
- Yeah, okay.
Actually, it's not like a Kalman filter.
In a Kalman filter, the encoders are reversed,
they're not encoders, they're decoders.
So I'm looking for the general picture here
of where I had the world model.
Yeah, this one is probably the best.
Okay, so in a Kalman filter,
first of all, you get a sequence of observations,
and here, the observation goes into an encoder
that produces the estimate of the state.
In a Kalman filter it's actually the other way around.
You have a hypothesized state
and you run it into a decoder that produces the observation.
And what you do is you invert. - From the measurements.
- Right, right, I mean you're learning a hidden dynamics.
So in that sense it's similar,
but then you are generating the observation
from the hidden states, right?
So it's a bit reverse.
And then there is a constraint,
at least in traditional Kalman filters,
where the dynamics are linear.
Then there are extended Kalman filters where it's non-linear,
and then a particular way to handle the uncertainties,
so you assume Gaussian distributions
for everything, basically, right?
But yeah, there is a connection,
because there is a connection with optimal control
and Kalman filters are kind of the thing in optimal control.
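The structural point can be made concrete with a minimal linear Kalman filter. All matrices and noise levels below are invented for the demo; what matters is the direction of the mapping: a decoder H takes the hidden state to the observation, the dynamics F are linear, and uncertainty is Gaussian throughout.

```python
import numpy as np

# Minimal linear Kalman filter for a constant-velocity target.
# Note the generative direction: H maps state -> observation (a decoder),
# the opposite of JEPA's observation -> representation encoder.

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # dynamics: position += velocity
H = np.array([[1.0, 0.0]])               # decoder: we only observe position
Q = 1e-4 * np.eye(2)                     # process noise covariance
R = np.array([[0.25]])                   # measurement noise covariance

def kalman_step(xhat, P, z):
    """One predict/update cycle given measurement z."""
    # Predict: push state and covariance through the linear dynamics.
    xhat, P = F @ xhat, F @ P @ F.T + Q
    # Update: probabilistically invert the decoder via the Kalman gain.
    S = H @ P @ H.T + R
    Kg = P @ H.T @ np.linalg.inv(S)
    xhat = xhat + Kg @ (z - H @ xhat)
    P = (np.eye(2) - Kg @ H) @ P
    return xhat, P

rng = np.random.default_rng(0)
true_pos = np.arange(50, dtype=float)            # ground truth: velocity 1
zs = true_pos + rng.normal(scale=0.5, size=50)   # noisy position measurements

xhat, P = np.zeros(2), np.eye(2)
for z in zs:
    xhat, P = kalman_step(xhat, P, np.array([z]))

print(xhat)   # estimated [position, velocity], near the true [49, 1]
```

The observability condition the questioner mentions corresponds here to whether the pair (F, H) lets the hidden velocity be recovered from position measurements alone; in a JEPA-style latent space, a regularizer plays the role of constraining what the measurements cannot pin down.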
- [Audience Member] Hi, so I have a bit
of a less technical question,
but given that you're also a citizen of France
and broadly the EU,
and given all that you said about sort of
having the open models and sort of potentially
one of the main problems for these systems
being sort of regulatory capture or legislative problems,
what do you think about the new EU AI Act
and does that kind of influence you think
or might influence how Europe is going to proceed
with kind of R&D and AI development
and potentially Meta's presence in France?
- Well, so there, there are good things and bad things
in the EU AI Act.
The good things are things like, okay,
you can't use AI to give a social score to people,
that's a good idea.
You can't put cameras that do face recognition
in public spaces
unless there are special conditions,
like the Paris Olympic Games or whatever.
So, I mean, those are good things
for privacy protection and stuff like that.
What is less good is that at the last minute
there were discussions
where they started putting provisions inside of it
for what they call frontier models, right?
So, powerful, this is because of ChatGPT,
let's say if you have a powerful model,
it's potentially dangerous.
So we need to regulate research and development,
not just regulate products,
but regulate research and development.
I think that's completely wrong.
I think this is very destructive depending on
how it's applied.
I mean, it might be applied in ways that,
in the end are benign,
but it could be that
they might be kind of a little too tight about it.
And what that is going to cause is that companies like Meta
are going to say, well, we're not going to open source
to Europe, right?
We're going to open source the rest of the world,
but if you're from Europe, you can't download it.
And that would be really, really bad.
Some companies are probably going to move out of Europe.
So I think we're at a fork in the road
where things could go bad.
I mean, there's a similar phenomenon in the U.S.
with the executive order of the White House,
where it could go one way or the other
depending on how it's applied.
In fact, the NTIA had a request for comment,
and Meta submitted one
and said, make sure that you don't legislate open source AI
out of existence,
because the reason to do this would be imaginary risks,
existential risks that are really completely
crazy, nuts, pardon my French.
But the idea somehow that,
all of a sudden, you're going to discover the secret to AGI
and a super-intelligent system
is going to take over the world within minutes,
it's just completely ridiculous.
This is not how the world works at all.
But there are people with a lot of money
who have funded a lot of think tanks
that have lobbied, or basically lobbied, governments
into thinking this.
And so governments have organized meetings,
they're like, "Are we going to all be dead next year?"
Or stuff like that.
So you have to tell them, first,
we're far away from human-level intelligence; don't believe
the guys, like Elon,
who tell you that it's just around the corner.
And second, we can build them in ways that are non-dangerous
and it's not going to be an event.
It's going to be gradual and progressive.
And we have ways to build those things in a safe way.
Don't rely on the fact that
current LLMs are unreliable and hallucinate.
Don't project this to future systems.
Future systems will have completely different architecture
perhaps of the type that I described.
And that makes them controllable,
because you can put
guardrails and objectives and everything.
So discussing the existential risk of AI systems today,
super-intelligent system today
is insane, because they haven't been invented yet.
We don't know what they would look like.
It's like discussing the safety of transatlantic flight
on a jet airliner in 1925.
The turbojet was not invented yet,
and it didn't happen in one day, right?
It took decades before,
now you can fly halfway around the world in complete safety
with a two-engine jet plane.
That's amazing, incredibly safe, it took decades.
It's going to be the same thing.
- So that's a good place to wrap it up.
So let's thank Yann again for a wonderful talk.
- Thank you.