Yann LeCun | Objective-Driven AI: Towards AI systems that can learn, remember, reason, and plan

Harvard CMSA
1 Apr 2024, 76:53

Summary

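TL;DR: In his Harvard lecture, Yann LeCun discusses the future of artificial intelligence, in particular objective-driven AI systems. He stresses how limited current AI systems are, compared with humans and animals, at learning and understanding the world, and proposes a new AI architecture built around self-supervised learning and energy-based models. LeCun argues that open-source AI platforms are essential for democracy and technological innovation, and he is optimistic about future intelligent assistants and about amplifying humanity's global intelligence.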

Takeaways

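  • 🌟 The future of AI should not rest solely on today's large language models (LLMs); it should move toward objective-driven AI architectures.
  • 🚀 Current AI systems fall well short of humans and animals at learning and understanding the world; for example, they lack common sense and objective-driven behavior.
  • 📈 The success of self-supervised learning in text, image recognition, speech translation, and other areas opens new possibilities for AI.
  • 🧠 Humans and animals learn new tasks quickly and understand how the world works; existing AI systems are far from that ability.
  • 🔍 Smarter AI requires systems that learn world models from sensory input, have persistent memory, and can plan hierarchically.
  • 🛠️ Current AI systems are limited in logical understanding, common-sense reasoning, and real-world knowledge; overcoming this requires new learning paradigms.
  • 🔗 Future AI systems will mediate our interactions with the digital world, so they need human-level intelligence to be useful rather than frustrating.
  • 🌐 Open-source AI platforms are essential for diversity and broad access to AI, and help prevent a handful of companies from monopolizing the technology.
  • 📚 Educational and research institutions need to study energy-based models and energy-based learning in depth to develop more efficient and safer AI systems.
  • 📈 With self-supervised learning and joint embedding predictive architectures (JEPA), AI systems can learn useful feature representations without large amounts of labeled data.
  • 🔑 To keep AI systems safe and controllable, guardrail objectives and safety control mechanisms need to be built into the architecture.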

Q & A

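  • What main points about the future of machine learning does Yann LeCun make in the talk?

    -Yann LeCun mainly discusses where machine learning is headed, in particular objective-driven AI. He stresses that current AI systems still lag far behind humans and animals in learning efficiency and in understanding the world. He argues that we need systems with human-level intelligence, and explores the challenges and possible routes to get there, including self-supervised learning, energy-based models, and predictive architectures.

  • What shortcomings does Yann LeCun see in existing AI systems' learning efficiency and understanding of the world?

    -Yann LeCun argues that existing AI systems are far less efficient learners than humans and animals. Humans and animals can learn new tasks quickly from a few samples or trials, whereas AI systems need far more data and compute. Humans and animals also understand how the world works, can reason and plan, and have common sense, all of which current AI systems lack.

  • What are the key components of the Objective-Driven AI architecture Yann LeCun proposes?

    -The key components are a perception module, an optional persistent memory, a world model, an actor, and a cost module made of objective functions. The architecture plans a sequence of actions by optimizing the objective functions, so that the predicted outcome satisfies the given goals.

  • Why does Yann LeCun consider self-supervised learning essential to the development of AI?

    -Yann LeCun considers self-supervised learning key because it is the main reason for the recent major advances in AI. It trains models by exploiting the intrinsic structure of the data, without explicit labels. This lets models understand and process data better and perform well across many tasks, such as language modeling, image recognition, and speech recognition.

  • How does Yann LeCun view current reinforcement learning?

    -Yann LeCun says reinforcement learning gave people hope but turned out to be so inefficient as to be almost impractical in the real world, unless it relies much more on self-supervised learning. He argues that a new learning paradigm is needed, namely the objective-driven AI architecture, to obtain smarter and safer AI systems.

  • What predictions and recommendations does Yann LeCun make about future AI systems?

    -Yann LeCun predicts that future AI systems will be closely tied to our everyday digital interactions, and that we need to build systems with human-level intelligence. He recommends open-source AI platforms so that anyone can fine-tune AI systems for their own language, culture, and values. He also stresses that, as AI develops, its safety and controllability must be ensured to avoid potential risks.

  • How does Yann LeCun view current generative AI models?

    -Yann LeCun says generative AI models work well for text but are not effective for images and other high-dimensional continuous data. He proposes abandoning generative models in favor of joint embedding predictive architectures (JEPA), because these architectures yield better internal representations and are more effective for tasks such as prediction and planning.

  • Why does Yann LeCun think energy-based models suit current AI learning better than probabilistic models?

    -Yann LeCun argues that energy-based models offer a more direct way to express the compatibility or incompatibility of data, without the normalization problem of probabilistic models. Working directly with the energy function avoids handling the denominator (the partition function), which is intractable in many statistical physics problems. Energy-based models are also more flexible in principle and make regularization easier, which helps avoid collapse.

  • How does Yann LeCun view the logical understanding of current deep learning models?

    -Yann LeCun says current deep learning models, large language models in particular, are impressive at generating text but do not really understand logic. They may have to be taught arithmetic explicitly, and they lack a basic understanding of the real world. Their fluency therefore does not imply advanced intelligence or understanding.

  • Which self-supervised learning method does Yann LeCun discuss, and how well does he think it works for image recognition?

    -Yann LeCun discusses a self-supervised learning method called MAE (Masked Auto-Encoder), which masks part of an image and trains a model to reconstruct the missing part. He argues that although the method has some success at reconstructing images, it is less effective than joint embedding methods at producing internal representations.

  • What does Yann LeCun recommend for the safety and controllability of future AI systems?

    -Yann LeCun recommends ensuring safety and controllability through an objective-driven AI architecture. This includes using a world model to predict the consequences of actions and planning action sequences by optimizing objective functions. He also proposes guardrail objectives to keep the system safe and controllable.

  • How does Yann LeCun view current AI systems' ability to handle non-linguistic knowledge?

    -Yann LeCun says current AI systems rely mainly on text data, while most human knowledge is actually non-linguistic. He points out that the background knowledge we gain by observing the world far exceeds the publicly available text. AI systems trained on language alone therefore cannot reach human-level intelligence.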

Outlines

00:00

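🎓 Introduction by Dan Freed, Director of the Harvard Center of Mathematical Sciences and Applications

Dan Freed is the Director of Harvard's Center of Mathematical Sciences and Applications, which was founded by S.T. Yau and is devoted to two-way interaction between mathematics and science. The center has many postdocs working in mathematics, physics, economics, computer science, and biology. It runs programs, workshops, and conferences, and invites experts to give special lectures. This lecture features Yann LeCun, chief AI scientist at Meta and professor at New York University, speaking about objective-driven AI.

05:01

🤖 Yann LeCun's outlook on the future of AI

Yann LeCun discusses the future of AI rather than its present. He presents proposals more than results, and shares preliminary results from the past two years. He stresses how machine learning differs from human learning, pointing out that AI systems lack objective-driven behavior and common sense. He reviews the shift in learning paradigms from supervised learning to reinforcement learning and then to self-supervised learning, and stresses the importance of building systems with human-level intelligence.

10:03

🧠 Criticism of existing AI systems

LeCun criticizes the limitations of existing AI systems, such as their inability to learn new tasks quickly, their lack of reasoning and planning, and their lack of common-sense understanding of the world. He notes that although AI beats humans at some tasks, it is not general. He also mentions AI safety and controllability, and how they can be achieved by specifying objectives.

15:05

🚀 The rise and applications of self-supervised learning

LeCun explains the concept of self-supervised learning and its successes in AI, including language models, image recognition, and speech recognition. He describes different forms of self-supervised learning, such as filling in masked words in text, and how neural networks are trained this way to predict missing information.

20:06

🧠 The intelligence gap between humans and AI

LeCun discusses the differences between human intelligence and AI, stressing that humans and animals learn new tasks quickly while AI systems need large amounts of data and time. He points out that most human knowledge is non-verbal, whereas current AI systems are mainly based on text data, which limits what they can learn.

25:06

🛠️ The objective-driven AI architecture

LeCun presents the objective-driven AI architecture, stressing the roles of the perception module, memory, world model, actor, and cost module. He explains how an action sequence is planned by optimizing objectives, and how this architecture differs from existing AI systems.

30:09

🧠 Building and learning world models

LeCun explores how to build world models and how to learn the state of the world through observation and prediction. He proposes using video data for self-supervised learning and discusses the concept of predictive coding. He also proposes a new approach, the joint embedding predictive architecture (JEPA), to improve the learning of world models.

35:11

🤔 Thoughts and challenges for future AI

LeCun shares his thinking on building more advanced AI systems, including how to learn world models from video and how to use those models in planning systems. He stresses the challenge of planning under uncertainty and how exploration mechanisms can be used to adjust the world model. He also discusses the importance of open-source AI platforms and how to keep overregulation from stifling innovation.

40:11

🌐 Discussion and Q&A

In the discussion and Q&A, LeCun answers questions about the JEPA architecture, world models, control policies, and his view of the EU AI Act. He stresses the importance of open-source AI and discusses how a diversity of AI systems can reduce bias. He also mentions future trends in AI and how safer, smarter AI systems can be reached incrementally.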

Keywords

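💡Artificial intelligence

Artificial intelligence is intelligence exhibited by human-made systems that can carry out tasks normally requiring human intelligence, such as visual recognition, language understanding, and decision making. In the video, Yann LeCun discusses the limitations of current AI and the direction of its future development, in particular objective-driven AI architectures.

💡Objective-driven AI

Objective-driven AI refers to AI systems designed to achieve specified objectives or tasks. Such systems plan and execute actions according to the objectives they are given, rather than merely responding to inputs. In the video, Yann LeCun stresses the importance of objective-driven AI and regards it as key to more advanced AI.

💡Self-supervised learning

Self-supervised learning is a machine learning approach that does not depend on externally labeled data; it learns from the structure and patterns in the data itself. In the video, Yann LeCun discusses its use in training large language models and image recognition systems, as well as its limitations and directions for improvement.

💡Energy-based models

An energy-based model is a statistical model that defines an energy function measuring the similarity or compatibility between data points. In the video, Yann LeCun proposes training JEPA architectures with energy-based methods to avoid collapse and improve learning.

💡Hierarchical planning

Hierarchical planning organizes and executes action plans across several levels of abstraction. The system first sets high-level goals and strategies and then refines them into concrete action steps. In the video, Yann LeCun notes that humans and animals use hierarchical planning naturally in everyday life, while existing AI systems cannot yet replicate this ability.

💡World model

A world model is a system that models the state and dynamics of the real world; given the current state and a sequence of actions, it predicts future states. In the video, Yann LeCun stresses that an effective world model is essential to objective-driven AI, because it lets an AI system understand and predict the consequences of its actions.

💡Knowledge bias

Knowledge bias is the tendency of an AI system to make skewed predictions or decisions because some features are over-represented in its training data. In the video, Yann LeCun says AI systems will inevitably be biased, because they are trained on data available on the internet, which itself may be biased in language, culture, and other respects.

💡Open-source AI platforms

An open-source AI platform is an AI system whose source code is public and which anyone may freely use, modify, and distribute. In the video, Yann LeCun argues that open-source AI platforms are essential for diversity of knowledge and for avoiding overregulation.

💡Regulation

Regulation is the setting and enforcement of rules and standards for a field or industry by governments or other regulatory bodies. In the video, Yann LeCun discusses the possible effects of regulating AI, in particular that overregulation could suppress the development of open-source AI platforms.

💡AI safety

AI safety covers the measures and practices that ensure AI systems do not cause harm or risk during design, development, and deployment. In the video, Yann LeCun stresses that safety and controllability must be considered when building more advanced AI systems, to prevent potential negative consequences.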

Highlights

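Yann LeCun presents the concept of objective-driven AI, which aims to build AI systems that can understand and achieve objectives.

Existing AI systems such as large language models (LLMs) lack understanding of the world and the ability to reason, and fall clearly short of the way humans and animals learn.

LeCun stresses the success of self-supervised learning in AI, especially in natural language processing and image recognition.

He proposes a new AI architecture, the joint embedding predictive architecture (JEPA), to replace current generative models, and addresses the collapse problem that comes with it.

LeCun discusses learning from video through self-supervised learning, which is essential for building AI systems that can understand and predict complex states of the world.

He presents a training method called VICReg, which prevents collapse by maximizing information content and improves the model's generalization.

LeCun stresses the importance of open-source AI platforms, which he sees as key to democracy and to avoiding overregulation.

He predicts that human-level AI will not arrive as a sudden event but as a gradual process that keeps increasing the knowledge, capability, and safety of AI systems.

LeCun discusses how AI systems can amplify humanity's global intelligence, and presents a vision of the future in which everyone interacts with the digital world through AI systems.

He presents an energy-based approach to learning, which works directly with the energy function rather than with probability distributions and so avoids an intractable denominator.

LeCun stresses the importance of planning algorithms in AI systems, especially planning effectively under uncertainty.

He presents a model called I-JEPA, which performs well on tasks such as image recognition and outperforms other self-supervised learning methods.

LeCun discusses how to train AI systems to understand partial differential equations (PDEs), which matters for AI applications in science and engineering.

He stresses the importance of avoiding bias in AI systems, and proposes addressing it by providing a diversity of AI systems.

LeCun presents a vision of future AI in which AI systems become a repository of human knowledge that everyone can use and customize to their own needs.

He warns that excessive AI regulation could kill open-source AI platforms, which are essential for technological progress and democracy.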

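He presents a video model called V-JEPA, which learns a world model by predicting the missing parts of a video and has potential applications in understanding actions and events in video.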

Transcripts

play00:00

- I'm Dan Freed,

play00:01

Director of the Center of Mathematical Sciences

play00:04

and Applications here at Harvard.

play00:07

This is a center that was founded 10 years ago by S.T. Yau.

play00:11

It's a mathematics center.

play00:13

We engage in mathematics

play00:15

and mathematics in interaction

play00:17

two-way interaction with science.

play00:19

We have quite a crew of postdocs

play00:22

doing research in mathematics

play00:24

and mathematics, in physics, in economics,

play00:27

in computer science and biology.

play00:30

We run some programs, workshops, conferences,

play00:33

and a few times a year we have special lectures,

play00:37

and today is one of them.

play00:39

This is the fifth annual Ding-Shum lecture.

play00:42

And we're very pleased today to have Yann LeCun,

play00:45

who's the chief AI scientist at Meta,

play00:49

and a professor at New York University,

play00:52

an expert on machine learning in many, many forms.

play00:55

And today, he'll talk to us about Objective-Driven AI.

play01:08

- Thank you very much.

play01:09

Thank you for inviting me, for hosting me.

play01:11

It seems to me like I give a talk at Harvard

play01:14

every six months or so, at least for the last few years,

play01:20

but to different crowds, physics department,

play01:24

Center for Mathematics,

play01:27

psychology, everything.

play01:35

So I'm going to talk obviously about AI,

play01:39

but more about the future than about the present.

play01:42

And a lot of it is going to be

play01:46

basically, proposals rather than results,

play01:48

but preliminary results on the way to go.

play01:53

I wrote a paper that I put online about two years ago

play01:58

on what this program is about.

play02:00

And you're basically going to hear a little bit of

play02:02

what we have accomplished in the last two years

play02:05

towards that program.

play02:07

If you're wondering about the picture here on the right,

play02:10

this is my amateurish connection with physics.

play02:13

I take also photography pictures.

play02:16

This is taken from my backyard in New Jersey.

play02:20

It's Messier 51, beautiful galaxy.

play02:27

Okay, machine learning sucks.

play02:32

At least compared to what we observe in humans and animals.

play02:36

It really isn't that good.

play02:41

Animals and humans can learn new tasks extremely quickly

play02:45

with very few samples or trials.

play02:49

They understand how the world works,

play02:50

which is not the case for AI systems today.

play02:52

They can reason and plan, which is not the case

play02:54

for AI systems today.

play02:56

They have common sense, which is not the case

play02:58

for AI systems today.

play03:00

And their behavior is driven by objectives,

play03:02

which is also not the case for most AI systems today.

play03:06

Objectives means, you set an objective

play03:09

that you try to accomplish

play03:10

and you kind of plan a sequence of action

play03:11

to accomplish this goal.

play03:14

And AI systems like LLMs don't do this at all.

play03:18

So the paradigms of learning, supervised learning

play03:22

has been very popular.

play03:25

A lot of the success of machine learning

play03:27

at least until fairly recently

play03:29

was mostly with supervised learning.

play03:31

Reinforcement learning gave some people a lot of hope,

play03:35

but turned out to be so inefficient

play03:36

as to be almost impractical in the real world,

play03:39

at least in isolation,

play03:41

unless you rely much more on something

play03:45

called self-supervised learning,

play03:46

which is really what has brought about

play03:48

the big revolution that we've seen in AI

play03:50

over the last few years.

play03:54

So the goal of AI really is,

play03:59

to build systems that are as smart as humans, if not more.

play04:03

And we have systems that are better than humans

play04:05

at various tasks today.

play04:06

They're just not very general.

play04:09

So hence people who call human-level intelligence,

play04:12

artificial general intelligence, AGI.

play04:14

I hate that term,

play04:16

because human intelligence is actually not general at all,

play04:19

it's very specialized.

play04:22

So I think talking about general intelligence,

play04:24

when we really mean human-level intelligence,

play04:27

is complete nonsense,

play04:29

but that ship has sailed unfortunately.

play04:33

But we do need systems that have human-level intelligence,

play04:37

because in a very near future, or not so near future,

play04:40

but in the near future, every single one of our interactions

play04:44

with the digital world will be mediated by an AI system.

play04:50

We'll have AI systems that are with us at all times.

play04:53

I'm actually wearing smart glasses right now.

play04:55

I can take a picture of you guys.

play04:58

Okay, I can click a button or I can say,

play05:01

"Hey, Meta, take a picture,"

play05:06

and it takes a picture.

play05:10

Or I can ask it a question,

play05:11

and there is an LLM that will answer that question.

play05:13

You're not going to hear it, because it's bone conduction,

play05:15

but it's pretty cool.

play05:18

So pretty soon we'll have those things

play05:20

and it will be basically the main way

play05:22

that we interact with the digital world.

play05:24

Eventually, those systems will have displays

play05:27

which this pair of glasses doesn't have,

play05:31

and we'll use those AI systems all the time.

play05:35

The way for them to be non-frustrating

play05:39

is for them to be as smart as human assistants, right?

play05:43

So we need human-level intelligence

play05:45

just for reasons of basically product design, okay?

play05:51

But of course, there's a more kind of interesting

play05:52

scientific question of really what is human intelligence

play05:55

and how can we reproduce it in machines

play05:58

and things like that.

play05:59

So it's one of those kind of small number of areas

play06:04

where there is people who want a product

play06:09

and are ready to pay for the development of it,

play06:11

but at the same time,

play06:12

it's a really great scientific question to work on.

play06:16

And there's not a lot of domains

play06:17

where that's the case, right?

play06:22

So, but once we have smart assistants

play06:27

that have human-level intelligence,

play06:28

this will amplify humanity's global intelligence,

play06:34

if you want.

play06:35

I'll come back on this later.

play06:38

We're very far from that, unfortunately, okay?

play06:40

Despite all the hype you hear from Silicon Valley mostly,

play06:43

the people who tell you AGI is just around the corner.

play06:47

We're not actually that close.

play06:50

And it's because the systems

play06:53

that we have at the moment

play06:54

are extremely limited in some of the capabilities

play06:56

that we have.

play07:01

If we had systems that approached human intelligence,

play07:03

we would have systems that can learn

play07:05

to drive a car in 20 hours of practice,

play07:07

like any 17-year-old.

play07:08

And we do have self-driving cars,

play07:11

but they are heavily engineered, they cheat by using maps,

play07:14

using all kinds of expensive sensors, active sensors,

play07:18

and they certainly use a lot more than

play07:20

20 hours of training data.

play07:23

So obviously, we're missing something big.

play07:25

If we had human-level intelligence,

play07:27

we would have domestic robots that could do simple tasks

play07:30

that a 10-year-old can learn in one shot,

play07:32

like clearing up the dinner table

play07:35

and clearing out the dishwasher.

play07:37

And unlike 10-year-olds,

play07:38

it wouldn't be difficult to convince them to do it, right?

play07:45

But in fact, it's not even humans: just what a cat can do,

play07:48

no AI system at the moment can do, in terms of

play07:50

planning complex sequences of actions

play07:53

to jump on a piece of furniture or catch a small animal.

play08:00

So we're missing something big.

play08:04

And basically, what we're missing is systems

play08:07

that are able to learn how the world works,

play08:10

not just from text, but also from let's say video

play08:13

or other sensory inputs.

play08:15

Systems that have internal world models,

play08:18

systems that have memory, they can reason,

play08:20

they can plan hierarchically like every human and animal.

play08:24

So that's the list of requirements,

play08:27

systems that learn world models from sensory inputs,

play08:30

learning intuitive physics, for example,

play08:32

which babies learn in the first few months of life.

play08:35

Systems that have persistent memory,

play08:37

which current AI systems don't have.

play08:39

Systems that can plan actions,

play08:42

so as to fulfill objectives.

play08:44

And systems that are controllable and safe,

play08:48

perhaps through the specification of Guardrail objectives.

play08:52

So this is the idea of objective-driven AI architectures.

play08:55

But before I talk about this, I'm going to lay the groundwork

play08:57

for how we can go about that.

play09:02

So the first thing is that self-supervised learning

play09:04

has taken over the world.

play09:06

And I first need to explain

play09:07

what self-supervised learning is,

play09:09

or perhaps a special case of it.

play09:12

But really the success of LLMs and all that stuff,

play09:15

and even image recognition these days,

play09:18

and speech recognition, translation,

play09:21

all the cool stuff in AI,

play09:22

it's really due to self-supervised learning

play09:24

the generalization of the use of self-supervised learning.

play09:27

So a particular way of doing it is you take a piece of data,

play09:30

let's say a text, you transform it or you corrupt it

play09:34

in some way.

play09:35

For a piece of text, that would be

play09:38

replacing some of the words by blank markers, for example.

play09:42

And then you train some gigantic neural net

play09:44

to predict the words that are missing,

play09:46

basically, to reconstruct the original input, okay?

play09:52

This is how an LLM is trained.

play09:54

It's got a particular architecture,

play09:56

one that only lets the system look at words on the left

play10:03

of the word to be predicted.

play10:04

But it's pretty much what it is.

play10:06

And this is a generative architecture,

play10:08

because it produces parts of the input, okay?
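
The corruption step described above can be sketched in a few lines. This is a minimal illustration, not how any particular LLM is built: the `[BLANK]` marker, the `corrupt` helper, and the 30% masking rate are assumptions made for the example, and no neural net is trained here.

```python
import random

MASK = "[BLANK]"  # hypothetical blank marker; real systems use their own special token


def corrupt(words, mask_rate=0.3, seed=0):
    """Replace a random subset of words with a blank marker.

    Returns the corrupted sequence and a dict {position: original word},
    which plays the role of the training target.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, word in enumerate(words):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets[i] = word
        else:
            corrupted.append(word)
    return corrupted, targets


sentence = "the cat jumped on the piece of furniture".split()
corrupted, targets = corrupt(sentence)

# A model would be trained to predict targets[i] at every masked position i,
# that is, to reconstruct the original input from the corrupted one.
restored = [targets.get(i, w) for i, w in enumerate(corrupted)]
print(corrupted)
print(restored == sentence)
```

The training signal comes entirely from the data itself: the "label" for each blank is just the word that was removed.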

play10:14

There are systems of this type

play10:15

that have been trained to produce images

play10:18

and they use other techniques like diffusion models,

play10:22

which I'm not going to go into.

play10:25

I played with one, so Meta has one of course.

play10:27

which you can talk to through WhatsApp and Messenger,

play10:30

and there's a paper that describes

play10:31

the system that Meta has built.

play10:34

And I typed the prompt here, up there in that system,

play10:39

a photo of a Harvard mathematician

play10:41

proving the Riemann hypothesis on the blackboard

play10:44

with the help of an intelligent robot,

play10:45

and that's what it produces.

play10:51

I checked the proof, it's not correct,

play10:57

actually, there's symbols here

play10:58

that I have no idea what they are.

play11:04

Okay, so, everybody is excited about generative AI

play11:09

and a particular type of it called auto-regressive LLMs,

play11:15

and really it's trained very much like I described.

play11:20

But as I said, the system can only use words

play11:22

that are on the left of it

play11:24

to predict a particular word when you train it.

play11:26

So the result is that once the system is trained,

play11:29

you can show it a sequence of words

play11:31

and then ask it to produce the next word.

play11:34

Okay, then you can inject that next word into the input.

play11:37

You shift the input by one, okay?

play11:40

So the stuff that was produced by the system

play11:43

now becomes part of the input

play11:44

and you ask it to produce the second word, shift that in,

play11:47

produce the next, next word,

play11:49

shift that in, et cetera, right?

play11:50

So that's called auto-regressive prediction.

play11:52

It's not a new concept, it's very, very old

play11:55

in statistics and signal processing,

play11:56

and in economics actually.

play12:00

But that's the way an LLM works.

play12:03

It's auto-regressive.

play12:05

It uses its own prediction as inputs.
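
The shift-and-predict loop just described can be sketched as follows. The lookup table stands in for a trained model and is purely an assumption for illustration, as are the names `generate`, `toy_next_word`, and the context size.

```python
def generate(next_word, prompt, n_new, context_size=4):
    """Auto-regressive decoding: feed each prediction back in as input."""
    words = list(prompt)
    for _ in range(n_new):
        context = words[-context_size:]   # window of words to the left
        words.append(next_word(context))  # the prediction becomes part of the input
    return words


# Toy stand-in for a trained model (an assumption, not a real LLM):
# a lookup table keyed on the last word of the context.
TABLE = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}


def toy_next_word(context):
    return TABLE.get(context[-1], "the")


out = generate(toy_next_word, ["the"], n_new=5)
print(out)  # ['the', 'cat', 'sat', 'on', 'the', 'cat']
```

Nothing in the loop lets the system revise a word once it has been emitted, which is the property the talk criticizes later.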

play12:09

So those things work amazingly well

play12:12

for the simplicity conceptually of how they're trained,

play12:16

which is just predict missing words.

play12:18

It's amazing how well they work.

play12:21

Modern ones are trained typically on a few trillion tokens.

play12:25

This slide is too old now, I should add a zero.

play12:27

It's not 1 to 2 trillion, it's more like 20 trillion.

play12:31

So a token is a sub-word unit, really,

play12:34

it's on average 3/4 of a word.

play12:38

And there is a bunch of those models

play12:39

that have appeared in the last few years.

play12:42

It's not just in the last year and a half

play12:45

since ChatGPT came out.

play12:47

That's what made it known to the wider public.

play12:50

But those things have been around for quite a while.

play12:53

Things like BlenderBot, Galactica, LLaMA, Llama-2,

play12:56

Code Llama, which are produced by FAIR,

play12:58

Mistral and Mixtral from a small French company

play13:02

formed by former FAIR people,

play13:05

and then various others, like Gemma more recently, by Google.

play13:08

And then proprietary models, Meta AI,

play13:12

which is built on top of Llama-2,

play13:14

and then Gemini from Google, ChatGPT, GPT-4, et cetera.

play13:21

And those things make stupid mistakes.

play13:23

They don't really understand logic very well,

play13:25

but if you tell them that A is the same thing as B,

play13:28

they don't necessarily know that B

play13:31

is the same as A, for example.

play13:33

They don't really understand transitivity

play13:36

of ordering relationships and things like this.

play13:39

They don't do logic.

play13:41

You have to sort of explicitly teach them to do arithmetic

play13:44

or have them call tools to do arithmetic.

play13:49

And they don't have any knowledge of the underlying reality.

play13:51

They've only been trained on text.

play13:52

Some of them have been trained also on images,

play13:54

but it's basically by treating images like text.

play13:57

So it's very limited,

play14:00

but it's very useful to have those things open sourced

play14:03

and available to everyone,

play14:04

because everyone can sort of experiment with them

play14:07

and do all kinds of stuff.

play14:09

And there's literally millions of people using Llama

play14:13

as a basic platform.

play14:15

So self-supervised learning is not just used to produce text,

play14:18

but also to do things like translation.

play14:19

So there's a system produced by my colleagues

play14:22

a few months ago called SeamlessM4T.

play14:25

It can translate 100 languages into 100 languages.

play14:31

And it can do text to text, text to speech,

play14:33

speech to text, and speech to speech.

play14:36

And for speech to speech,

play14:37

it can actually translate languages that are not written,

play14:40

which is pretty cool.

play14:43

It's also available, you can play with it.

play14:46

It's pretty amazing.

play14:47

I mean, that's kind of superhuman in some way, right?

play14:48

I mean, there's few humans that can translate 100 languages

play14:51

into 100 languages in any direction,

play14:55

we actually had a previous system

play14:56

that could do 200 languages, but only from text,

play14:58

not from speech.

play15:02

But there are dire limitations to the system.

play15:04

The first thing is that auto-regressive prediction

play15:08

is basically an exponentially divergent process.

play15:12

Every time the system produces a word,

play15:14

there is some chance that this word is

play15:15

outside of the set of proper answers.

play15:19

And there is no way to come back to correct mistakes, right?

play15:23

So the probability that a sequence of words

play15:26

will be kind of a correct answer to the question

play15:30

decreases exponentially with the length of the answer,

play15:32

which is not a good thing.

play15:34

And there's various kind of technical papers on this,

play15:37

not by me, that tend to show this.
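
A back-of-the-envelope version of this argument, under the simplifying assumption (not stated in the talk) that each produced token independently falls outside the set of acceptable answers with probability $e$, where $e$ and $n$ are symbols introduced here for illustration:

```latex
P(\text{answer of length } n \text{ is correct}) = (1 - e)^{n},
\qquad \text{e.g. } e = 0.01,\ n = 100:\quad 0.99^{100} \approx 0.37
```

Real token errors are not independent, so this is only a sketch of why the probability of staying correct shrinks exponentially with answer length.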

play15:41

A lot of criticism also on the fact

play15:44

that those systems can't really plan.

play15:46

So the amount of computation that an LLM devotes

play15:49

to producing a token is fixed, right?

play15:51

You give it a prompt, it runs through

play15:54

however many layers it has in the architecture

play15:56

and then produces a token.

play15:58

So per token, the amount of computation is fixed.

play16:01

The only way to get a system

play16:02

to think more about something

play16:03

is to trick it into producing more tokens,

play16:06

which is kind of a very kind of circuitous way

play16:08

of getting it to do work.

play16:13

And so there's been quite a bit of research

play16:15

on the question of whether those systems

play16:17

are actually capable of planning,

play16:19

and the answer is no, they really can't plan.

play16:22

Whenever they can plan or produce a plan,

play16:25

it's basically because they've been trained

play16:26

on a very similar situation and they already saw a plan

play16:30

and they basically regurgitate a very similar plan,

play16:33

but they can't really use tools in new ways, right?

play16:40

And then there is the last limitation,

play16:42

which is that they're trained on language.

play16:44

And so they only know whatever knowledge

play16:47

is contained in language.

play16:49

And this may sound surprising,

play16:50

but most of human knowledge

play16:52

actually has nothing to do with language.

play16:56

So they can be used as writing assistants,

play17:00

giving you ideas if you have white-page anxiety

play17:05

or something like this.

play17:06

They're not good so far for producing factual content

play17:10

and consistent answers,

play17:11

although they're kind of being modified for that.

play17:17

And we are easily fooled into thinking

play17:20

that they're intelligent, because they're fluent,

play17:23

but really they're not that smart.

play17:26

And they really don't understand how the world works.

play17:29

So we're still far from human-level AI.

play17:34

As I said, most of human and animal knowledge

play17:37

certainly is non-verbal.

play17:40

So what are we missing?

play17:44

Again, I'm reusing those examples of learning to drive

play17:47

or learning to clear the dinner table.

play17:50

We are not going to have human-level AI

play17:53

before we have domestic robots that can do those things.

play17:59

And this is called Moravec's paradox,

play18:01

the fact that there are things that appear complex

play18:03

for humans like playing chess

play18:05

or planning a complex trajectory,

play18:09

and they're fairly simple for computers.

play18:13

But then things that we take for granted

play18:15

that we think don't require intelligence,

play18:16

like what a cat can do,

play18:19

it's actually fiendishly complicated.

play18:22

And the reason might be this,

play18:24

so it might be the fact that

play18:30

the data bandwidth of text

play18:33

is actually very low, right?

play18:34

So a 10 trillion token dataset

play18:38

is basically, the totality of the publicly available text

play18:43

on the internet, that's about 10 to the 13 bytes,

play18:47

or 10 to the 13 tokens, I should say.

play18:49

A token is typically two bytes.

play18:51

There's about 30,000 possible tokens in a typical language.

play18:55

So that's 2 times 10 to the 13 bytes for training an LLM.

play19:00

It would take 170,000 years for a human to read

play19:04

at eight hours a day, 250 words per minute

play19:07

or 100,000 years, if you read fast

play19:10

and you read 12 hours a day.

play19:13

Now consider a human child, a 4-year-old child,

play19:17

a 4-year-old child has been awake 16,000 hours at least,

play19:20

that's what psychologists are telling us,

play19:25

which by the way is only 30 minutes of YouTube uploads.

play19:30

We have 2 million optical nerve fibers going into

play19:33

our visual cortex, about a million from each eye.

play19:37

Each fiber maybe carries about 10 bytes per second.

play19:40

Jaim is going, "What?"

play19:44

This is an upper bound.

play19:47

And so the data volume that a 4-year-old has seen

play19:50

through vision is probably on the order of 10 to the 15 bytes.

play19:56

That's way more than the totality

play19:58

of all the texts publicly available on the internet.

play20:01

50 times more, 50 times more data by the time you're four

play20:06

that you've seen through vision.

play20:08

So that tells you a number of things,

play20:09

but the first thing it tells you is that

play20:10

we're never going to get to human-level AI

play20:14

by just training on language, it's just not happening.

play20:17

There's just too much background knowledge

play20:18

about the world that we get from observing the world

play20:21

that current AI systems don't get.
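
The arithmetic behind this comparison can be checked directly. The sketch below uses only figures quoted in the talk (10^13 tokens, 2 bytes per token, about 3/4 of a word per token, 250 words per minute for 8 hours a day, 16,000 waking hours, 2 million optic nerve fibers at roughly 10 bytes per second each); the variable names and the 365-day year are assumptions for the example.

```python
# Text side: size of the training corpus and how long a human would take to read it.
TOKENS = 10**13
BYTES_PER_TOKEN = 2
WORDS_PER_TOKEN = 0.75

text_bytes = TOKENS * BYTES_PER_TOKEN             # 2 x 10^13 bytes
reading_minutes = TOKENS * WORDS_PER_TOKEN / 250  # at 250 words per minute
reading_years = reading_minutes / 60 / 8 / 365    # at 8 hours a day

# Vision side: data reaching the visual cortex of a 4-year-old.
awake_seconds = 16_000 * 3600
vision_bytes = awake_seconds * 2_000_000 * 10     # fibers x bytes/second per fiber

ratio = vision_bytes / text_bytes

print(f"text corpus:   {text_bytes:.2e} bytes")
print(f"reading time:  {reading_years:,.0f} years")
print(f"visual input:  {vision_bytes:.2e} bytes")
print(f"vision / text: {ratio:.1f}x")
```

The ratio comes out a little under 60, consistent with the rounded "50 times more" quoted in the talk, and the visual estimate rests on the 10 bytes per second per fiber figure that the speaker himself calls an upper bound.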

play20:28

So that leads me to this idea of objective-driven AI system.

play20:34

What is it that sort of makes humans, for example,

play20:38

capable of, or animals for that matter,

play20:39

capable of kind of using tools and objects and situations

play20:44

in new ways and sort of invent new ways of behaving?

play20:51

So I wrote a fairly readable,

play20:55

fairly long paper on this.

play20:58

You see the URL here, it's not on arXiv,

play21:00

because it's on this OpenReview site,

play21:02

on which you can comment,

play21:03

tell me how wrong this is and everything.

play21:08

And the basic architecture

play21:13

is kind of shown here.

play21:14

So every time you have an arrow,

play21:16

that means there is signals going through,

play21:18

but also means there might be gradients going backwards.

play21:21

So I'm assuming everything in there is differentiable.

play21:25

And there is a perception module

play21:26

that observes the world,

play21:28

turns it into representations of the world,

play21:30

a memory that might be sort of persistent memory,

play21:35

factual memory, things like that.

play21:36

A world model, which is really the centerpiece

play21:38

of this system,

play21:40

an actor, and a cost module: objective functions.

play21:44

The configurator, I'm not going to talk about,

play21:45

at least not for now.

play21:47

So here is how this system works.

play21:48

A typical episode is that the system observes the world,

play21:53

feeds this through this perception system.

play21:55

Perception system produces some idea

play21:58

of the current state of the world,

play22:00

or at least the part of the world

play22:01

that is observable currently.

play22:04

Maybe it can combine this with the content of a memory

play22:07

that contains the rest of the state of the world

play22:09

that has been previously observed.

play22:11

Okay, so you get some pretty good idea

play22:12

of what the current state of the world is.

play22:15

And then the world model, the role of the world model

play22:17

is to take into account the current state of the world

play22:19

and a hypothesized sequence of actions

play22:24

and to produce a prediction

play22:27

as to what is going to be the future state of the world

play22:30

resulting from taking those actions, okay?

play22:34

So state of the world at time, t, sequence of actions,

play22:38

state of the world at time, t plus, whatever.

play22:42

Now that outcome, that predicted state of the world

play22:47

goes into a number of modules,

play22:51

whose role is to compute basically a scalar objective.

play22:54

So each of those square boxes here,

play22:57

the red square boxes or pink ones,

play22:59

they're basically scalar-valued functions

play23:01

that take a representation of the state of the world

play23:05

and tell you how far the state of the world

play23:08

is from a particular goal,

play23:10

objective target, whatever it is.

play23:14

Or it takes a sequence of predicted states

play23:17

and it tells you to what extent that sequence of states

play23:20

is dangerous, toxic, whatever it is, right?

play23:23

So those are the guardrail objectives.

play23:27

Okay, so an episode now consists in what the system will do.

play23:33

The way it operates, the way it produces its output,

play23:36

which is going to be an action sequence,

play23:39

is going to be by optimizing the objectives,

play23:44

the red boxes,

play23:46

whatever comes out of the red boxes

play23:48

with respect to the action sequence, right?

play23:50

So there's going to be an optimization process

play23:53

that is going to look for search for

play23:55

an action sequence in such a way

play23:58

that the predicted outcome end state of the world

play24:01

satisfies the objectives, okay?

play24:06

So this is intrinsically very different principle

play24:08

from just running through a bunch of layers

play24:10

in the neural net.

play24:11

This is intrinsically more powerful, right?

play24:13

You can express pretty much any algorithmic problem

play24:17

in terms of an optimization problem.

play24:19

And this is basically an optimization problem.

play24:21

And not specifying here exactly

play24:24

what optimization algorithm to use.

play24:27

If the action sequence space, the space

play24:29

in which we do this inference is continuous,

play24:32

we can use gradient-based methods,

play24:34

because all of those modules are differentiable.

play24:36

So we can back propagate gradients

play24:38

backwards through those arrows

play24:40

and then update the action sequence

play24:43

to minimize the objectives and then converge to

play24:46

an optimal action sequence

play24:48

for the objective we're looking for,

play24:50

according to a world model.

play24:54

If the world model is something like

play24:56

discrete time differential equation or something like this,

play25:00

we might have to run it for multiple steps.

play25:02

Okay, so the initial world state

play25:06

is fed to the world model together with an initial action

play25:09

that predicts the next state.

play25:11

From that next state, we feed another action

play25:14

that predicts the next, next state.

play25:16

The entire sequence can be fed to the guardrail objectives,

play25:19

and then the end result

play25:21

is fed to the task objective, essentially.
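
A minimal sketch of this planning loop, under loud assumptions: the world model here is a hypothetical additive toy (`state + action`), the task objective is a squared distance to a goal, and the guardrail is a stand-in that penalizes large actions rather than scoring predicted states. None of these functions come from the talk; they only illustrate rolling a model forward over several steps and updating the action sequence by gradient descent.

```python
import numpy as np

def world_model(state, action):
    # Hypothetical toy dynamics: the action is a displacement added to the state.
    return state + action

def rollout(state0, actions):
    # Apply the world model step by step and collect every predicted state.
    states = [state0]
    for a in actions:
        states.append(world_model(states[-1], a))
    return np.stack(states)

def task_cost(final_state, goal):
    # Task objective: squared distance between the predicted end state and the goal.
    return float(np.sum((final_state - goal) ** 2))

def guardrail_cost(actions, weight=0.1):
    # Stand-in guardrail (an assumption): penalize large actions over the sequence.
    return float(weight * np.sum(actions ** 2))

def plan(state0, goal, horizon=5, steps=200, lr=0.05, weight=0.1):
    # Start from an all-zero action sequence and refine it by gradient descent.
    actions = np.zeros((horizon, state0.shape[0]))
    for _ in range(steps):
        final_state = rollout(state0, actions)[-1]
        # Analytic gradient of task_cost + guardrail_cost w.r.t. each action.
        # With additive dynamics, every action shifts the final state equally.
        grad = 2.0 * (final_state - goal)[None, :] + 2.0 * weight * actions
        actions -= lr * grad
    return actions
```

Calling `plan(np.array([0.0, 0.0]), np.array([1.0, 2.0]))` returns five small steps whose rollout ends close to the goal; the guardrail weight trades off reaching the goal exactly against keeping actions small. With a learned, nonlinear world model the gradient would come from backpropagation instead of the closed form used here.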

play25:27

So this is sort of an ideal situation

play25:31

where the world model is deterministic,

play25:36

'cause the world might be deterministic,

play25:38

there is very little uncertainty

play25:41

about what's going to happen

play25:42

if I do a sequence of actions to grab this bottle,

play25:47

I'm in control.

play25:48

But most of the world is not completely predictable.

play25:50

So you probably need some sort of latent variable

play25:52

that you feed to your world model

play25:54

that would account for all the things

play25:56

you don't know about the world.

play25:58

You might have to sample those latent variables

play26:01

within a distribution to make multiple predictions

play26:03

about what might happen in the future,

play26:06

because of uncertainties in the world.
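
One way to picture the latent variable, as a sketch only: the dynamics, the Gaussian prior over `z`, and the function names below are all assumptions made for illustration. Each sampled `z` yields a different predicted future for the same state and action.

```python
import numpy as np

def stochastic_world_model(state, action, z):
    # Hypothetical dynamics: the latent z stands for whatever is unknown
    # about the world and perturbs the outcome of the action.
    return state + action + z

def sample_futures(state, action, num_samples=1000, noise_scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed prior over the latent: zero-mean Gaussian.
    zs = rng.normal(0.0, noise_scale, size=(num_samples,) + state.shape)
    # One prediction per sampled latent.
    return np.stack([stochastic_world_model(state, action, z) for z in zs])
```

A planner could then score each sampled future with the objectives and, for instance, optimize the average or the worst case; which of these is appropriate is a design choice the talk does not settle.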

play26:09

Really, what you want to do ultimately,

play26:11

is not this type of kind of one level planning,

play26:14

but you want to do hierarchical planning.

play26:16

So basically, have a system that can produce multiple

play26:20

representations of the state of the world,

play26:21

have multiple levels of abstraction,

play26:23

so that you can make predictions

play26:26

more or less longterm in the future.

play26:28

So here's an example.

play26:31

Let's say I'm sitting in my office at NYU in New York

play26:35

and I want to go to Paris.

play26:38

I'm not going to plan my entire trip from New York to Paris

play26:42

in terms of millisecond by millisecond muscle control.

play26:45

It's impossible.

play26:47

It would be intractable in terms of optimization, obviously,

play26:50

but also it's impossible,

play26:51

because I don't know the conditions that will occur.

play26:55

Do I have to avoid a particular obstacle

play26:57

that I haven't seen yet?

play26:59

Is a street light going to be red or green?

play27:03

How long am I going to wait to grab a taxi?

play27:05

Whatever.

play27:07

So I can't plan everything from the start,

play27:12

but what I can do is I can do high level planning,

play27:15

so high level planning at a very abstract level,

play27:18

I know that I need to get to the airport and catch a plane.

play27:20

Those are two macro actions, right?

play27:24

So that determines a sub-goal for the lower level.

play27:27

How do I get to the airport?

play27:30

Well, I'm in New York, so I need to go down in the street

play27:32

and hail a taxi.

play27:34

That sets a goal for the level below.

play27:38

How do I get to the street? I have to

play27:42

take the elevator down and then walk out on the street.

play27:45

How do I go to the elevator?

play27:46

I need to stand up from my chair, open the door of my office,

play27:49

walk to the elevator, push the button.

play27:51

How do I get up from my chair?

play27:55

And that I can't describe,

play27:57

because it's like muscle control and everything, right?

play27:59

So you can imagine that there is

play28:01

this hierarchical planning thing going on.

play28:03

We do this completely effortlessly,

play28:04

absolutely all the time. Animals do this very well.

play28:07

No AI system today is capable of doing this.

play28:13

Some robotic systems do hierarchical planning,

play28:16

but it's hardwired, it's handcrafted, right?

play28:20

So if you want to have a walking robot,

play28:24

walk from here to the door, stairs,

play28:28

you first have a high level planning of the trajectory,

play28:31

you're not going to walk directly through here,

play28:33

you're going to have to go through the stairs, et cetera.

play28:35

And then at the lower level, you're going to plan the motion

play28:38

of the legs to kind of follow that trajectory.

play28:40

But that's kind of handcrafted.

play28:42

It's not like the system has learned to do this.

play28:45

It was kind of built by hand.

play28:47

So how do we get systems to spontaneously learn

play28:50

the appropriate levels of abstractions

play28:53

to represent action plans?

play28:55

And we really don't know how to do this,

play28:58

or at least we don't have any demonstration of any system

play29:00

that does this, that actually works.

play29:05

Okay, so next question is going to be,

play29:08

if we're going to build a system of this type,

play29:10

is how are we going to build a world model?

play29:13

Again, a world model is: state of the world at time t, action,

play29:18

predicted state of the world at time, t plus 1,

play29:22

whatever the unit of time is.

play29:25

And the question is, how do humans do this or animals?

play29:30

So you look at what age babies learn basic concepts.

play29:34

This is a chart from Emmanuel Dupoux,

play29:36

who's a psychologist in Paris.

play29:40

And the basic things like basic object categories

play29:43

and things like this that are learned pretty early on

play29:46

without language, right?

play29:47

Babies don't really understand language at the age

play29:49

of four months, but they develop

play29:52

the notion of object categories spontaneously,

play29:56

things like solidity, rigidity of object,

play29:58

a difference between animate and inanimate objects.

play30:01

And then intuitive physics pops up around nine months.

play30:04

So it takes about nine months for babies

play30:06

to learn that objects that are not supported,

play30:08

fall because of gravity,

play30:11

and more concepts in intuitive physics.

play30:13

It is not fast, right?

play30:15

I mean, we take a long time to learn this.

play30:17

Most of this, at least in the first few months of life

play30:20

is learned mostly by observation,

play30:22

with very little interaction with the world,

play30:24

'cause a baby until, three, four months

play30:27

can't really kind of manipulate anything or affect the world

play30:30

beyond their limbs.

play30:32

So most of what they learn about the world

play30:34

is mostly observation.

play30:35

And the question is, what type of learning is taking place

play30:38

when babies do this?

play30:39

This is what we need to reproduce.

play30:43

So there is a natural idea

play30:44

which is to just transpose the idea

play30:45

of self-supervised training for text

play30:47

and use it for video, let's say, right?

play30:49

So, take a video, call this y, full video

play30:53

and then corrupt it by masking a piece of it,

play30:57

let's say the second half of the video.

play31:01

So call this masked video x,

play31:03

and then train some gigantic neural net

play31:05

to predict the part of the video that is missing.

play31:08

And hoping that if the system predicts

play31:12

what's going to happen in the video,

play31:13

it probably has a good idea of what the underlying nature

play31:16

of the physical world is.

play31:18

A very natural concept.

play31:20

In fact, neuroscientists have been thinking about

play31:21

this kind of stuff for a very long time.

play31:22

It's called predictive coding.

play31:24

And I mean this idea that you learn by prediction

play31:27

is really very standard.

play31:30

You do this and it doesn't work.

play31:33

We've tried for, my colleague and I

play31:36

have been trying to do this for 10 years,

play31:41

and you don't get good representations of the world,

play31:43

you don't get good predictions.

play31:45

The kind of prediction you get are very blurry,

play31:48

kind of like the video at the top here

play31:51

where the first four frames of that video are observed,

play31:55

the last two are predicted by a neural net

play31:58

and it predicts very blurry images.

play32:00

The reason being that it can't really predict

play32:02

what's going to happen,

play32:03

so it predicts the average of all the plausible things

play32:05

that may happen.

play32:06

And that's a very blurry video.
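
The averaging effect can be seen in a toy case. The sketch below uses made-up one-dimensional "frames" and assumes a squared-error training loss; under that loss, the prediction with the lowest expected error is the mean of the plausible futures, which matches none of them.

```python
import numpy as np

# Two equally plausible futures for a 1-D "frame": a sharp blob on the
# left or on the right (hypothetical toy data).
future_left = np.array([1.0, 0.0, 0.0, 0.0])
future_right = np.array([0.0, 0.0, 0.0, 1.0])
futures = np.stack([future_left, future_right])

def expected_squared_error(prediction):
    # Average squared error of one prediction over the possible futures.
    return float(np.mean(np.sum((futures - prediction) ** 2, axis=1)))

# The average of the plausible futures: half a blob on each side.
blurry = futures.mean(axis=0)
```

Here `expected_squared_error(blurry)` is lower than `expected_squared_error(future_left)`, so a network trained on this loss is pushed toward the smeared-out answer even though it is not a sharp frame.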

play32:09

So it doesn't work.

play32:11

The solution to this is to basically abandon the idea

play32:15

of generative models.

play32:18

That might seem shocking given that this is

play32:20

the most popular thing in machine learning at the moment.

play32:24

But we're going to have to do that.

play32:25

And the solution that I'm proposing, at least,

play32:30

is to replace this by something I call

play32:33

joint embedding predictive architectures, JEPA.

play32:36

This is what a JEPA is.

play32:39

So you take y, you corrupt it, same story

play32:41

or you transform it in some way.

play32:45

But instead of reconstructing y from x,

play32:48

you run both x and y through encoders.

play32:51

And what you reconstruct

play32:52

is you reconstruct the representation of y

play32:55

from the representation of x.

play32:57

So you're not trying to predict every pixel,

play32:59

you're only trying to predict a representation

play33:03

of the input which may not contain all the information

play33:07

about the input,

play33:08

may contain only partial information.

play33:13

So that's the difference between those two architectures.

play33:15

On the left, generative architectures that reproduce y,

play33:20

on the right, joint embedding architectures

play33:23

that embed x and y into a representation space.

play33:27

And you do the prediction in representation space.

play33:31

And there's various flavors

play33:32

of this joint embedding architecture.

play33:37

The one on the left is an old idea called Siamese networks,

play33:42

goes back to the early nineties I worked on.

play33:45

And then there is

play33:46

deterministic and non-deterministic versions

play33:48

of those JEPA architectures.

play33:50

I'm not going to go into the details.

play33:53

The reason why you might need latent variables

play33:57

in the predictor,

play33:58

is because it could be that

play33:59

the world is intrinsically unpredictable

play34:02

or not fully observable or stochastic.

play34:05

And so you need some sort of way

play34:06

of making multiple predictions

play34:07

for a single observation, right?

play34:10

So the z variable here basically parameterizes

play34:14

the set of things you don't know about the world

play34:17

that you have not observed in the state of the world.

play34:20

And that will parameterize the set of potential predictions.

play34:24

Now there's another variable here called a,

play34:25

and that's what turns the joint embedding architecture

play34:29

into a world model.

play34:31

This is a world model, okay?

play34:33

x is an observation,

play34:38

sx is the representation of that observation.

play34:42

a would be an action that you take.

play34:44

And then sy is a prediction of

play34:47

the representation of the state of the world

play34:49

after you've taken the action, okay?

play34:53

And the way you train the system

play34:54

is by minimizing the prediction error.

play34:56

So y would be the future observation

play34:58

of the world, right?

play35:01

x is the past and the present,

play35:03

y is the future.

play35:05

You just have to wait a little bit before you observe it.

play35:08

You make a prediction, you take an action

play35:10

or you observe someone taking an action,

play35:12

you make a prediction about what the state,

play35:14

the future state of the world is going to be.

play35:15

And then you can compare the actual state of the world

play35:18

that you observe with the predicted state

play35:22

and then train the system to minimize the prediction error.

play35:26

But there's an issue with this,

play35:27

which is that that system can collapse.

play35:30

If you only minimize the prediction error,

play35:32

what it can do is ignore x and y completely,

play35:35

produce sx and sy that are constant

play35:38

and then the prediction problem becomes trivial.

play35:40

So you cannot train a system of this type

play35:42

by just minimizing the prediction error.

play35:43

You have to be a little smarter about how you do it.
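
The collapse is easy to reproduce in a sketch. Everything below is hypothetical (random data, an encoder that ignores its input, an identity predictor); it only shows that the prediction error in representation space can be exactly zero while the representations carry no information.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # hypothetical batch of observed inputs
y = rng.normal(size=(8, 16))   # hypothetical batch of targets

def collapsed_encoder(inputs, dim=4):
    # Ignores its input entirely and returns the same vector every time.
    return np.zeros((inputs.shape[0], dim))

def predictor(sx):
    # Identity predictor: enough once both representations are constant.
    return sx

def prediction_error(sx, sy):
    # Mean squared error between predicted and actual representations.
    return float(np.mean((predictor(sx) - sy) ** 2))

sx, sy = collapsed_encoder(x), collapsed_encoder(y)
error = prediction_error(sx, sy)          # zero, yet nothing was learned
variance = float(sx.var(axis=0).sum())    # zero variance across the batch
```

Both `error` and `variance` come out as zero, which is why some extra term is needed to keep the encoders informative.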

play35:48

And to understand how this works,

play35:49

you have to basically use a concept

play35:52

called energy-based models,

play35:53

which you can think of as a weakened version

play35:58

of probabilistic modeling.

play36:02

And for the physicists in the room,

play36:07

the way to go from energies to probabilities

play36:09

is you take the exponential of minus the energy and normalize.

play36:12

But if you manipulate the energy function directly,

play36:14

you don't need this normalization.

play36:16

So that's the advantage.

play36:17

So what is an energy-based model?

play36:18

It's basically, an implicit function F of x,y

play36:21

that measures the degree of incompatibility between x and y.

play36:27

Whether y is a good continuation for x in the case of video,

play36:30

whether y is a good set of missing words from x,

play36:34

things like that, right?

play36:36

But basically, that function takes the two arguments x and y

play36:39

and gives you a scalar value that indicates

play36:42

to what extent x and y are compatible or incompatible.

play36:45

It gives you zero if x and y are compatible or a small value

play36:50

and it gives you a larger value if they're not.

play36:53

Okay, so imagine that those two variables are scalars

play36:57

and the observations are the black dots.

play37:03

That's your training data, essentially.

play37:05

You want to train this energy function

play37:07

in such a way that it takes low values

play37:10

on the training data and around,

play37:13

and then higher value everywhere else.

play37:16

And what I've represented here is kind of

play37:19

the lines of equal energy if you want

play37:24

the contours of equal energy.

play37:27

So how are we going to do this?

play37:28

So, okay, so the energy function is not a function

play37:32

you minimize by training,

play37:34

it's a function you minimize by inference, right?

play37:36

If I want to find a y that is compatible with an x,

play37:41

I search over the space of ys for a value of y

play37:44

that minimizes F of x,y, okay?

play37:46

So the inference process does not consist

play37:49

in running feed-forward through a neural net.

play37:51

It consists in minimizing an energy function

play37:54

with respect to y.

play37:56

And this is computationally,

play37:58

this is intrinsically more powerful

play37:59

than running through a fixed number of layers

play38:01

in the neural net.

play38:02

So that gets around the limitation of auto-regressive LLMs

play38:06

that spend a fixed amount of computation per token.

play38:09

This way of doing inference

play38:12

can spend an unlimited amount of resources

play38:17

figuring out a good y

play38:18

that minimizes F of x,y

play38:20

depending on the nature of F and the nature of y.

play38:25

So if y is a continuous variable

play38:27

and your function hopefully is differentiable,

play38:29

you can minimize it using gradient-based methods.

play38:33

If it's not, if it's discrete,

play38:34

then you will have to do some sort of combinatorial search,

play38:37

but that would be way less efficient.

play38:38

So if you can make everything continuous and differentiable,

play38:43

you're much better off.
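
Inference by minimization can be sketched in a few lines. The energy below is an invented example (low when `y` is near `sin(x)`), and the step size and step count are arbitrary; the point is only that the answer is found by descending the energy with respect to `y`, not by a single forward pass.

```python
import numpy as np

def energy(x, y):
    # Hypothetical energy: low when y is close to sin(x), high otherwise.
    return (y - np.sin(x)) ** 2

def energy_grad_y(x, y):
    # Derivative of the energy with respect to y.
    return 2.0 * (y - np.sin(x))

def infer(x, y_init=0.0, lr=0.1, steps=100):
    # Inference = gradient descent on y for a fixed x.
    y = y_init
    for _ in range(steps):
        y -= lr * energy_grad_y(x, y)
    return y
```

The `steps` argument is the knob that lets harder inputs receive more computation, in contrast with a fixed number of layers per output.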

play38:47

And by the way, I meant, I forgot to mention something

play38:49

when I talked about world model,

play38:51

this idea that you have a world model

play38:52

that can predict what's going to happen as a consequence

play38:54

of a sequence of actions,

play38:57

and then you have an objective you want to minimize

play38:58

and you plan a sequence of actions

play38:59

that minimize the objective.

play39:01

This is completely classical optimal control.

play39:04

It's called model predictive control.

play39:06

It's been around since the early sixties

play39:08

if not the late fifties.

play39:10

And so it's completely standard.

play39:13

The main difference with what we want to do here

play39:16

is that the world model is going to be learned

play39:17

from sensory data as opposed to kind of a bunch of equations

play39:21

you're going to write down for the dynamics

play39:22

of a rocket or something.

play39:24

Here we're just going to learn it from sensory data, right?

play39:28

Okay, so there's two methods really

play39:30

to train those energy functions,

play39:34

so that they take the right shape.

play39:35

Okay, so now we're going to talk about learning

play39:37

how do you shape the energy surface in such a way

play39:40

that it gives you low energy on the data points

play39:42

and high energy outside?

play39:44

And there are two classes of methods

play39:45

to prevent this collapse I was telling you about.

play39:47

So collapse is the situation

play39:49

where you just minimize the energy

play39:51

for whatever training samples you have.

play39:53

And what you get in the end is an energy function

play39:55

that is zero everywhere.

play39:57

That's not a good model.

play39:58

You want an energy function that

play40:01

takes low energy on the data points

play40:02

and high energy outside.

play40:04

So two methods.

play40:05

Contrastive methods consist in generating

play40:08

those green flashing points, contrastive samples

play40:11

and pushing their energy up, okay?

play40:14

So back propagate gradient through the entire system,

play40:17

so that, and tweak the parameters,

play40:19

so that the output energy goes up for a green point

play40:22

and then so that it goes down for a blue point,

play40:24

a data point.

play40:26

But those tend to be inefficient in high dimensions.

play40:28

So I'm more in favor of another set of methods

play40:30

called regularized methods,

play40:32

that basically work by minimizing the volume of space

play40:35

that can take low energy,

play40:37

so that when you push down the energy

play40:38

of a particular region, it has to go up in other places,

play40:41

because there is only a limited amount

play40:44

of low energy stuff to go around.

play40:48

So those are two classes of method

play40:49

I'm going to argue for the regularized methods.

play40:52

But really you should think about two classes of method

play40:55

to train energy-based models.

play40:57

And when I say energy-based models,

play41:00

this also applies to probabilistic models,

play41:02

which are essentially a special case

play41:03

of energy-based models.

play41:09

Okay, there's a particular type of energy-based model

play41:11

which are called latent variable models.

play41:13

And they consist in models

play41:17

that have a latent variable z that is not given to you

play41:20

during training or during tests

play41:21

that you have to infer the value of.

play41:23

And you can do this by either minimizing the energy

play41:26

with respect to z.

play41:27

So if you have an energy function E of x,y,z,

play41:29

you minimize it with respect to z,

play41:32

and then you put that z into the energy function

play41:34

and the resulting function does not depend on z anymore.

play41:36

And I call this F of x,y, right?
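
As a sketch of this step, assume a made-up energy in which a discrete latent `z` picks one of two admissible answers, `z * x`. Minimizing over `z` leaves a function of `x` and `y` only. The function names and the candidate set are illustrative assumptions.

```python
import numpy as np

def energy_xyz(x, y, z):
    # Hypothetical energy: z selects one of several admissible answers z * x.
    return (y - z * x) ** 2

def min_over_latent(x, y, z_candidates=(-1.0, 1.0)):
    # F(x, y) = min over z of E(x, y, z); the result no longer depends on z.
    energies = [energy_xyz(x, y, z) for z in z_candidates]
    best = int(np.argmin(energies))
    # Return both the minimized energy and the latent value that achieved it.
    return float(energies[best]), z_candidates[best]
```

With `x = 2.0`, both `y = 2.0` and `y = -2.0` get zero energy (each is explained by a different `z`), while `y = 0.0` gets energy 4.0, so one input is compatible with several outputs.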

play41:38

So having latent variable models

play41:41

is really kind of a very simple thing in many ways.

play41:46

If you are a Bayesian or probabilist,

play41:48

instead of inferring a single value for z,

play41:50

you infer a distribution.

play41:53

But I might talk about this later a little bit.

play41:56

So depending on which architecture you're going to use

play41:58

for your system, it may or may not collapse.

play42:02

And so, if it can collapse,

play42:04

then you have to use one of those objective functions

play42:07

that prevent collapse either through contrastive training

play42:10

or through regularization.

play42:14

If you're a physicist,

play42:15

you probably already know that it's very easy

play42:18

to turn energies into probability distributions.

play42:22

You compute P of y given x,

play42:24

if you know the energy of x and y,

play42:26

you take the exponential of minus some constant times F of x,y

play42:29

and then you normalize by the integral

play42:31

over all the space of y, of the numerator.

play42:34

So you get a normalized distribution over y

play42:37

and that's a perfectly fine way

play42:38

of parameterizing a distribution if you really want.

play42:41

The problem of course, in a lot of statistical physics

play42:44

is that the denominator

play42:46

called the partition function is intractable.

play42:50

And so here I'm basically just circumventing the problem

play42:54

by directly manipulating the energy function

play42:57

and not worrying about the normalization.
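
Written out, the conversion described here is the Gibbs form below. The symbol \beta for the positive constant is a naming assumption; the talk only says "some constant".

```latex
P(y \mid x) = \frac{\exp\bigl(-\beta F(x, y)\bigr)}{\int \exp\bigl(-\beta F(x, y')\bigr)\, dy'}
```

The integral in the denominator is the partition function; working with F directly avoids having to compute it.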

play43:01

But basically, this idea of pushing down,

play43:05

pushing up the energy, minimizing the volume of stuff

play43:06

that can take low energy,

play43:08

that plays the same role of what would be normalization

play43:11

in a probabilistic model.

play43:15

I'm not going to go through this, it's an eye chart,

play43:18

you can take a picture if you want.

play43:19

This is basically a list of all kinds of classical methods

play43:22

as to whether they're contrastive or regularized.

play43:25

All of them can be interpreted

play43:26

as some sort of energy-based model

play43:28

that is either one or the other.

play43:35

And the idea that is used in LLMs,

play43:37

which is basically a particular version

play43:39

of something called denoising auto-encoder

play43:41

is a contrastive method.

play43:42

So the way we train LLMs today

play43:46

is contrastive, okay?

play43:48

We take a piece of data, we corrupt it

play43:51

and we train the system to reconstruct

play43:53

the missing information.

play43:55

That's actually a special case

play43:55

of something called a denoising auto-encoder,

play43:57

which is a very old idea

play44:00

that's been revived multiple times since then.

play44:09

And this framework can allow us to interpret

play44:11

a lot of classical models like K-means, sparse coding,

play44:15

things like that.

play44:16

But I don't want to spend too much time on this.

play44:20

You can do probabilistic inference,

play44:21

but I want to skip this.

play44:23

This is for these free energies

play44:24

and variational free energies and stuff like that.

play44:28

But here's the recommendations I'm making,

play44:30

abandon generative models

play44:32

in favor of those joint embedding architectures,

play44:34

abandon probabilistic modeling

play44:36

in favor of these energy-based models,

play44:37

abandon contrastive methods

play44:38

in favor of those regularized methods.

play44:41

And I'm going to describe one in a minute

play44:43

and also abandon reinforcement learning,

play44:45

but I've been saying this for 10 years.

play44:48

So those are the four most popular things

play44:50

in machine learning today,

play44:53

which doesn't make me very popular.

play45:00

So how do you train a JEPA with regularized methods?

play45:05

So there's a number of different methods,

play45:06

I'm going to describe two classes.

play45:08

One for which we really understand why it works

play45:10

and the other one it works,

play45:11

but we don't understand why, but it works really well.

play45:14

So the first class of method

play45:16

consists in basically preventing this collapse

play45:19

I was telling you about

play45:21

where the output of the encoder is constant

play45:24

or carries very little information about the input.

play45:27

So what we're going to do is have a criterion during training

play45:30

that tries to maximize the amount of information

play45:32

coming out of the encoders to prevent this collapse.

play45:37

And the bad news with this is that

play45:39

to maximize the information content

play45:40

coming out of a neural net,

play45:42

we would need some sort of lower bound

play45:43

on information content of the output

play45:46

and then push up on it, right?

play45:49

The bad news is that we don't have lower bounds

play45:50

on information content, we only have upper bounds.

play45:54

So we're going to need to cross our fingers,

play45:57

take an upper bound on information content, push it up,

play45:59

and hope that the actual information content follows.

play46:04

And it kind of works, it actually works really well,

play46:08

but it's not well-justified theoretically for that reason.

play46:13

How do we do this?

play46:13

So first thing we can do is make sure that

play46:16

the variables that come out of the encoders

play46:21

are not constant.

play46:23

So over a batch of samples, you want each variable

play46:26

of the output vector of the encoder

play46:28

to have some non-zero variance, let's say one, okay?

play46:31

So you have a cost function

play46:33

that says I really want the variance

play46:34

to be larger than one or standard deviation.

play46:38

Okay, still the system can produce a non-informative output

play46:41

by making all the outputs equal or highly correlated.

play46:45

Okay, so you have a second criterion that says,

play46:48

in addition to this, I want the different components

play46:51

of the output vector to be uncorrelated.

play46:53

So basically, I want a criterion

play46:54

that says I want to bring the covariance matrix

play46:57

of the vectors coming out of the encoder

play47:00

as close to the identity matrix as possible,

play47:04

but still is not enough,

play47:05

because you will get uncorrelated variables

play47:08

but they could still be very dependent.

play47:10

So there's another trick which consists in

play47:13

taking the representation vector sx

play47:14

and running it through a neural net

play47:15

that expands the dimension in a nonlinear way

play47:18

and then decorrelating those variables

play47:21

and we can show that under certain conditions

play47:22

this actually has the effect of

play47:24

making pairs of variables independent.

play47:27

Okay, not just uncorrelated.

play47:31

So there's a paper on this

play47:35

here on arXiv.

play47:38

Okay, so now we have a way of training one of those

play47:40

joint embedding architectures to prevent collapse.

play47:43

And it's really a regularized method.

play47:45

We don't need to have contrastive samples,

play47:46

we don't need to kind of pull things away from each other

play47:50

or anything like that.

play47:51

We just train it on training samples.

play47:53

And we have this criterion.

play47:55

Once we've trained that system,

play47:57

we can use the representation learned by the system,

play48:01

sorry, the representation learned by the system sx,

play48:04

and then feed this to a subsequent classifier

play48:08

that we can train supervised for a particular task.

play48:13

For example, object recognition, right?

play48:14

So we can train a linear classifier

play48:16

or something more sophisticated

play48:18

and I'm not going to bore you with the result,

play48:21

but every row here is a different way

play48:24

of doing self-supervised learning.

play48:25

Some of them are generative,

play48:26

some of them are joint embedding.

play48:28

They use different types of criteria,

play48:31

different types of distortions and corruption

play48:33

for the images.

play48:35

And the top systems give you 70% correct on ImageNet,

play48:39

when you train only the head on ImageNet,

play48:41

you don't fine-tune the entire network,

play48:44

you just use the features.

play48:47

And what's interesting about self-supervised learning

play48:49

is that those systems work really well.

play48:52

They don't require a lot of data

play48:54

to basically learn a new task.

play48:57

So it's really good for transfer learning

play48:58

or multitask learning or whatever it is.

play49:01

You learn generic features

play49:02

and then you use them as input to kind of a subsequent task,

play49:06

with sort of variations of this idea.

play49:08

So this method is called VICReg

play49:10

and that means

play49:11

variance, invariance, covariance regularization.

play49:14

Variance, covariance,

play49:15

because of this covariance matrix criterion.

play49:19

Invariance, because we want the representation

play49:21

of the corrupted and uncorrupted inputs to be identical.
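
The three criteria can be sketched as follows. This is an illustrative NumPy version, not the reference implementation: the hinge at a standard deviation of 1, the `eps` value, and the unit weights are assumptions, and the nonlinear expander network mentioned earlier is left out.

```python
import numpy as np

def variance_term(s, eps=1e-4):
    # Hinge on the per-dimension standard deviation over the batch:
    # penalize any dimension whose std falls below 1.
    std = np.sqrt(s.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, 1.0 - std)))

def covariance_term(s):
    # Sum of squared off-diagonal covariance entries, scaled by dimension,
    # which pushes the covariance matrix toward a diagonal one.
    n, d = s.shape
    centered = s - s.mean(axis=0)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

def invariance_term(sx, sy):
    # Mean squared distance between the two representations.
    return float(np.mean((sx - sy) ** 2))

def vicreg_style_loss(sx, sy, w_inv=1.0, w_var=1.0, w_cov=1.0):
    # Weights are illustrative assumptions, not tuned values.
    return (w_inv * invariance_term(sx, sy)
            + w_var * (variance_term(sx) + variance_term(sy))
            + w_cov * (covariance_term(sx) + covariance_term(sy)))
```

A fully collapsed batch (all rows identical) has zero invariance and covariance terms but a large variance term, so the total loss no longer rewards the constant solution.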

play49:26

There are versions of this that work for object detection

play49:29

and localization and stuff like that.

play49:31

But there is another set of methods

play49:33

and those, I have to admit that

play49:35

I don't completely understand why they work.

play49:39

There are people like Yuandong Tian at FAIR

play49:41

and Surya Ganguli at Stanford

play49:43

who claim they understand

play49:45

they'll have to explain this to me,

play49:46

because I'm not entirely convinced.

play49:48

And those are distillation methods.

play49:50

So you have two encoders,

play49:51

they have to be more or less identical

play49:52

in terms of architectures.

play49:54

Actually exactly identical,

play49:55

they need to have the same parameters.

play49:57

And you share the parameters between them.

play49:59

So there is something called weight EMA.

play50:02

EMA means exponential moving average.

play50:04

So the encoder on the right

play50:06

gets weights that are basically a running average

play50:11

with exponential decaying coefficient

play50:13

of the weight vectors produced by the encoder on the left

play50:17

as learning takes place.

play50:19

So it's kind of a smoothed-out version of the weights.
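
The weight EMA itself is a one-line update. In the sketch below, the names `online_weights` and `target_weights` (for the left and right encoders) and the decay of 0.99 are assumptions chosen for illustration.

```python
import numpy as np

def ema_update(target_weights, online_weights, decay=0.99):
    # The target encoder's weights are an exponential moving average of
    # the online encoder's weights; the target is not trained by gradient.
    return decay * target_weights + (1.0 - decay) * online_weights
```

Applied after every training step, this makes the target encoder trail the online encoder slowly: starting from zeros and tracking constant weights of one, a single update moves each weight to 0.01, and the target approaches one only after many updates.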

play50:24

And Surya and Yuandong

play50:26

have explanations why this

play50:27

prevents the system from collapsing.

play50:32

I encourage you to read that paper, if you can figure it out.

play50:36

And there's a number of different methods

play50:38

that use this kind of self-supervised pre-training

play50:43

and work really well.

play50:46

Older methods like Bootstrap Your Own Latent (BYOL) from DeepMind,

play50:48

SimSiam by FAIR,

play50:50

and then DINOv2, which is a one-year-old method

play50:54

by colleagues at FAIR in Paris,

play50:57

which is probably the best system

play50:58

that produces generic features for images.

play51:00

If you have a vision problem, you need some generic features

play51:03

to be fed to some classifiers.

play51:05

So you can train it with a small amount of data,

play51:07

using DINOv2.

play51:09

Today, that's the best thing we have.

play51:12

And it produces really nice features,

play51:14

really good performance

play51:15

with very small amounts of data for all kinds of things.

play51:19

You can train it to do segmentation,

play51:21

to do depth estimation, to do object recognition,

play51:26

to estimate the height of the tree canopy,

play51:29

on the entire earth,

play51:31

to detect tumors in chest x-rays,

play51:36

all kinds of stuff.

play51:37

That is open source,

play51:39

so a lot of people have been using it

play51:40

for all kinds of stuff.

play51:41

It's really cool.

play51:43

A particular instantiation

play51:45

of those distillation methods

play51:46

is something called I-JEPA.

play51:48

So this is a JEPA architecture

play51:51

that has been trained using this distillation method,

play51:53

but it's different from DINOv2.

play51:56

And this works extremely well,

play51:59

in fact, better than DINOv2 for the same amount of training

play52:04

and it's very fast to train as well.

play52:08

So this is the best method we have

play52:09

and it compares very favorably to competing methods that use

play52:14

generative models that are trained by reconstruction.

play52:17

So there's something called MAE, masked auto-encoder,

play52:21

which are the hollow squares here on this graph.

play52:27

Maybe I should show this one.

play52:29

So this is a method also developed at Meta at FAIR,

play52:32

but it works by reconstructing a photo, right?

play52:36

So you take a photo, you mask some parts of it

play52:39

and you train what amounts to an auto-encoder

play52:41

to reconstruct the parts that are missing.

play52:45

And it's very difficult to predict

play52:46

what's missing in an image,

play52:47

because you can have complicated textures

play52:51

and stuff like that.

play52:52

And in fact, this system is much more expensive to train

play52:56

and it doesn't work as well as

play52:58

these joint embedding methods, right?

play53:00

So the one lesson from this talk is

play53:03

generative methods for images are bad; they're good for text

play53:06

but not too good for images.

play53:08

Whereas joint embedding methods are good for images,

play53:11

not yet good for text.

play53:13

And the reason is images

play53:17

are high-dimensional and continuous.

play53:19

So generating them is actually hard.

play53:23

It's possible to produce image generation systems

play53:26

that produce nice images

play53:27

but they don't produce good

play53:29

internal representations of images.

play53:35

On the other hand, generative models for text work,

play53:38

because text is discrete.

play53:40

So language is simple, because it's discrete, essentially.

play53:44

There's this idea that language

play53:45

is kind of the most sophisticated stuff,

play53:46

because only humans can do it.

play53:48

In fact, it's simple.

play53:49

The real world is really what's hard.

play53:53

So I-JEPA works really well for all kinds of tasks

play53:56

and people have used this for all kinds of stuff.

play54:00

There's some mathematics to do here,

play54:01

which I'm going to have to skip.

play54:05

To talk about V-JEPA.

play54:06

So this is a version of I-JEPA but for video

play54:10

that was put online fairly recently.

play54:13

And there the idea is you take a piece of video,

play54:16

you mask part of it

play54:18

and again you train one of those

play54:20

joint embedding architectures

play54:21

to basically predict the representation

play54:25

of the full video from the representation

play54:26

of the partially masked or corrupted video.
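
As a rough illustration of predicting in representation space rather than in pixel space, here is a forward-only NumPy sketch of a JEPA-style loss. Everything in it is an assumption made for the example: the one-layer `encode` function, the flattened toy "video" array, and the names `jepa_loss`, `w_context`, `w_target`, and `w_pred`. It shows only how the loss is formed, not how the actual V-JEPA model is built or trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Toy one-layer encoder standing in for a deep network."""
    return np.tanh(x @ w)

def jepa_loss(x, mask, w_context, w_target, w_pred):
    """Distance between predicted and actual representations of the full input.

    x:    batch of flattened clips, shape (batch, features)
    mask: same shape as x, 1 where the input is visible, 0 where it is masked
    """
    s_context = encode(x * mask, w_context)  # representation of the masked clip
    s_target = encode(x, w_target)           # representation of the full clip
    s_pred = s_context @ w_pred              # predictor acts on representations
    # In training, s_target would be treated as a constant (no gradient),
    # with w_target typically an EMA copy of w_context.
    return np.mean((s_pred - s_target) ** 2)

# Toy usage: mask the second half of each clip's features.
x = rng.normal(size=(4, 6))
mask = np.ones_like(x)
mask[:, 3:] = 0.0
w_context = rng.normal(size=(6, 3))
w_target = w_context.copy()
w_pred = np.eye(3)
loss = jepa_loss(x, mask, w_context, w_target, w_pred)
```

The point of the sketch is that no pixels are ever reconstructed: the error is measured between two representation vectors.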

play54:31

And this works really well in the sense that

play54:41

when you take the representation learned by that system,

play54:43

you feed it to a classifier

play54:45

to basically classify the action

play54:48

that is taking place in the video.

play54:50

You get really good performance

play54:51

and you get better performance than any other

play54:53

self-supervised learning technique.

play54:56

When you have a lot of training data,

play54:57

it doesn't work as well as purely supervised

play55:00

with all kinds of tricks and data augmentation,

play55:02

but it comes really close

play55:05

and it doesn't require labeled data or not much.

play55:08

So that's kind of a big breakthrough a little bit.

play55:13

The fact that we can train a system to learn from video

play55:16

in a self-supervised manner,

play55:18

because now we might be able to use this

play55:19

to learn world models, right?

play55:22

Where the masking of the video is,

play55:26

we take a video, mask the second half of it

play55:29

and ask the system to predict what's going to happen,

play55:31

feeding it an action that is being taken in the video.

play55:34

If you have that, you have a world model.

play55:35

If you have a world model,

play55:36

you can put it in a planning system.

play55:38

If you can have a system that can plan,

play55:40

then you might have systems that are a lot smarter

play55:44

than current systems and they might be able to plan actions,

play55:47

not just words.

play55:51

They're not going to predict autoregressively anymore.

play55:54

They're going to plan their answer kind of like what we do,

play55:57

like we speak,

play55:59

we don't produce one word after the other without thinking.

play56:01

We usually kind of plan what we're going to say in advance,

play56:06

at least some of us do.

play56:14

So this works really well in the sense that

play56:15

we get really good performance

play56:17

on lots of different types of video

play56:20

for classifying the action and various other tasks,

play56:23

better than basically anything else

play56:25

that people have tried before.

play56:26

Certainly better than any system

play56:29

that has been trained on video.

play56:30

And this, the pre-training here

play56:31

is on a relatively small amount of video actually,

play56:33

it's not a huge dataset, this is speed.

play56:38

So this is reconstructions of missing parts of a video

play56:44

by that system

play56:45

and it's done by training a separate decoder, right?

play56:47

So it's not part of the initial training,

play56:49

but in the end we can use the representation

play56:51

as input to a decoder

play56:52

that we trained to reconstruct

play56:53

the part of the image that's missing.

play56:55

And these are the results of completion. Basically

play56:59

the entire middle of the image is missing

play57:02

and the system is kind of filling in things

play57:04

that are reasonable.

play57:07

It's a cooking video and there's a hand

play57:10

and a knife and some ingredients.

play57:15

Okay, there is another topic I want to talk about,

play57:17

because I know there are mathematicians and physicists

play57:19

in the room.

play57:20

This is a recent paper, a collaboration between

play57:24

some of us at FAIR

play57:25

and Bobak Kiani,

play57:30

who is a student at MIT with Seth Lloyd

play57:33

and a bunch of people from MIT.

play57:35

So this system is basically using this idea

play57:39

of joint embedding to learn something about

play57:42

partial differential equations

play57:44

that we observe through a solution.

play57:46

So look at the thing at the bottom.

play57:48

We have a PDE, Burgers' equation.

play57:52

What you see are space-time diagrams, basically,

play57:57

of a solution of that PDE.

play58:00

And what we're going to do is we're going to take two windows,

play58:04

separate windows on the solution of that PDE, okay?

play58:08

And of course, the solution depends on

play58:09

the initial condition.

play58:10

You're going to get different solutions

play58:11

for different initial conditions, right?

play58:13

So we're going to take two windows over two different solutions

play58:17

to that PDE, and we're going to do a joint embedding.

play58:20

So we're going to train an encoder to produce representations,

play58:24

so that the representation can be predicted,

play58:26

the representation for one piece of the solution

play58:29

can be predicted from a representation from the other piece.

play58:34

And what the system ends up doing in that case

play58:36

is basically predict or represent

play58:39

the coefficient of the equation that is being solved, right?

play58:43

The only thing that's common between one region

play58:47

of the space-time solution of the PDE

play58:50

and another region,

play58:52

is that it's the same equation with the same coefficient.

play58:54

What's different is the initial condition.

play58:56

But the equation itself is the same, right?

play58:59

So the system basically discovers some representation

play59:02

and when we train now a supervised system

play59:04

to predict the coefficient of the equation,

play59:08

it actually does a really good job.

play59:09

In fact it does a better job than if we train it

play59:12

completely supervised from scratch.