Ilya Sutskever: OpenAI Meta-Learning and Self-Play | MIT Artificial General Intelligence (AGI)

Lex Fridman
25 Apr 2018 · 60:15

Summary

TLDR: In this talk, Ilya Sutskever surveys recent progress in deep learning and AI, covering meta-learning, reinforcement learning, and self-play. He explains why deep learning works at all — backpropagation makes it feasible to find good neural networks that satisfy the constraints imposed by the data — and discusses the exploration problem in reinforcement learning and how agents can learn from failure. He argues that self-play can drive rapid increases in agent competence, and closes with a look at where AI may go next, including societies of agents and the importance of conveying goals to systems that may become far more capable than we are.

Takeaways

  • 🤖 Deep learning and meta-learning are making notable progress, but many challenges remain.
  • 🧠 Deep learning's success is partly grounded in the idea of finding the shortest program that captures the regularities in the data, even though doing so is computationally intractable.
  • 🔄 Backpropagation is the key algorithm of deep learning, even though it differs from how the brain works.
  • 🧠 Neural networks are trained by iteratively making small changes to the network until it satisfies the constraints imposed by the data.
  • 🤔 Reinforcement learning is a framework for evaluating an agent's ability to achieve goals in complex, stochastic environments.
  • 🎯 Reinforcement learning algorithms aim to maximize expected reward, though some applications may also need to consider the variance of the reward.
  • 🤖 Meta-learning ("learning to learn") is a promising direction, even though it does not fully work yet.
  • 📈 With simulation plus meta-learning, AI can learn in simulation and transfer that knowledge to a physical robot.
  • 🔄 Self-play is an emerging research approach that can drive rapid growth in agent competence.
  • 🌐 Language understanding and generative language models remain a key challenge with large room for improvement.
  • 🚀 Future AI will have profound societal impact; making sure AI goals align with human values is, in part, a political problem.

Q & A

  • What are Ilya Sutskever's major contributions to AI?

    -Ilya Sutskever is a co-founder and the research director of OpenAI, and a major influence on deep learning and artificial intelligence. His work over the past five years has been cited more than forty-six thousand times, and he has been the key creative intellect and driving force behind some of the biggest breakthrough ideas in deep learning and AI.

  • Why does deep learning work?

    -Deep learning works because of a mathematical result: if you could find the shortest program that performs very well on your data, you would achieve the best possible generalization. Intuitively, extracting all the regularity in the data and encoding it into a program lets you make the best possible predictions. Although such a program exists in principle, finding the best short program is computationally intractable with today's tools and understanding.

  • What is meta-learning, and what are its promise and challenges?

    -Meta-learning means training algorithms to learn how to learn. Its promise is systems that can adapt quickly to new tasks. Its challenge is that the training and test task distributions must match; in the real world, new test tasks often differ from the training distribution, and meta-learning can struggle in that situation.

  • How does reinforcement learning work?

    -Reinforcement learning is a framework for evaluating an agent's ability to achieve goals in a complex, stochastic environment. The agent interacts with the environment, tries new behaviors, and adjusts its policy based on the results: if an outcome exceeds expectations, the agent takes more of those actions in the future.

  • What role does self-play play in AI?

    -Self-play is a way of training AI systems by having them compete against and learn from copies of themselves, without external data. It has produced striking successes, including AlphaGo Zero in Go and OpenAI's Dota 2 bot.

  • How can a policy trained in simulation be transferred to a physical robot?

    -By introducing a large amount of variability (randomized physics) into the simulator, the policy is forced to become adaptive. When deployed in the physical world, the policy then adapts to the new environment's physics through trial and error.

  • How should we think about the training of deep neural networks?

    -Training a deep neural network can be seen as solving a circuit-search problem: you iteratively make small changes to the network's weights until its predictions satisfy the data. This is profound because gradient descent pushes the information from the equations (the data constraints) into the parameters so that they are all satisfied.

  • How should we understand exploration in reinforcement learning?

    -Exploration means trying new behaviors when the agent does not yet know what to do. It matters because an agent can only learn if its attempts occasionally yield reward, so designing reward functions that provide gradual, incremental reward is crucial — even a poorly performing system can then get reward and learn from it.

  • How can we infer other agents' goals and strategies by observing them?

    -By observing other agents' behavior we can infer their goals and strategies; humans do this at a scale and scope very different from other animals. In non-competitive settings, observing and imitating others can be an effective learning strategy.

  • How do we make sure an AI system's goals match human intentions?

    -Aligning an AI system's goals with human expectations is a technical problem, but also a major political one. Technically, we need algorithms that can understand and pursue goals specified by humans; at the broader societal level, we must decide what the right goals are and ensure that systems actually act according to them.

Outlines

00:00

🤖 Breakthroughs in AI and Deep Learning

This segment introduces Ilya Sutskever, co-founder and research director of OpenAI. His work has had an enormous impact on deep learning and artificial intelligence, with papers from the past five years cited more than forty-six thousand times. He is considered the key creative force behind several major breakthroughs in deep learning.

05:01

🧠 Why Deep Learning Works

This part explains the principle behind deep learning: a mathematical result says that finding the shortest program that fits the data would yield the best generalization, but finding such a program is computationally intractable. Fortunately, using backpropagation on small circuits (neural networks), we can still find solutions in practice.

10:01

🔄 Reinforcement Learning and Achieving Goals

This section covers reinforcement learning, a framework for evaluating an agent's ability to achieve goals in complex, stochastic environments. Reinforcement learning algorithms work by running the agent many times and computing its average reward. It also raises a key gap in the framework: in the real world, rewards are not handed to us by the environment but are inferred from observations.

15:02

🧠 Meta-Learning: Concept and Applications

The goal of meta-learning is to develop algorithms that learn how to learn. The approach trains a system on many tasks so that it can solve new tasks quickly. One success story is fast character recognition on a dataset from MIT; another is Google's neural architecture search, where architectures found on small problems go on to solve large problems.

20:04

🤹‍♂️ Self-Play and the Evolution of Agents

Self-play lets agents improve by competing against each other. A classic example is TD-Gammon, which learned through self-play and eventually beat the world backgammon champion. A key advantage of self-play is that it creates an ever-changing environment that keeps presenting agents with new challenges.

25:05

🎮 Reinforcement Learning in Games

This part discusses reinforcement learning in games, in particular OpenAI's Dota 2 bot. Through self-play, the bots went from playing randomly to world-championship level within a few months — rapid progress that shows self-play is a powerful learning mechanism.

30:06

🤖 Conveying Goals to AI and the Alignment Problem

This part looks at how to communicate goals to agents and how to make sure their behavior matches our expectations. One approach trains agents from human feedback: a human watches pairs of behaviors and picks the better one. Both the technical and the political sides of this problem are hard.

35:07

🌟 The Future of AI

The final part looks ahead, in particular at how agents could learn in software environments and then apply those skills to real-world tasks. It discusses the potential of self-play environments and how continued training and adaptation to new environments can raise agent competence.

Keywords

💡Deep Learning

Deep learning is a machine learning technique that builds multi-layer neural networks loosely inspired by how the brain processes information. In the video, Ilya Sutskever describes deep learning as a major breakthrough in AI, motivated by the idea that the shortest program explaining the data generalizes best. Through many training iterations, a neural network gradually improves its predictions and achieves strong results on complex tasks such as image recognition and natural language processing.

💡Meta-Learning

Meta-learning, also called learning to learn, means designing algorithms and models that can learn and adapt quickly across many tasks. In the video, Sutskever emphasizes training a system on many tasks so that it can pick up new tasks quickly; the goal is systems that can improve themselves and adapt to new environments.

💡Reinforcement Learning

Reinforcement learning is a machine learning paradigm in which an agent learns to achieve goals by interacting with an environment, adjusting its policy based on the rewards or penalties its actions produce. In the video it is described as a framework for evaluating an agent's ability to achieve goals in complex, stochastic environments.

💡Self-Play

Self-play is a training strategy in which copies of the same algorithm or model compete against each other to improve. A model keeps making progress by competing with itself and learning from its opponent's strategies. In the video, Sutskever explains how self-play can reach very high levels of performance without any external data.

💡Neural Network

A neural network is a computational model loosely modeled on the brain's neurons, used to recognize patterns and process complex data. In the video, Sutskever discusses the central role of neural networks in deep learning and how adjusting the network's weights lets it learn the regularities in the data.

💡Backpropagation

Backpropagation is an efficient method for computing the gradients of a neural network's weights, used to train the network by propagating errors from the output layer back toward the input layer. In the video, Sutskever notes that although its direct counterpart in the biological brain is unclear, it is the key algorithm that solves the circuit-search problem at the heart of deep learning.

💡Policy Gradient

Policy gradient methods are reinforcement learning algorithms that parameterize the policy (the rule by which the agent chooses actions) directly and compute the gradient of expected reward with respect to those parameters. In the video, Sutskever notes that policy gradient algorithms let the agent try new actions and then adjust the probability of those actions according to the outcome.
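
The "try actions and increase the log-probability of the ones that worked" rule corresponds to the standard score-function (REINFORCE) form of the policy gradient; one common way to write it is:

```latex
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\big[ R(\tau) \big]
  \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
```

In practice the return R(τ) is usually replaced by a reward-to-go or advantage estimate to reduce the variance of this estimator.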

💡Q-Learning

Q-learning is a value-function method in reinforcement learning that learns an action-value function (the Q function) estimating the expected return of taking a given action in a given state. In the video, Sutskever points out that Q-learning-style algorithms can learn from any data, not only data generated by the agent itself, which gives them a different robustness profile.
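
To make the value-function idea concrete, here is a minimal tabular Q-learning sketch (not from the talk); the `env` object is assumed to follow the classic Gym-style interface (`reset()` returning a state index, `step()` returning `(next_state, reward, done, info)`), and all hyperparameters are arbitrary.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Toy tabular Q-learning for a discrete Gym-style environment."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration: occasionally try something new
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            # Off-policy update: bootstrap from the best next action,
            # regardless of what the behaviour policy would actually do next.
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Because the update bootstraps from `max Q(s', ·)` rather than from the action actually taken, the same update can digest transitions produced by any behavior — the off-policy property mentioned above.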

💡Goal Specification

Goal specification means defining a clear objective or task for an agent so that it can learn and adapt to achieve it. In the video, Sutskever discusses setting goals through human feedback and making sure an agent's goals stay consistent with human values and intentions.

💡Cooperation

Cooperation means multiple agents or individuals working together toward a shared goal or benefit. In the video, Sutskever mentions cooperation as an important concept in biological evolution and suggests that, in multi-agent environments, cooperation can be a successful strategy.

Highlights

Ilya Sutskever, co-founder and research director of OpenAI, discusses the impact of deep learning and AI.

Sutskever's work has been cited over 46,000 times, showcasing his influence in the field.

The concept of finding the shortest program to achieve the best generalization in machine learning is introduced.

The computational intractability of finding the best short program is discussed, highlighting the limitations of current AI tools.

The discovery that small circuits can be optimized using backpropagation is highlighted as a foundational AI principle.

Neural networks are likened to parallel computers capable of complex computations and reasoning.

The potential of reinforcement learning to achieve goals in complex environments is explored.

Meta-learning, or learning to learn, is introduced as a promising but not fully realized concept.

The importance of representation learning and unsupervised learning for identifying high-level states in meta-learning is emphasized.

Self-play is presented as a powerful method for training AI agents, leading to rapid increases in competence.

The potential societal implications and alignment issues of superintelligent AI are discussed.

The use of human feedback for training AI, such as through reinforcement learning, is highlighted.

The limitations of meta-learning, particularly the requirement for training and test distributions to match, are noted.

The potential for AI to develop social skills, language, and other human-like traits through multi-agent interaction is explored.

The importance of continuous learning and adapting to new environments is emphasized for AI agents.

The challenges of conveying goals to AI agents and aligning their objectives with human values are discussed.

The potential for AI to develop strategies and behaviors through self-organization and interaction with other agents is explored.

The role of complexity theory in understanding the problems that AI can solve and the limitations thereof is examined.

The future of generative language models and the importance of scaling up current models is discussed.

The potential use of evolutionary strategies in reinforcement learning for small, compact objects is mentioned.

The necessity of accurate physical world modeling and simulation for training AI agents is questioned.

Transcripts

00:00

Welcome back to 6.S099: Artificial General Intelligence. Today we have Ilya Sutskever, co-founder and research director of OpenAI. He started in the machine learning group in Toronto with Geoffrey Hinton, then at Stanford with Andrew Ng, co-founded DNNresearch, spent three years as a research scientist at Google Brain, and finally co-founded OpenAI. Citations aren't everything, but they do indicate impact, and his recent work in the past five years has been cited over forty-six thousand times. He has been the key creative intellect and driver behind some of the biggest breakthrough ideas in deep learning and artificial intelligence. So please welcome Ilya.

Alright, thanks for the introduction, Lex.

01:01

Thanks for coming to my talk. I will tell you about some work we've done over the past year on meta-learning and self-play at OpenAI, and before I dive into the more technical details of the work, I want to spend a little bit of time talking about deep learning and why it works at all in the first place, which I think is actually not a self-evident thing.

One fact — it's actually a mathematical theorem that you can prove — is that if you could find the shortest program that does very well on your data, then you will achieve the best generalization possible. With a little bit of modification you can turn it into a precise theorem, and on a very intuitive level it's easy to see why it should be the case: if you have some data and you're able to find a shorter program which generates this data, then you've essentially extracted all conceivable regularity from this data into your program, and you can then use that object to make the best predictions possible. If you have data which is so complex that there is no way to express it as a shorter program, then it means that your data is totally random and there is no way to extract any regularity from it whatsoever. Now, there is a little-known mathematical theory behind this, and the proofs of these statements are actually not even that hard. The one slight disappointment is that it's not actually possible, at least given today's tools and understanding, to find the best short program that explains or generates or solves your problem given your data; this problem is computationally intractable. The space of all programs is a very nasty space: small changes to your program result in massive changes in the behavior of the program, as it should be — it makes sense: you have a loop, you change the inside of the loop, of course you get something totally different. So the space of programs is so hard that, at least given what we know today, searching it seems to be completely off the table.

Well, if we give up on short programs, what about small circuits? It turns out that we are lucky: when it comes to small circuits, you can just find the best small circuit that solves the problem using backpropagation, and this is the miraculous fact on which the rest of AI stands. It is the fact that when you have a circuit and you impose constraints on your circuit using data, you can find a way to satisfy these constraints by iteratively making small changes to the weights of your neural network until its predictions satisfy the data. What this means is that the computational problem solved by backpropagation is extremely profound: it is circuit search. Now, we don't know that you can solve it always, but you can solve it sometimes, and you can solve it in those cases where we have a practical data set. It is easy to design artificial data sets for which you cannot find the best neural network, but in practice that seems not to be a problem.

You can think of training a neural network as solving a neural equation, where you have a large number of equation terms of the form f(x_i, θ) = y_i. You've got your parameters, which represent all your degrees of freedom, and you use gradient descent to push the information from these equations into the parameters so as to satisfy them all. And you can see that a neural network — let's say one with 50 layers — is basically a parallel computer that is given 50 time steps to run, and you can do quite a lot with 50 time steps of a very powerful, massively parallel computer. For example, I think it is not widely known that you can learn to sort n n-bit numbers using a modestly sized neural network with just two hidden layers, which is not bad. It's not self-evident, especially since we've been taught that sorting requires log n parallel steps; with a neural network you can sort successfully using only two parallel steps, so something slightly miraculous is going on. Now, these are parallel steps of threshold neurons, so they're doing a little bit more work — that's the answer to the mystery — but if you've got 50 such layers, you can do quite a bit of logic, quite a bit of reasoning, all inside the neural network, and that's why it works. Given the data, we are able to find the best neural network, and because the neural network is deep, because it can run computation inside its layers, the best neural network is worth finding. That's really what you need: you need a model class which is worth optimizing, but it also needs to be optimizable, and deep neural networks satisfy both of these constraints. This is why everything works; this is the basis on which everything else resides.
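
To make the "neural equation" picture concrete, here is a minimal sketch (not from the talk) that fits a tiny two-layer network so that f(x_i, θ) ≈ y_i using backpropagation and gradient descent; the synthetic data, network size, and hyperparameters are arbitrary illustrations.

```python
# Fit parameters theta so that f(x_i, theta) ~= y_i for every training pair,
# by pushing the error from the "equations" back into the parameters.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 2))          # inputs x_i
y = (X[:, 0] * X[:, 1] > 0).astype(float)      # targets y_i ("same sign" rule)

W1 = rng.normal(0, 0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)                   # hidden layer: one "parallel step"
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))       # output probability
    return h, p.ravel()

lr = 0.5
for step in range(2000):
    h, p = forward(X)
    grad_logits = (p - y)[:, None] / len(X)    # error on each "equation"
    # Backpropagation: propagate the error back through the circuit.
    gW2 = h.T @ grad_logits;  gb2 = grad_logits.sum(0)
    grad_h = grad_logits @ W2.T * (1 - h ** 2) # tanh derivative
    gW1 = X.T @ grad_h;       gb1 = grad_h.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

_, p = forward(X)
print("training accuracy:", ((p > 0.5) == y).mean())
```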

06:45

Now I want to talk a little bit about reinforcement learning. Reinforcement learning is a framework for evaluating agents on their ability to achieve goals in complicated stochastic environments. You've got an agent which is plugged into an environment, as shown in the figure right here, and for any given agent you can simply run it many times and compute its average reward. Now, the thing that's interesting about the reinforcement learning framework is that there exist interesting, useful reinforcement learning algorithms. The framework existed for a long time; it became interesting once we realized that good algorithms exist. These are not perfect algorithms, but they are good enough to do interesting things. The mathematical problem is one where you need to maximize the expected reward.

One important way in which the reinforcement learning framework is not quite complete is that it assumes that the reward is given by the environment. You see this picture: the agent sends an action, and the environment sends back both an observation and a reward; that's what the environment communicates back. The way in which this is not the case in the real world is that we figure out what the reward is from the observation — we reward ourselves. We are not told; the environment doesn't say "hey, here's some negative reward." It's our interpretation of our senses that lets us determine what the reward is. And there is only one real, true reward in life, and that is existence or nonexistence; everything else is a corollary of that.

So what should our agent be? You already know the answer: it should be a neural network, because whenever you want to do something dense, it's going to be a neural network. You want the agent to map observations to actions, so you let it be parametrized with a neural net and you apply a learning algorithm. I want to explain to you how reinforcement learning works. This is model-free reinforcement learning, the kind that has actually been used in practice everywhere. It's very robust, it's very simple, and it's also not very efficient. The way it works is the following — this is literally the one-sentence description of what happens: in short, try something new, add randomness to your actions, and compare the result to your expectation. If the result surprises you, if you find that the results exceeded your expectation, then change your parameters to take those actions in the future. That's it; this is the full idea of reinforcement learning. Try it out, see if you like it, and if you do, do more of that in the future. That's literally it; this is the core idea. It turns out it's not difficult to formalize mathematically, but this is really what's going on. In a regular neural network you might say: okay, what's the goal? You run the neural network, you get an answer, you compare it to the desired answer, and whatever difference you have between those two, you send it back to change the neural network — that's supervised learning. In reinforcement learning, you run your neural network, you add a bit of randomness to your action, and then if you like the result, your randomness turns into the desired target, in effect. So that's it — trivial.

Now, the math exists. Without explaining what these equations mean — the point is not really to derive them but just to show that they exist — there are two classes of reinforcement learning algorithms. One of them is the policy gradient, where basically what you do is take this expression, the expected sum of rewards, and just crunch through the derivatives: you expand the terms, you do some algebra, and you get a derivative. Miraculously, the derivative has exactly the form that I told you, which is: try some actions, and if you like them, increase the log probability of those actions. That truly follows from the math, and it's very nice when the intuitive explanation has a one-to-one correspondence to what you get in the equation, even though you have to take my word for it if you are not familiar with it — that's the equation at the top. Then there is a different class of reinforcement learning algorithms, which is a little bit more difficult to explain: the Q-learning based algorithms. They are a bit less stable and a bit more sample efficient, and they have the property that they can learn not only from the data generated by the actor but from any other data as well, so they have a different robustness profile. This will be a little bit important later, but it's only a technicality — this is the on-policy versus off-policy distinction. It's a little bit technical, so if you find it hard to understand, don't worry about it; if you already know this, then you already know it.

So what's the potential of reinforcement learning? What is it actually, and why should we be excited about it? There are two reasons. The reinforcement learning algorithms of today are already useful and interesting, and especially if you have a really good simulation of your world, you can train agents to do lots of interesting things. But what's really exciting is if you can build a super amazing, sample-efficient reinforcement learning algorithm: you just give it a tiny amount of data, and the algorithm crunches through it and extracts every bit of entropy out of it in order to learn in the fastest way possible. Today our algorithms are not particularly efficient — they are data inefficient — but as our field keeps making progress, this will change.
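
As a concrete illustration of the policy-gradient class described above, here is a minimal REINFORCE sketch on a toy corridor task; the environment, the softmax-linear policy, and all hyperparameters are illustrative assumptions rather than anything from the talk.

```python
# "Try actions, compare the return to your expectation, and make the actions
# that beat it more likely" — REINFORCE with a simple running baseline.
import numpy as np

rng = np.random.default_rng(0)
N, GOAL = 5, 4                                  # corridor states 0..4, reward at 4

def run_episode(theta, max_steps=20):
    s, states, actions, R = 0, [], [], 0.0
    for _ in range(max_steps):
        x = np.eye(N)[s]                        # one-hot state features
        logits = theta @ x                      # two logits: move left / right
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(2, p=p)
        states.append(s); actions.append(a)
        s = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
        if s == GOAL:
            R = 1.0
            break
    return states, actions, R

theta, baseline, lr = np.zeros((2, N)), 0.0, 0.5
for episode in range(300):
    states, actions, R = run_episode(theta)
    advantage = R - baseline                    # "did it exceed my expectation?"
    baseline += 0.05 * (R - baseline)
    for s, a in zip(states, actions):
        x = np.eye(N)[s]
        logits = theta @ x
        p = np.exp(logits - logits.max()); p /= p.sum()
        grad_logp = -p[:, None] * x             # gradient of log pi(a|s)
        grad_logp[a] += x
        theta += lr * advantage * grad_logp     # increase log-prob of good actions

print("P(move right) per state:", np.round(np.exp(theta[1]) / np.exp(theta).sum(0), 2))
```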

12:58

Next, I want to dive into the topic of meta-learning. Meta-learning is a beautiful idea that doesn't really work — but it kind of works, and it's really promising. So what's the dream? We have some learning algorithms; perhaps you could use those learning algorithms in order to learn to learn. It would be nice if we could learn to learn. How would you do that? You would take a system and train it not on one task but on many tasks, and you ask that it learn to solve these tasks quickly — and that may actually be enough. Here's how most traditional meta-learning works: you have a model, which is a big neural network, and what you do is that instead of training cases you have training tasks, and instead of test cases you have test tasks. So your input, instead of being just your current test case, would be all the information about the test task plus the test case, and you try to output the prediction or action for that test case. Basically you say: I'm going to give you your ten examples as part of the input to your model — figure out how to make the best use of them. It's a really straightforward idea: you turn the neural network into the learning algorithm by turning a training task into a training case. Training task becomes training case — that is meta-learning in one sentence.

There have been several success stories which I think are very interesting. One success story of meta-learning is learning to recognize characters quickly: there is a dataset produced at MIT by Lake et al. with a large number of different handwritten characters, and people have been able to train extremely strong meta-learning systems for this task. Another very successful example of meta-learning is neural architecture search, from Google, where they found a neural architecture that solved one small problem well, and it then generalized and successfully solved larger problems as well. This is the small-number-of-bits kind of meta-learning, where you learn the architecture — or maybe even learn a small program or learning algorithm — which you then apply to new tasks; that's the other way of doing meta-learning. But the point is, what's really happening in meta-learning in most cases is that you turn a training task into a training case and pretend this is totally normal deep learning. That's it; this is the entirety of meta-learning — everything else is just minor details.
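
Here is a minimal sketch of the "training task becomes training case" recipe: each example fed to the meta-learner packs a small support set together with a query, and the target is the query's answer. The synthetic sinusoid tasks below are an illustrative assumption, not the Omniglot character setup mentioned in the talk.

```python
# Turn tasks into cases: the meta-learner's input is (support set + query),
# and its target is the query's label. Then train with ordinary deep learning.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A 'task' is a random sinusoid y = A * sin(x + phase)."""
    A, phase = rng.uniform(0.5, 2.0), rng.uniform(0, np.pi)
    return lambda x: A * np.sin(x + phase)

def make_meta_example(k_shot=10):
    f = sample_task()
    xs = rng.uniform(-3, 3, size=k_shot + 1)
    ys = f(xs)
    support = np.stack([xs[:k_shot], ys[:k_shot]], axis=1).ravel()  # k (x, y) pairs
    query_x, query_y = xs[-1], ys[-1]
    model_input = np.concatenate([support, [query_x]])   # what the network sees
    return model_input, query_y                          # what it must predict

# A meta-training batch looks like an ordinary supervised batch.
batch = [make_meta_example() for _ in range(32)]
X = np.stack([inp for inp, _ in batch])
y = np.array([target for _, target in batch])
print(X.shape, y.shape)   # (32, 21) inputs, (32,) targets
```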

16:04

Next I want to dive in. Now that I've finished the introduction section, I want to start discussing different work by different people from OpenAI, and I want to start by talking about hindsight experience replay. This has been a large effort, by Andrychowicz et al., to develop a learning algorithm for reinforcement learning that doesn't solve just one task but solves many tasks, and that learns to make use of its experience in a much more efficient way. I want to discuss one problem in reinforcement learning — actually a set of problems which are all related to each other. One really important thing you need to learn to do is to explore: you start out in an environment, you don't know what to do — what do you do? One very important thing that has to happen is that you must get rewards from time to time. If you try something and you don't get rewards, then how can you learn? That's the crux of the problem: how do you learn? And relatedly, is there any way to meaningfully benefit from the experience of your attempts, from your failures? If you try to achieve a goal and you fail, can you still learn from it?

The idea is that instead of asking your algorithm to achieve a single goal, you want to learn a policy that can achieve a very large family of goals. For example, instead of reaching one state, you want to learn a policy that reaches every state of your system. What's the implication? Anytime you do something, you achieve some state. So let's suppose you say "I want to achieve state A," you try your best, and you end up achieving state B. You can either conclude "well, that was disappointing, I haven't learned almost anything, I still have no idea how to achieve state A," or alternatively you can say "wait a second, I've just reached a perfectly good state, which is B — can I learn how to achieve state B from my attempt to achieve state A?" The answer is yes, you can, and it just works. I just want to point out that there is a small subtlety here which may be interesting to those of you who are very familiar with the distinction between on-policy and off-policy: when you try to achieve A, you're doing on-policy learning for reaching state A, but you're doing off-policy learning for reaching state B, because you would take different actions if you were actually trying to reach B. That's why it's very important that the algorithm you use here can support off-policy learning, but that's a minor technicality. The crux of the idea is that you make the problem easier by ostensibly making it harder: by training a system which aspires to learn to reach every state, to achieve every goal, to master its environment in general, you build a system which always learns something. It learns from success as well as from failure, because if it tries to do one thing and it does something else, it now has training data for how to achieve that something else.

I want to show you a video of how this works in practice. One challenge in reinforcement learning systems is the need to shape the reward. What does that mean? It means that at the start of learning, the system doesn't know much and will probably not achieve your goal, so it's important to design your reward function to give it gradual increments, to make it smooth and continuous, so that even when the system is not very good it still gets some reward. If you give your system a very sparse reward, where the reward is received only when you reach a final state, then it becomes very hard for normal reinforcement learning algorithms to solve the problem, because you naturally never get the reward, so you never learn — no reward means no learning. But here, because you learn from failure as well as from success, this problem simply doesn't occur. And so this is nice — let's look at the videos a little bit more; it's nice how it confidently and energetically moves the little green puck to its target, and here's another one. (We can skip the part showing that it works on the physical robot as well.) I think the point is that the hindsight experience replay algorithm is directionally correct, because you want to make use of all your data and not only a small fraction of it. Now, one huge question is: where do you get the high-level states? Where do the high-level states come from? Because in the work I'm showing you so far, the system is asked to achieve low-level states. I think one thing that will become very important for these kinds of approaches is representation learning and unsupervised learning: figuring out what the right states are, what the state space of goals worth achieving is.
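
A minimal sketch of the relabeling trick at the core of hindsight experience replay: every episode is stored twice, once under the goal that was requested and once under the goal that was actually reached, so even failures yield reward signal. The transition format and the exact-match sparse reward are simplifying assumptions, not the paper's implementation.

```python
def her_relabel(trajectory, reached_goal, reward_fn):
    """trajectory: list of (state, action, next_state, goal) transitions."""
    return [(s, a, s_next, reached_goal, reward_fn(s_next, reached_goal))
            for s, a, s_next, _ in trajectory]

def store_episode(replay_buffer, trajectory, reward_fn):
    # Store the episode under the goal we were asked to reach...
    for s, a, s_next, g in trajectory:
        replay_buffer.append((s, a, s_next, g, reward_fn(s_next, g)))
    # ...and again, relabeled with the goal we actually reached ("hindsight").
    reached_goal = trajectory[-1][2]
    replay_buffer.extend(her_relabel(trajectory, reached_goal, reward_fn))

# Example with integer states and an exact-match sparse reward:
buffer = []
sparse_reward = lambda state, goal: 1.0 if state == goal else 0.0
episode = [(0, "right", 1, 5), (1, "right", 2, 5), (2, "up", 7, 5)]  # wanted 5, reached 7
store_episode(buffer, episode, sparse_reward)
print(len(buffer), "transitions stored: 3 original (reward 0) + 3 relabeled")
```

Because the relabeled copies describe behavior produced while chasing a different goal, the learner consuming this buffer must be an off-policy algorithm, which is the subtlety noted above.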

21:43

Now I want to go through some real meta-learning results, and I'll show you a very simple way of doing sim-to-real — from simulation to the physical robot — with meta-learning. This was a really nice intern project from 2017. I think we can agree that in the domain of robotics it would be nice if you could train your policy in simulation and then somehow have that knowledge carry over to the physical robot. Now, we can build simulators that are okay, but they can never perfectly match the real world unless you want an insanely slow simulator, and the reason for that is that simulating contacts turns out to be super hard. I heard somewhere — correct me if I'm wrong — that simulating friction is NP-complete; I'm not sure, but it's stuff like that. So your simulation is just not going to match reality; there will be some resemblance, but that's it.

How can we address this problem? I want to show you one simple idea. One thing that would be nice is if you could learn a policy that would quickly adapt itself to the real world. Well, if you want to learn a policy that can quickly adapt, we need to make sure it has opportunities to adapt during training time. So what do we do? Instead of solving the problem in just one simulator, we add a huge amount of variability to the simulator: we randomize the friction, we randomize the masses, the lengths of the different objects and their dimensions. You randomize the physics of the simulation in lots of different ways, and then, importantly, you don't tell the policy how you randomized it. So what is it going to do? You take your policy and you put it in an environment, and it says: "well, this is really tough — I don't know what the masses are and I don't know what the frictions are; I need to try things out and figure out what the friction is as I get responses from the environment." So you build a certain degree of adaptability into the policy, and it actually works.

Let me show you. This is what happens when you just train a policy in simulation and deploy it on the physical robot. Here the goal is to bring the hockey puck towards the red dot, and you will see that it struggles. The reason it struggles is the systematic differences between the simulator and the real physical robot — even the basic movement is difficult for the policy because its assumptions are violated so much. If you do the training as I discussed — we train a recurrent neural network policy which learns to quickly infer properties of the simulator in order to accomplish the task — you can then give it the real thing, the real physics, and it will do much better. Now, this is not a perfect technique, but it's definitely very promising whenever you are able to sufficiently randomize the simulator. It's very nice to see the closed-loop nature of the policy: you can see it push the hockey puck and then correct it very gently to bring it to the goal. Yeah, so that was cool; that was a cool application of meta-learning.
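
The training recipe described above can be sketched as follows; `make_sim` and `policy` stand in for a hypothetical simulator factory and recurrent policy object — this is an outline of the idea under those assumptions, not the actual robotics code.

```python
# Domain randomization: resample the physics every episode and hide the
# parameters from the policy, so it must infer them from interaction.
import random

def sample_physics():
    return {
        "friction":    random.uniform(0.2, 1.2),
        "puck_mass":   random.uniform(0.05, 0.40),
        "arm_damping": random.uniform(0.5, 2.0),
    }

def train(policy, make_sim, episodes=10_000):
    for _ in range(episodes):
        sim = make_sim(**sample_physics())      # new physics every episode
        obs, hidden = sim.reset(), policy.initial_state()
        done = False
        while not done:
            # The policy only sees observations; the randomized parameters stay
            # hidden, so its recurrent state must encode whatever it infers.
            action, hidden = policy.act(obs, hidden)
            obs, reward, done = sim.step(action)
            policy.record(obs, action, reward)
        policy.update()                         # any on-policy RL update
    return policy
```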

25:37

I want to discuss one more application of meta-learning, which is learning a hierarchy of actions. This was work done by Frans et al. — actually Kevin Frans, who did it, was in high school when he wrote this paper. One thing that would be nice is if reinforcement learning were hierarchical: if instead of simply taking micro-actions you had some kind of little subroutines that you could deploy. Maybe the term "subroutine" is a little bit too crude, but imagine you had some idea of which action primitives are worth starting with. Now, no one has been able to get real value-add from hierarchical reinforcement learning yet; so far, all the really convincing results in reinforcement learning do not use it. That's because we haven't quite figured out what the right way to do hierarchical reinforcement learning is. I just want to show you one very simple approach where you use meta-learning to learn a hierarchy of actions. Here's what you do: in this specific work you have a certain number of low-level primitives — let's say you have ten of them — and you have a distribution of tasks, and your goal is to learn low-level primitives such that, when they are used inside a very brief run of some reinforcement learning algorithm, you make as much progress as possible. The idea is that you want to learn primitives that result in the greatest amount of progress possible when used inside learning. This is a meta-learning setup because you need a distribution of tasks; here we have a little maze — a distribution over mazes — and in this case the little bug learned three policies, each of which moves it in a fixed direction. As a result of having this hierarchy, you're able to solve problems really fast, but only when the hierarchy is correct. So hierarchical reinforcement learning is still a work in progress, and this work is an interesting proof point of what hierarchical reinforcement learning could be like if it worked.

Now I want to spend just one slide addressing the limitations of high-capacity meta-learning. The specific limitation is that the training task distribution has to be equal to the test task distribution, and I think this is a real limitation, because in reality the new tasks that you want to learn to do are in some ways fundamentally different from anything you've seen so far. For example, if you go to school you learn lots of useful things, but when you go to work, only a fraction of the things that you've learned carries over; you need to learn quite a few more things from scratch. Meta-learning would struggle with that, because it really assumes that the distribution over the training tasks is equal to the distribution over the test tasks. That's the limitation. I think that as we develop better algorithms for being robust when the test tasks are outside the distribution of the training tasks, meta-learning will work much better.

29:42

Now I want to talk about self-play. Self-play is a very cool topic that's starting to get attention only now, and I want to start by reviewing very old work called TD-Gammon. It's from all the way back in 1992, so it's 26 years old now, and it was done by Gerald Tesauro. This work is really incredible because it has so much relevance today. What they did, basically, is they said: okay, let's take two neural networks and let them play backgammon against each other, and let them be trained that way. It's a super-modern approach, and you would think this was a paper from 2017 — except that then you look at this plot: it shows ten hidden units, twenty hidden units, forty and eighty for the different colors, and you notice that the largest neural network works best. So in some ways not much has changed, and this is the evidence. In fact, they were able to beat the world champion in backgammon, and they were able to discover new strategies that the best human backgammon players had never noticed — and it was determined that the strategies discovered by TD-Gammon were actually better. So that's pure self-play with Q-learning, which then remained dormant until the DQN work on Atari by DeepMind. Other examples of self-play include AlphaGo Zero, which was able to learn to beat the world champion at Go without using any external data whatsoever. Another result in this vein is from OpenAI: our Dota 2 bot, which was able to beat the world champion in the 1v1 version of the game.

I want to spend a little bit of time talking about the allure of self-play and why I think it's exciting. One important problem that we must face as we try to build truly intelligent systems is: what is the task? What are we actually teaching the systems to do? One very attractive attribute of self-play is that the agents create the environment: by virtue of the agents acting in the environment, the environment becomes difficult for the other agents. You can see here an example of an iguana interacting with snakes that try to eat it — unsuccessfully this time, as we'll see in a moment when the iguana escapes. The fact that you have this arms race between the snakes and the iguana motivates their development, potentially without bound, and this is what happens in effect in biological evolution. Interesting work in this direction was done in 1994 by Karl Sims — there is a really cool video on YouTube by Karl Sims, you should check it out, which shows all the work that he's done. Here you have a little competition between agents where you evolve both their behavior and their morphology, and the agents are trying to gain possession of a green cube. You can see that the agents create the challenge for each other, and that's why they need to develop.

One thing that we did — this is work from OpenAI — is we said: okay, can we demonstrate some unusual results in self-play that would really convince us that there is something there? What we did is we created a small ring, and you have these two humanoid figures whose goal is just to push each other outside the ring. They don't know anything about wrestling, they don't know anything about standing or balance, they don't know anything about centers of gravity; all they know is that if you don't do a good job, then your competition is going to do a better job. One of the really attractive things about self-play is that you always have an opponent that's roughly as good as you are. In order to learn, you need to sometimes win and sometimes lose — you can't always win; sometimes you must fail, sometimes you must succeed. Let's see what happens here: the green humanoid was able to block the ball. In a well-balanced self-play environment the competition is always level: no matter how good or how bad you are, you have a competition that poses exactly the right challenge for you. One more thing here: this video shows transfer learning. You take the little wrestling humanoid, you take its friend away, and you start applying big random forces to it to see if it can maintain its balance. The answer turns out to be yes, it can, because it's been trained against an opponent that pushes it, and that's why, even if it doesn't understand where the force is being applied, it's still able to balance itself. So this is one potentially attractive feature of self-play environments: you can learn a certain broad set of skills, although it's really hard to control exactly what those skills will be. The biggest open question with this research is: how do you train agents in a self-play environment such that they do whatever they do, but are then able to solve a battery of tasks that is useful for us, a battery that is explicitly specified externally?

I also want to highlight one attribute of self-play environments that we've observed in our Dota bots: we've seen a very rapid increase in the competence of the bots. Over the course of maybe five months, we've seen the bots go from playing totally randomly all the way to the world-champion level. The reason for that is that once you have a self-play environment, if you put compute into it, you turn it into data: self-play allows you to turn compute into data. I think you will see a lot more of that — being able to turn compute into, essentially, data and generalization — simply because the speed of neural net processors will increase very dramatically over the next few years. Neural net cycles will be cheap, and it will be important to make use of this newly found overabundance of cycles.
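
A minimal sketch of a self-play training loop in the spirit described above: the learner always faces a recent snapshot of itself, so the opponent stays at roughly the right difficulty. `agent` and `play_match` are hypothetical stand-ins for an actual learner and game, not OpenAI's Dota system.

```python
import copy
import random

def self_play_training(agent, play_match, iterations=1000,
                       snapshot_every=50, pool_size=10):
    """agent: a learner with .update(); play_match(a, b) -> (trajectory, result)."""
    opponent_pool = [copy.deepcopy(agent)]          # frozen past versions of ourselves
    for it in range(iterations):
        opponent = random.choice(opponent_pool)     # an opponent roughly as good as us
        trajectory, result = play_match(agent, opponent)
        agent.update(trajectory, result)            # any RL update (e.g. policy gradient)
        if (it + 1) % snapshot_every == 0:
            opponent_pool.append(copy.deepcopy(agent))       # freeze a new snapshot
            opponent_pool[:] = opponent_pool[-pool_size:]    # keep only recent ones
    return agent
```

The only external input to this loop is compute: every extra iteration generates fresh training data against an ever-improving opponent, which is the "compute into data" property noted above.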

37:02

I also want to talk a little bit about the endgame of the self-play approach. One thing that we know about the human brain is that it increased in size fairly rapidly over the past two million years. My theory — the reason I think it happened — is that our ancestors got to a point where the thing that's most important for your survival is your standing in the tribe, and less so the tiger and the lion. Once the most important thing is how you deal with those other beings which have a large brain, then it really helps to have a slightly larger brain, and I think that's what happened. There exists at least one paper from Science which supports this point of view: apparently there has been convergent evolution between social apes and social birds in terms of various behaviors, even though the evolutionary divergence between apes and birds occurred a very long time ago and apes and birds have very different brain structures. So I think what should happen, if we successfully follow the path of this approach, is that we should create a society of agents which will have language and theory of mind, negotiation, social skills, trade, an economy, politics, a justice system — all these things should happen inside the multi-agent environment — and there will also be an alignment issue of how you make sure that the agents we train behave in the way that we want.

Now I want to make a speculative digression, which is the following observation. If you believe that this kind of society of agents is a plausible place where truly, fully general intelligence will emerge, and if you accept that our experience with the Dota bots — where we've seen a very rapid increase in competence — will carry over once all the details are right, then, assuming both of these conditions, it should follow that we will see a very rapid increase in the competence of our agents as they live in the society of agents. So that was a potentially interesting way of increasing the competence of agents and teaching them social skills, language, and a lot of the things that actually exist in humans as well.

39:50

Now let's talk a little bit about how you convey goals to agents. Conveying goals to agents is just a technical problem, but it will be important, because it is a lot more likely than not that the agents we train will eventually be dramatically smarter than us. This is work by the OpenAI safety team, by Paul Christiano et al. and others. I'm just going to show you this video, which basically explains how the whole thing works: there is some behavior you're looking for, and you, the human, get to see pairs of behaviors and simply click on the one that looks better. After a very modest number of clicks, you can get this little simulated leg to do backflips — to get this specific behavior it took about 500 clicks by human annotators. The way it works is that this is a very data-efficient reinforcement learning algorithm — but it is efficient in terms of rewards, not in terms of environment interactions. What you do is you take all the clicks — here is one behavior which is better than another — and you fit a numerical reward function to those clicks; you want to fit a reward function which satisfies those clicks, and then you optimize this reward function with reinforcement learning, and it actually works. This required about 500 bits of information, and we've also been able to train lots of Atari games using several thousand bits of information. In all these cases you had human annotators, or human judges, just like in the previous slide, looking at pairs of trajectories and clicking on the one that they thought was better. And here's an example of an unusual goal: this is a car racing game, but the goal was to ask the agent to train the white car to drive right behind the orange car. It's a different goal, and it was very straightforward to communicate this goal using this approach.
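
The reward-fitting step described above can be sketched as a Bradley-Terry-style fit: each click says "A was preferred to B", and a reward model is trained so that preferred trajectories score higher; that learned reward is then maximized with ordinary reinforcement learning. The linear reward model and the synthetic "annotators" below are illustrative assumptions; the actual system fits a neural network reward model.

```python
# Fit a reward function r(traj) = w . features(traj) to pairwise human clicks
# by maximizing the Bradley-Terry likelihood of the observed preferences.
import numpy as np

rng = np.random.default_rng(0)
D = 8                                            # feature dimension per trajectory

def fit_reward(preferences, lr=0.5, steps=2000):
    """preferences: list of (features_preferred, features_rejected)."""
    w = np.zeros(D)
    for _ in range(steps):
        grad = np.zeros(D)
        for fa, fb in preferences:
            p_a = 1.0 / (1.0 + np.exp(-(w @ fa - w @ fb)))   # P(A preferred | w)
            grad += (1.0 - p_a) * (fa - fb)                  # log-likelihood gradient
        w += lr * grad / len(preferences)
    return w

# Fake annotators who secretly prefer trajectories with a larger first feature.
true_w = np.eye(D)[0]
pairs = []
for _ in range(500):                              # "about 500 clicks"
    a, b = rng.normal(size=D), rng.normal(size=D)
    pairs.append((a, b) if true_w @ a > true_w @ b else (b, a))

w = fit_reward(pairs)
print("learned reward weights:", np.round(w, 2))
# The learned reward would then be optimized with any standard RL algorithm.
```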

So then, to finish off: alignment is a technical problem, and it has to be solved. But of course, the determination of the correct goals we want our systems to have will be a very challenging political problem. On this note, I want to thank you so much for your attention, and I just want to say that there will be a happy hour at Cambridge Brewing Company at 8:45 — if you want to chat more about AI and other topics, please come by. I think that deserves an applause.

43:03

So, backpropagation: neural networks are bio-inspired, but backpropagation doesn't look as though it's what's going on in the brain, because signals in the brain go one direction, down the axons, whereas backpropagation requires the errors to be propagated back up the wires. Can you talk a little bit about that whole situation, where it looks as if the brain is doing something a bit different from our highly successful algorithms? Is our algorithm going to be improved once we figure out what the brain is doing, or is the brain really sending signals back even though it's got no obvious way of doing that? What's happening in that area?

That's a great question. First of all, I'll say that the honest answer is that I don't know, but I have opinions, so I'll say two things. First, if we agree that it is a true fact that backpropagation solves the problem of circuit search, this feels like an extremely fundamental problem, and for this reason I think it's unlikely to go away. Now, you are also right that the brain doesn't obviously do backpropagation, although there have been multiple proposals of how it could be doing it. For example, there has been work by Tim Lillicrap and others where they've shown that it's possible to learn a different set of connections that can be used for the backward pass, and that this can result in successful learning. The reason this hasn't really been pushed to the limit by practitioners is that they say "well, I've got TF to do the gradients, I'm just not going to worry about it." But you are right, this is an important issue, and one of two things is going to happen. My personal opinion is that backpropagation is just going to stay with us until the very end, and we will actually build fully human-level and beyond systems before we understand how the brain does what it does. That's what I believe, but of course it is a difference that has to be acknowledged.

play45:12

Okay, thank you. Do you think it was a fair matchup between the Dota bot and that person, given the constraints of the system?

I'd say that one of the biggest advantages computers have in games like this is that they obviously have a better reaction time, although in Dota in particular the number of clicks per second of the top players is fairly small, which is different from StarCraft. StarCraft is a very compact, mechanically heavy game because of the large number of units, and so the top players click all the time. In Dota every player controls just one hero, and so that greatly reduces the total number of actions they need to make. Still, precision matters, and I think we'll discover that. But what I think will really happen is that we'll discover that computers have the advantage in any domain, or rather every domain, just not yet.

So do you think that the emergent behaviors from the agent were actually kind of directed, because the constraints were already in place, so it was sort of forced to discover those? Or do you think it was actually something quite novel, like, wow, it actually discovered these on its own, and you didn't actually bias it by constraining it?

So it definitely discovered new strategies, and I can share an anecdote. We have a tester, a pro, who would test the bots, and he played against them for a long time, and the bots would do all kinds of things against the human player which were effective. Then at some point that pro decided to play against a better pro, and he decided to imitate one of the things that the bot was doing, and by imitating it he was able to defeat the better pro. So I think the strategies it discovers are real and, you know, transferable. I think what that means is that, because the strategies discovered by the bot transfer to humans, the bot and humans are playing a fundamentally, deeply related game.

For a long time now I've heard that the objective of reinforcement learning is to determine a policy that chooses an action to maximize the expected reward, which is what you said earlier. Would you ever want to look at the standard deviation of possible rewards? Does that even make sense?

Yeah, I mean, I think for sure; I think it's really application dependent. One of the reasons to maximize the expected reward is that it's easier to design algorithms for it: you write down this equation, the formula, you do a little bit of derivation, and you get something which amounts to a nice-looking algorithm. Now, I think there really exist applications where you'd never want to make mistakes and you'd want to work on the standard deviation as well, but in practice it seems that just looking at the expected reward covers a large fraction of the situations where you'd like to apply this. Thanks.
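To make the distinction concrete, here is a small sketch, entirely my own illustration: it scores a policy by expected return alone and by a risk-sensitive objective J = E[R] - lambda * Std[R]. The hypothetical `run_episode` stands in for any environment rollout; note how the riskiest setting wins on expected return but loses once variability is penalized.

```python
# A small sketch, entirely my own illustration: score a policy by expected
# return alone versus by a risk-sensitive objective that also penalizes the
# standard deviation of returns, J = E[R] - lambda * Std[R]. `run_episode`
# is a hypothetical stand-in for rolling out a policy in some environment.
import random
import statistics

def run_episode(policy_noise: float) -> float:
    # Hypothetical environment: a riskier policy has a higher average return
    # but much more variable outcomes.
    base = 1.0 + 0.5 * policy_noise
    return random.gauss(base, policy_noise)

def evaluate(policy_noise: float, episodes: int = 10_000, risk_lambda: float = 1.0):
    returns = [run_episode(policy_noise) for _ in range(episodes)]
    mean_r = statistics.fmean(returns)
    std_r = statistics.pstdev(returns)
    return mean_r, std_r, mean_r - risk_lambda * std_r

for noise in (0.1, 1.0, 3.0):
    mean_r, std_r, risk_adjusted = evaluate(noise)
    print(f"noise={noise:.1f}  E[R]={mean_r:.2f}  Std[R]={std_r:.2f}  "
          f"risk-adjusted J={risk_adjusted:.2f}")
```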

We talked last week about motivations, and that has a lot to do with reinforcement. Some of the ideas are that our motivations are actually about connection with others and cooperation, and I'm wondering if those get thrown off. I understand it's very popular to have the computers play these competitive games, but is there any use in having an agent self-play collaborative games?

Yeah, that's an extremely good question. I don't know; one place from which we can get some inspiration is the evolution of cooperation. I think we cooperate ultimately because it's much better for you, the person, to be cooperative than not, and so I think what should happen is that if you have a sufficiently open-ended game, then cooperation will be the winning strategy, and so I think we will get cooperation whether we like it or not.

Hey, you mentioned the complexity of the simulation of friction. I was wondering if you feel that there exist open complexity-theoretic problems relevant to AI, or whether it's just a matter of finding good approximations to the types of problems that humans tend to solve.

Yeah, so, complexity theory: at a very basic level we know that whatever algorithm we're going to run is going to run fairly efficiently on some hardware, so that puts a pretty strict upper bound on the true complexity of the problems we're solving. By definition we are solving problems which aren't too hard in a complexity-theoretic sense. Now, it is also the case that, while the overall thing that we do is not hard in a complexity-theory sense, and indeed humans cannot solve NP-complete problems in general, many of the optimization problems that we pose to our algorithms are intractable in the general case, starting from neural net optimization itself: it is easy to create a family of datasets for a neural network with a very small number of neurons such that finding the global optimum is NP-complete. So how do we avoid it? Well, we just try gradient descent anyway, and somehow it works. But without question, we do not solve problems which are truly intractable. I hope this answers the question.
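As a small illustration of that last point (my own toy, not something from the talk): fitting XOR with a tiny network is a non-convex problem whose worst case is hard, yet plain gradient descent from random initializations usually succeeds, with only the occasional run getting stuck.

```python
# A small toy of my own, not from the talk: fitting XOR with a tiny network
# is a non-convex problem (and worst-case hard in general), yet plain
# gradient descent from random initializations usually finds a good solution,
# even if the occasional run gets stuck in a poor local optimum.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def train_once(rng, hidden=3, lr=0.5, steps=5000):
    W1 = rng.standard_normal((2, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, 1))
    b2 = np.zeros(1)
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))     # sigmoid output
        dp = (p - Y) * p * (1 - p)                   # grad of squared error through sigmoid
        dh = (dp @ W2.T) * (1 - h ** 2)              # ...and through tanh
        W2 -= lr * h.T @ dp
        b2 -= lr * dp.sum(axis=0)
        W1 -= lr * X.T @ dh
        b1 -= lr * dh.sum(axis=0)
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return float(np.mean((p - Y) ** 2))

rng = np.random.default_rng(0)
losses = [train_once(rng) for _ in range(20)]
print("runs that fit XOR (MSE < 0.01):", sum(l < 0.01 for l in losses), "out of 20")
```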

Hello. It seems like an important sub-problem on the path towards AGI will be understanding language, and the state of generative language modeling right now is pretty abysmal. What do you think are the most productive research trajectories towards generative language models?

I'll first say that you are completely correct that the situation with language is still far from great, although progress has been made. Even without any particular innovations beyond models that exist today, simply scaling up models that exist today on larger datasets is going to go surprisingly far; not even larger datasets, but larger and deeper models. For example, if you trained a language model with a thousand layers, even with the same layers we have today, I think it's going to be a pretty amazing language model. We don't have the cycles for it yet, but I think that will change very soon. Now, I also agree with you that there are some fundamental things missing in our current understanding of deep learning which prevent us from really solving the problem that we want. I think one of the things that's missing, that seems patently wrong, is the fact that we train a model, then we stop training the model and freeze it, even though it's the training process where the magic really happens. The magic is that, if you think about it, the training process is the truly general part of the whole story, because your TensorFlow code doesn't care which dataset it optimizes; it just says, whatever, give me the dataset, I don't care which one, I'll solve them all. The ability to do that feels really special, and I think we are not using it at test time. It's hard to speculate about things where you don't know the answer, but all I'll say is that simply training bigger, deeper language models will go surprisingly far, and beyond scaling up, doing things like training at test time, during inference, would be another important boost to the performance.
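Here is a toy sketch, my own illustration rather than anything proposed in the talk, of the "don't freeze the model" point: a character bigram model that either stops adapting after training or keeps updating its counts on the text it sees at test time. When the test text drifts away from the training distribution, the continually updated model assigns it noticeably higher likelihood.

```python
# A toy sketch (my own, not from the talk) of continuing to train at test
# time: a character bigram model that either freezes after training or keeps
# updating its counts on the text it sees during evaluation.
from collections import defaultdict
import math

class BigramLM:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, text: str):
        for a, b in zip(text, text[1:]):
            self.counts[a][b] += 1

    def logprob(self, text: str) -> float:
        total = 0.0
        for a, b in zip(text, text[1:]):
            ctx = self.counts[a]
            # add-one smoothing over a small alphabet of 128 ASCII symbols
            total += math.log((ctx[b] + 1) / (sum(ctx.values()) + 128))
        return total

train_text = "the cat sat on the mat. the cat ate. " * 50
test_text = "the dog dug in the garden. the dog dozed. " * 5

frozen = BigramLM()
frozen.update(train_text)

adaptive = BigramLM()
adaptive.update(train_text)

frozen_lp, adaptive_lp = 0.0, 0.0
chunk = 40
for i in range(0, len(test_text) - 1, chunk):
    piece = test_text[i:i + chunk + 1]
    frozen_lp += frozen.logprob(piece)
    adaptive_lp += adaptive.logprob(piece)
    adaptive.update(piece)          # keep training on what we just saw

print("frozen log-prob:  ", round(frozen_lp, 1))
print("adaptive log-prob:", round(adaptive_lp, 1))   # higher, i.e. better fit
```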

Hi, thank you for the talk. It seems like right now another interesting approach to solving reinforcement learning problems could be to go the evolutionary route, using evolution strategies, although they have their caveats. I wanted to know if at OpenAI in particular you're working on something related, and what is your general opinion on them?

At present I believe that something like evolution strategies is not great for reinforcement learning; I think that normal reinforcement learning algorithms, especially with big policies, are better. But if you want to evolve a small, compact object, like a piece of code for example, I think that would be a place where I'd seriously consider it. Though, you know, evolving a beautiful piece of code is a cool idea that hasn't been done yet, so there is still a lot of work to be done before we get there.
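For reference, a minimal sketch in the spirit of evolution strategies as used for policy search: perturb the parameters with Gaussian noise, score each perturbation, and move the parameters toward the better-scoring ones. The objective and hyperparameters below are illustrative stand-ins of mine, not OpenAI's actual setup.

```python
# A minimal sketch of an evolution-strategies update for policy search:
# perturb the parameters with Gaussian noise, score each perturbation, and
# move the parameters toward the better-scoring ones. The "return" below is
# a toy function of my own standing in for an episode rollout.
import numpy as np

def episode_return(theta: np.ndarray) -> float:
    # Hypothetical stand-in for running a policy in an environment:
    # the return is maximized when theta matches a hidden target vector.
    target = np.linspace(-1.0, 1.0, theta.size)
    return -float(np.sum((theta - target) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(20)
sigma, lr, population = 0.1, 0.02, 50

for step in range(300):
    noise = rng.standard_normal((population, theta.size))
    returns = np.array([episode_return(theta + sigma * n) for n in noise])
    # Standardize returns so the update is invariant to reward scale.
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    theta += lr / (population * sigma) * noise.T @ advantages

print("final return:", episode_return(theta))   # should end up close to the optimum of 0
```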

Hi, thank you so much for coming. My question is: you mentioned that determining the right goal is a political problem, so I'm wondering if you can elaborate a bit on that, and also what you think would be the approach for us to maybe get there?

Well, I can't really comment too much, because, you know, we now have a few people who are thinking about this full-time at OpenAI, and I don't have enough of a super strong opinion to say anything too definitive. All I can say at a very high level is this: if you go into the future, whenever it's going to happen, soon or late, when you build a computer which can do anything better than a human, and it will happen because the brain is physical, the impact on society is going to be completely massive and overwhelming. It's very difficult to imagine even if you try really hard, and I think what it means is that many people will care about this very strongly; that's what I was alluding to. And as the impact increases gradually, with self-driving cars and more automation, I think we will see a lot more people care.

Do we need to have a very accurate model of the physical world, and then simulate that, in order to have these agents that can eventually come out into the real world and do something approaching human-level intelligence tasks?

That's a very good question. I think if that were the case we'd be in trouble, and I am very certain that it can be avoided. Specifically, the real answer has to be that you learn the problem: you learn to negotiate, you learn to persist, you learn lots of different useful life lessons in the simulation, and yes, you learn some physics too. But then you go outside into the real world and you have to start over to some extent, because many of your deeply held assumptions will be false. That's one of the reasons I care so much about never stopping training: you've accumulated your knowledge, now you go into an environment where some of your assumptions are invalid, you continue training, and you try to connect the new data to your old data. This is an important requirement for our algorithms which is already met to some extent, but it will have to be met a lot more, so that you can take the partial knowledge you've acquired, go into a new situation, and learn some more. It's literally the example of: you go to school, you learn useful things, then you go to work. It's not perfect; your four years of a CS undergrad are not going to fully prepare you for whatever it is you need to know at work. It will help somewhat, you'll be able to get off the ground, but there will be lots of new things you need to learn. That's the spirit of it, I think: the analogy of the school.

One of the things you mentioned pretty early on in your talk is that one of the limitations of this style of reinforcement learning is that there's no self-organization, so you have to tell it when it did a good thing or a bad thing. That's actually a problem in neuroscience too: when you're trying to teach a rat to navigate a maze, you have to artificially tell it what to do. So where do you see the research moving forward in that respect, when we already have this problem with teaching, not just learning but also teaching? How do you introduce this notion of self-organization?

I think, without question, one really important thing you need to do is to be able to infer the goals and strategies of other agents by observing them. That's a fundamental skill we need to be able to learn, to embed into the agent. So if, for example, you have two agents, and one of them is doing something, the other agent says, well, that's really cool, I want to be able to do that too, and you go and do that. I'd say this is a very important component: in terms of, say, the reward, you see what they do, you infer the reward, and now we have a knob which says, you see what they're doing, now go and try to do the same thing. As far as I know, this is one of the important ways in which humans are quite different from other animals: the scale and scope in which we copy the behavior of other humans.

May I ask a quick follow-up? Go for it. So it's kind of obvious how that works in the scope of competition, but what about just arbitrary tasks? Like, I'm in a math class and I see someone doing a problem a particular way, and I go, that's a good strategy, maybe I should try that out. How does that work in a non-competitive environment?

I think that's going to be a little bit separate from the competitive environment, but it will have to be, somehow, either way, you know, probably baked in, maybe evolved into the system: if you have other agents doing things, they're generating data which you observe, and the only way to truly make sense of the data that you see is to infer the goal of the agent, their strategy, their belief state. That's important also for communicating with them: if you want to successfully communicate with someone, you have to keep track both of their goal and of their belief state, their state of knowledge. So I think you will find that there are many connections between understanding what other agents are doing, inferring their goals, imitating them, and successfully communicating with them.
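A concrete, if toy, version of inferring another agent's goal from observation (my own illustration, not something described in the talk): assume the observed agent is noisily rational, preferring moves that bring it closer to its goal, and do Bayesian inference over a handful of candidate goals given the trajectory you have seen.

```python
# A toy sketch (my own, not from the talk) of inferring another agent's goal
# from its observed behavior: assume the agent is noisily rational, then do
# Bayesian inference over a set of candidate goals given an observed
# trajectory on a grid.
import math

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def action_logprob(pos, action, goal, beta=3.0):
    # Softmax over actions: moves that get closer to `goal` are more likely.
    def score(a):
        nx, ny = pos[0] + ACTIONS[a][0], pos[1] + ACTIONS[a][1]
        return -beta * (abs(nx - goal[0]) + abs(ny - goal[1]))
    log_z = math.log(sum(math.exp(score(a)) for a in ACTIONS))
    return score(action) - log_z

def goal_posterior(trajectory, candidate_goals):
    # trajectory: list of (position, action) pairs observed for the other agent
    log_post = {g: 0.0 for g in candidate_goals}          # uniform prior
    for pos, action in trajectory:
        for g in candidate_goals:
            log_post[g] += action_logprob(pos, action, g)
    z = sum(math.exp(v) for v in log_post.values())
    return {g: math.exp(v) / z for g, v in log_post.items()}

# The observed agent walks right and up, which should point to goal (3, 3).
observed = [((0, 0), "right"), ((1, 0), "right"), ((2, 0), "up"), ((2, 1), "up")]
print(goal_posterior(observed, candidate_goals=[(3, 3), (-3, 0), (0, -3)]))
```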

All right, let's give the happy hour a big hand.

[Applause]
