Ilya Sutskever | OpenAI has already achieved AGI through large model training

Me&ChatGPT
9 Aug 2024 · 57:48

Summary

TLDR: This talk takes a deep dive into recent progress in deep learning and reinforcement learning. The speaker first explains why deep learning works, emphasizing the importance of finding the best short program that explains the data. He then discusses reinforcement learning algorithms, in particular policy gradients and Q-learning, and points out their challenges with exploration and sparse rewards. The talk also introduces meta-learning and self-play, showing their potential for improving learning efficiency and solving complex tasks. Through several project case studies, such as Dota 2 and robot learning, the speaker demonstrates practical applications of these algorithms and offers an outlook on future research directions.

Takeaways

  • 🤖 Progress in automation and artificial intelligence owes much to the success of deep learning, even though why deep learning works is not self-evident.
  • 🧠 At its core, deep learning is about finding the shortest program, or the best small circuit, that explains the data, which connects to the concept classes of machine learning.
  • 🔍 Backpropagation is the key algorithm in deep learning; why it succeeds remains a mystery, yet it has driven much of the progress in artificial intelligence.
  • 🎯 Reinforcement learning is a framework for describing agent behavior: the agent learns by interacting with an environment and receiving rewards. The algorithms still leave plenty of room for improvement, but they can already succeed at many tasks.
  • 🔄 The goal of meta-learning is for machines to learn how to learn: by training a system on many tasks, it becomes able to learn new tasks quickly.
  • 🔧 A hard problem can become easier to solve when it is recast as the context of many problems, as in the Hindsight Experience Replay algorithm.
  • 🤹‍♂️ Self-play is a training method in which agents learn strategies by competing against each other; it has produced striking results in games such as Go and Dota.
  • 🌐 In self-play environments, agents can produce behavior of unbounded complexity, which may be useful for building agents with advanced intelligence.
  • 🧬 By combining simulation and self-play, policies trained in simulated environments can perform well in the real world.
  • 🧠 Architectural innovations in neural networks, such as adding memory structures, are crucial for improving a model's generalization and learning ability.

Q & A

  • Why does deep learning work?

    - Deep learning works because it can find the best small circuit that explains the data. In theory, the best short program is the ideal way to explain data, but finding it is computationally infeasible in practice. Small circuits can still perform non-obvious computation, and backpropagation can find the best small circuit that explains the data, which is the key to why deep learning works.

  • What is meta-learning, and why does it matter for artificial intelligence?

    - Meta-learning is learning how to learn: a system is trained on many tasks so that it can quickly solve new ones. It matters because it improves the generalization of learning algorithms and reduces the need for task-specific design, which helps push AI forward.
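
    As a rough illustration of this "replace each data point with a task" framing, here is a minimal Python sketch. The model, task objects, and optimizer are hypothetical placeholders, not code from the talk:

    import random

    def meta_train_step(model, tasks, optimizer, batch_size=32):
        # One meta-training step: each "example" is an entire task.
        # `model(support_set, query_x)` is assumed to condition on a few
        # labeled examples of the task and predict the query label.
        total_loss = 0.0
        for _ in range(batch_size):
            task = random.choice(tasks)                # sample a task, not a data point
            support_set = task.sample(k=5)             # a few labeled examples of the task
            query_x, query_y = task.sample(k=1)[0]
            prediction = model(support_set, query_x)   # "fast learning" happens in the forward pass
            total_loss = total_loss + model.loss(prediction, query_y)
        optimizer.step(total_loss)                     # ordinary gradient-based training
        return total_loss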

  • What role does reinforcement learning play in artificial intelligence?

    - Reinforcement learning provides a framework for describing the behavior of agents, in which an agent learns by interacting with an environment and receiving rewards. Its importance comes from the existence of useful algorithms: even though they leave much room for improvement, they can already succeed at many non-obvious tasks.

  • What are policy gradients and Q-learning, and what do they do in reinforcement learning?

    - Policy gradients are a reinforcement learning algorithm that injects randomness into the policy and then adjusts it according to whether the outcome was better or worse than expected. Q-learning is another algorithm that learns by estimating the future value of taking a given action in a given state. Both improve an agent's decision-making so that it can learn to reach its goals more effectively.
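
    To make the policy-gradient idea concrete, here is a tiny, self-contained REINFORCE-style toy on a two-armed bandit (an illustrative example, not from the talk): add randomness to the action, compare the reward to a running baseline, and make better-than-expected actions more likely.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)              # logits of a softmax policy over 2 actions
    baseline, lr = 0.0, 0.1

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    for step in range(2000):
        probs = softmax(theta)
        a = rng.choice(2, p=probs)                 # act with a little randomness
        reward = rng.normal(loc=(0.0, 1.0)[a])     # arm 1 is better on average
        advantage = reward - baseline              # was this better than expected?
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0                      # gradient of log pi(a | theta)
        theta += lr * advantage * grad_log_pi      # make good actions more likely
        baseline += 0.05 * (reward - baseline)     # running estimate of "expected"

    print(softmax(theta))                          # probability mass concentrates on the better arm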

  • What is Hindsight Experience Replay, and how does it help with the exploration problem in reinforcement learning?

    - Hindsight Experience Replay is an algorithm that turns failed attempts into opportunities to learn new goals, which addresses the exploration problem. For example, if an agent tries to reach state A but ends up in state B, the algorithm uses that trajectory to learn how to reach state B. As a result the agent wastes no experience and learns something from every attempt.
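
    A minimal sketch of this relabeling trick (a hypothetical helper, not the paper's code): a failed attempt to reach goal A is stored a second time with the state it actually reached as the goal, so an off-policy learner always gets useful, non-zero reward signal.

    def relabel_with_hindsight(episode, original_goal):
        # episode: list of (state, action, next_state) transitions
        achieved_goal = episode[-1][2]                 # where we actually ended up
        relabeled = []
        for state, action, next_state in episode:
            # sparse binary reward: 1 only when the current goal is reached
            relabeled.append((state, action, next_state, original_goal,
                              float(next_state == original_goal)))
            relabeled.append((state, action, next_state, achieved_goal,
                              float(next_state == achieved_goal)))
        return relabeled

    # A failed attempt to reach state 5 still teaches the agent how to reach 3.
    episode = [(0, "right", 1), (1, "right", 2), (2, "right", 3)]
    replay_buffer = relabel_with_hindsight(episode, original_goal=5)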

  • What are the applications of self-play in artificial intelligence?

    - Self-play is a training method in which an agent learns by playing against versions of itself. It has achieved notable success in games such as Go, chess, and Dota. Its value is that it generates complex strategies and behaviors while providing a constant source of challenge and motivation to keep learning.
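
    The training loop behind this idea can be sketched as follows (structure only; the policy, match, and update routines are hypothetical placeholders). The key point is that the opponent is always a copy of the learner, so the difficulty of the task grows with the learner itself.

    import random

    def self_play_training(make_policy, play_match, update, iterations=1000):
        learner = make_policy()
        opponent_pool = [learner.snapshot()]          # keep past versions for stability
        for _ in range(iterations):
            opponent = random.choice(opponent_pool)   # an opponent of comparable strength
            trajectory, outcome = play_match(learner, opponent)  # outcome: +1 win / -1 loss
            update(learner, trajectory, outcome)      # e.g. a policy-gradient step
            opponent_pool.append(learner.snapshot())
            opponent_pool = opponent_pool[-50:]       # bounded pool of recent versions
        return learner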

  • Why can self-play produce unbounded complexity?

    - Self-play can produce unbounded complexity because it creates an environment in which agents continually challenge and surpass themselves. As the agents become stronger, the strategies and behaviors they produce become more complex, driving a continual growth in their cognitive abilities.

  • In self-play, how is the agent guaranteed to always have an incentive to improve?

    - In self-play the agent always plays against an opponent of roughly its own strength, so it is always challenged. Even when the agent becomes very good, its opponent is equally good, so there is always an incentive to improve and to learn new strategies.

  • Why is transfer learning important in artificial intelligence?

    - Transfer learning allows an agent to apply skills and knowledge learned in one domain to another. This matters because it reduces the data and time needed to learn new tasks and makes agents more adaptable and flexible.

  • How can a model's ability to generalize be improved?

    - Improving generalization usually involves better learning algorithms, more diverse training data, and regularization techniques. Beyond that, methods such as meta-learning, self-play, and hindsight experience replay are also used to improve performance on new tasks.

Outlines

00:00

🤖 Deep Learning and Meta-Learning

This section introduces OpenAI's work over the past year, with particular emphasis on the two themes of meta-learning and self-play. It discusses why deep learning works and how finding the best short program would give the best explanation and prediction of the data. Small circuits are presented as the next best thing, since they can perform non-obvious computation and the best small circuit can be found with backpropagation. It also touches on reinforcement learning algorithms, especially policy gradients and Q-learning, and their real-world applications and challenges.

05:00

🧠 Reinforcement Learning and Its Algorithms

This section digs deeper into reinforcement learning, covering its algorithms and their use in building intelligent agents. The reinforcement learning framework lets an agent learn by interacting with an environment and receiving rewards. It discusses representing the policy with a neural network and improving the model by changing its parameters, and it introduces modern reinforcement learning algorithms, including policy gradients and Q-learning, and how they use randomness and a recursive value estimate to optimize the agent's behavior.

10:02

🔄 Meta-Learning and Fast Learning

This section discusses the idea of meta-learning: training a system on many tasks so that it learns new ones quickly. Two main approaches are introduced: training a large neural network that can quickly solve problems drawn from a task distribution, and learning an architecture or algorithm, which generalizes more broadly. Examples cover meta-learning for character recognition and image datasets, and the section highlights its potential for improving learning efficiency.

15:02

🎯 Goal-Directed Learning and Exploration

This section introduces an algorithm called hindsight experience replay, which tackles the exploration problem by recasting one hard problem as many problems. When an attempt to reach a goal fails, the algorithm uses that attempt to learn how to reach the goal it actually reached. This makes learning more efficient because no experience is wasted. It also discusses how self-play can be used to improve the generalization of these algorithms.

20:05

🤖 Self-Play and Intelligent Agents

This section explores the importance of self-play for the development of intelligent agents. Self-play allows simple environments to produce agents of potentially unbounded complexity, which matters for building agents with advanced social skills. Examples cover self-play in board games and esports, and the section discusses how agents trained in self-play environments might be applied to real-world tasks.

25:06

🧠 Neural Networks and Self-Play

This section discusses the use of neural networks in self-play and how self-play can raise an agent's intelligence. It stresses that in self-play environments compute itself becomes the source of data, and it poses a hypothesis: if the self-play environment is sufficiently open-ended, the agents' cognitive abilities may increase very rapidly, possibly to a superhuman level.

30:07

🔧 Transfer Learning and Concept Extraction

This section examines the role of transfer learning in improving agent performance, particularly when applying what was learned in one domain to another. It discusses the possibility and the challenges of concept extraction, along with the current state of research in this area, and stresses the difficulties that still stand in the way of effective transfer learning.

35:09

📚 Curriculum Learning and Self-Play

This section discusses the role of curriculum learning in self-play and how self-play builds in its own curriculum. It highlights how self-play simplifies the exploration problem and keeps agents motivated to improve, and it raises the key questions of how a self-play environment should be set up and how agents trained with self-play can be applied to useful tasks.

40:10

🧐 New Architectures and Learning Algorithms

This section explores the role of new neural-network architectures and how they affect learning algorithms and a model's ability to generalize. Soft attention is discussed as an example of architectural innovation in recent years, and possible routes to better generalization through changes to learning algorithms and modeling paradigms are proposed.

Keywords

💡 Deep Learning

Deep learning is a branch of machine learning based on artificial neural networks, loosely modeled on how the brain processes data and forms patterns. The video explores how deep learning works, emphasizing its ability to handle complex data and pattern recognition, which makes it the key driver of modern AI.

💡 Meta-Learning

Meta-learning, also called learning to learn, trains a system across many tasks so that it can adapt quickly to new ones. The video notes that the goal is for machines not merely to learn a specific task but to learn how to learn, improving their ability to generalize.

💡 Self-Play

Self-play trains a model by having it compete against itself or other instances of itself, and is common in reinforcement learning. The video discusses its potential for training agents, with AlphaGo and Dota 2 as examples of how self-play drives increasingly complex strategies and behaviors.

💡 Reinforcement Learning

Reinforcement learning is a learning paradigm in which an agent interacts with its environment to learn a behavior policy that maximizes cumulative reward. The video covers reinforcement learning algorithms such as policy gradients and Q-learning and how they balance exploration and exploitation to optimize the policy during training.

💡 Backpropagation

Backpropagation is the algorithm used to train artificial neural networks: it computes the gradient of the loss with respect to the network parameters and updates the weights accordingly. The video stresses its importance to deep learning, especially its role in finding the best small circuit that explains the data.

💡 Small Circuits

In the video, small circuits are described as the next best thing after short programs because they can perform non-obvious computation. In deep learning a small circuit plays the role of a restricted program whose best instance can be found with backpropagation, which is key to generalizing from data.

💡 Policy Gradients

Policy gradient methods optimize the policy directly by gradient ascent on the parameters of the policy function. The video explains how they improve the policy by increasing the probability of actions that led to better-than-expected outcomes.

💡 Q-Learning

Q-learning is a value-based reinforcement learning method that estimates the expected utility of taking a particular action in a given state. The video explains how Q-learning exploits a recursive property to estimate the Q-function and how it works as an off-policy learning algorithm.
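
To make the recursive update concrete, here is a tiny, self-contained tabular Q-learning example on a five-state chain (an illustrative toy, not from the talk). The behavior policy is purely random, yet the greedy policy recovered from Q is correct, which is exactly the off-policy property described above.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 5, 2, 0.95, 0.1
Q = np.zeros((n_states, n_actions))          # Q[s, a]: estimated future return

for episode in range(500):
    s = 0
    while s != n_states - 1:                 # reward is given only at the last state
        a = rng.integers(n_actions)          # behavior is random: learning is off-policy
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # recursive update: Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[s, a] += lr * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # greedy policy: move right (action 1) in every non-terminal state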

💡 Hindsight Experience Replay

Hindsight Experience Replay is a technique for improving sample efficiency in reinforcement learning by relabeling failed attempts with the goals that were actually reached. The video discusses how this lets the agent learn from unsuccessful attempts, improving both learning efficiency and exploration.

💡 Sim-to-Real Transfer

Sim-to-real transfer means training a model in simulation and then applying what it learned in the real world. The video describes training across a diverse family of simulated environments so that the model generalizes better and adapts to variation in the real world.
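
A structural sketch of the randomization loop described above (the environment API here is a hypothetical placeholder): each episode re-samples the simulator's physics and hides the parameters from the policy, forcing it to infer them from interaction, which is what helps it cope with real-world physics it has never seen.

import random

def randomized_simulator(base_env):
    # Re-sample physics every episode; the ranges are made-up examples.
    params = {
        "friction":  random.uniform(0.5, 1.5),
        "gravity":   random.uniform(8.0, 12.0),
        "link_mass": random.uniform(0.8, 1.2),
    }
    return base_env.with_physics(**params)       # hypothetical constructor

def train_with_domain_randomization(policy, base_env, update, episodes=10_000):
    for _ in range(episodes):
        env = randomized_simulator(base_env)     # new physics, not revealed to the policy
        trajectory = env.rollout(policy)         # the policy sees only observations
        update(policy, trajectory)
    return policy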

Highlights

An analysis of why deep learning works, arguing that finding the best short program to explain the data is the theoretically best way to generalize.

Small circuits are the next best thing after short programs because they can perform non-obvious computation.

The success of backpropagation is a fortunate mystery; it has powered essentially all of the progress in AI over the past six years.

The reinforcement learning framework describes agent behavior: the agent interacts with an environment and receives rewards when it succeeds.

Modern reinforcement learning algorithms explore better behavior by adding randomness to the policy.

Policy gradients and Q-learning are the two main families of reinforcement learning algorithms, differing in stability and sample efficiency.

The idea of meta-learning is to train a system on many tasks so that it can solve new tasks quickly.

Meta-learning enables fast learning from only a few samples, as demonstrated on character-recognition tasks.

One challenge of meta-learning is designing algorithms that generalize to tasks outside the distribution seen during training.

Introduction of an algorithm called hindsight experience replay, which improves learning efficiency by turning failed attempts into opportunities to learn new goals.

Demonstration of hindsight experience replay's effectiveness under sparse rewards and its potential in real physical environments.

Discussion of how meta-learning can be used to train policies in simulation that generalize to the real world.

Introduction of a hierarchical reinforcement learning approach that learns low-level actions to speed up learning.

Self-play is a way to create complex agents; it allows unbounded complexity to emerge in simple environments.

A key advantage of self-play is that it provides an environment with a constant incentive to improve, because the opponent is always equally good.

An open question is posed: if a self-play environment is sufficiently open-ended, will agents' cognitive abilities increase extremely rapidly?

The importance of the collaboration with Berkeley for these research results is emphasized.

Transcripts

play00:01

about some of the work we've done at

play00:03

open AI over the past year and this is

play00:06

like a narrow subset; the focus of the

play00:09

talk will be a subset of that work

play00:12

focusing on meta learning and selfplay

play00:15

which are two topics I like very

play00:18

much but I've been told that this is a

play00:21

more slightly broader a little bit more

play00:23

of a general interest talk so I want to

play00:25

begin the presentation by talking a

play00:29

little bit about why deep learning

play00:31

actually

play00:34

works and I think it's not a

play00:37

self-evident

play00:38

question why deep learning works it's

play00:41

not self-evident that it should

play00:43

work and I want to

play00:46

give some a perspective which I think is

play00:49

not entirely obvious on

play00:51

that so one thing that you can actually

play00:54

prove

play00:55

mathematically that the best possible

play00:57

way of generalizing that's completely

play01:01

unimprovable is to find the best short

play01:04

program that explains your data and then

play01:06

use that to make

play01:08

predictions and you can prove that it's

play01:10

impossible to do better than that so if

play01:13

you think about machine learning you

play01:15

need to think about concept classes what

play01:16

are you looking for given the data and

play01:18

if you're looking for the best short

play01:19

program it's impossible to generalize

play01:21

better than that and it can be proved

play01:24

and the proof is not even that

play01:28

complicated and like the intuition of it

play01:31

basically is that any regular any

play01:33

regularity that can possibly exist is

play01:36

expressible as a short program if you

play01:38

have some piece of data which cannot be

play01:40

compressed via a slightly shorter

play01:42

program then that piece of data is

play01:43

totally random so you can take my word

play01:47

on it that it therefore

play01:49

follows the short programs are the best

play01:52

possible way to generalize if only we

play01:54

could use them problem is it is

play01:57

impossible to find the best short

play01:59

program describing the data at least given

play02:01

today's knowledge the computational

play02:04

problem of finding the best short

play02:06

program is intractable in practice

play02:08

undecidable in

play02:10

theory so no short programs for us but

play02:15

what about small circuits small circuits

play02:17

are the next best

play02:18

thing after short programs because a

play02:22

short a small circuit can also performs

play02:26

non obvious computation if you have a

play02:27

really deep really wide circuit maybe

play02:30

you know many many thousand layers and

play02:33

many millions of neurons wide you can

play02:34

run lots of different algorithms on the

play02:36

inside so it comes close it comes close

play02:39

to short

play02:40

programs and extremely fortunately the

play02:44

problem of

play02:45

finding the best small circuit given the

play02:48

data is solvable with

play02:51

backprop and

play02:53

so basically what it boils down to is

play02:56

that we can find the best small circuit

play02:59

that explains the data and small

play03:01

circuits are kind of like programs but

play03:02

not really they are a little bit worse

play03:04

it's like finding the

play03:06

best parallel program that runs for 100

play03:09

steps or less 50 steps that solves your

play03:12

problem and that's where the

play03:13

generalization comes

play03:16

from now we don't know why don't know

play03:20

exactly why back propagation is

play03:22

successful at finding the best short

play03:26

circuit given given your data it's a

play03:30

mystery and it's a very fortunate

play03:31

mystery it Powers all the progress that

play03:33

we've made in all the progress that's

play03:35

been made in uh artificial intelligence

play03:37

over the past six

play03:38

years so I think there is an element of

play03:42

luck here we are lucky that it

play03:44

works one thing which I one useful

play03:47

analogy that I like to make when

play03:50

thinking about

play03:51

generalization is that

play03:54

models learning models that in some ways

play03:57

have greater computational power

play03:59

generalize better so you could make this

play04:03

you could make the case that the deeper

play04:05

your neural network is the closer it

play04:07

comes to the all the ultimate best short

play04:10

programs and so the better will

play04:14

generalize so that's that tries to touch

play04:17

on the question of where does

play04:19

generalization come

play04:21

from I think the full answer is going to

play04:23

be unknown for quite some time because

play04:25

it also has to do with the specific data

play04:27

that we happen to want to solve

play04:30

it is very nice indeed that the problems

play04:33

we want to solve happen to be solvable

play04:34

with these classes of

play04:37

models one other statement I want to

play04:40

make is

play04:42

that I think that the back propagation

play04:45

algorithm is going to stay with us until

play04:46

the very end because the problem that it

play04:48

solves is so fundamental which is given

play04:51

data find the best small circuit that

play04:54

fits to

play04:55

it it seems unlikely that this problem

play04:58

that we will not want to solve this

play05:00

problem in the

play05:01

future and so for this reason I feel

play05:03

like backprop is really

play05:08

important now I want to spend a little

play05:10

bit of time talking about reinforcement

play05:13

learning and

play05:15

so reinforcement learning is a framework

play05:18

for describing the behavior of Agents

play05:21

you've got an agent which takes actions

play05:23

interacts with an

play05:24

environment and receives rewards when it

play05:27

succeeds

play05:31

and it's pretty clear that it's a very

play05:32

general

play05:33

framework but the thing that makes

play05:35

reinforcement learning interesting is

play05:37

that there exist useful algorithms in

play05:39

reinforcement learning so in other words

play05:41

the

play05:42

algorithms of reinforcement learning

play05:44

make the framework interesting even

play05:46

though these algorithms have still a lot

play05:48

of room for improvement they can already

play05:50

succeed in lots of nonobvious

play05:53

tasks and so therefore it's worth

play05:55

pushing on these algorithms if you make

play05:57

really good reinforcement learning

play05:58

algorithms perhaps you'll build

play06:00

very clever agents and

play06:03

so the way the way the reinforcement

play06:05

learning problem is

play06:07

formulated

play06:09

is as follows you have

play06:13

some policy class where policy is just

play06:16

some function which takes inputs and

play06:18

produces actions and for any given

play06:21

policy you can run it and you can figure

play06:22

out its performance it's

play06:26

cost and your goal is just to find the

play06:29

best policy that minimizes cost

play06:31

maximizes reward rewards now one way in

play06:34

which this framework formulation is

play06:35

different from reality is that in

play06:38

reality the agents generate the rewards

play06:41

to

play06:42

themselves and the only true cost

play06:45

function that exists is

play06:52

survival

play06:54

so the if you want to build good

play06:57

reinforce um any reinforcement learning

play06:59

algorithm at all you need to represent

play07:00

the policy somehow so how you going to

play07:03

represent anything the answer is always

play07:05

using a neural network the neural

play07:08

network is going to take the actions and

play07:10

produce take the observations and

play07:11

produce actions and then for a given

play07:14

setting of The parameters you

play07:16

could figure out how you could calculate

play07:19

how good they are and then you could

play07:21

calc you could you could figure out how

play07:23

to compute the way to change these

play07:25

parameters to improve the model so if

play07:28

you make if you change the parameters of

play07:29

the model model many times and make many

play07:30

small

play07:31

improvements then you may make a big

play07:34

Improvement and very often in practice

play07:36

the Improvement ends up being big enough

play07:39

to solve the

play07:41

problem so I want to talk a little bit

play07:43

about how reinforcement learning

play07:44

algorithms

play07:46

work the modern ones the model-free ones

play07:49

the one that every the ones that

play07:50

everyone uses

play07:52

today and you take your policy and you

play07:56

add a little bit of Randomness to your

play07:57

actions somehow

play07:59

so you deviate from your usual

play08:03

behavior and then you simply check if

play08:06

the resulting cost was better than

play08:09

expected and if it

play08:11

is you make it more likely by the way I

play08:14

want for the for um I'm actually curious

play08:18

how many people are familiar with the

play08:20

basics please raise your hand okay so

play08:23

the audience here is informed so I can

play08:25

skip through the introductory Parts

play08:27

don't don't skip too much

play08:31

all

play08:33

right I'll skip only a little

play08:37

bit but the point is you do something

play08:39

randomly and you see if it's better than

play08:41

usual and if it is do more of that and

play08:44

do a lot of and repeat this many

play08:48

times so in reinforcement learning there

play08:51

are two classes of

play08:53

algorithms one of them is called policy

play08:56

gradients which is basically what I just

play08:58

described and there is a beautiful

play09:01

formula above which says that if you

play09:04

just take the derivative of your cost

play09:06

function and do a little bit of math you

play09:08

get something which is exactly as

play09:10

described where you just take a random

play09:12

take some random actions with a little

play09:14

bit of Randomness and if the result is

play09:16

better than expected then increase the

play09:18

probability of taking these actions in

play09:20

the future then there is also the q-

play09:22

learning algorithm which is a little bit

play09:24

less stable a little bit more sample

play09:26

efficient won't explain in too detail in

play09:29

too much detail how it how it works but

play09:31

it has the property that it is off

play09:34

policy which

play09:36

means that it can learn not just from

play09:39

its own actions and I want to explain

play09:41

what it means on policy means that you

play09:43

can only learn you can

play09:47

only learn at all if you the one who's

play09:51

taking the actions while off policy

play09:53

means that you can learn from anyone's

play09:55

other anyone's actions it doesn't just

play09:57

have to be your own so it's a bit more

play10:00

it seems like a more useful thing

play10:01

although it's interesting that the

play10:03

algorithm which is more

play10:05

stable the stable algorithms tend to be

play10:08

policy gradient based the on policy ones

play10:10

the ones that the Q Q learning which is

play10:13

of policy is also less stable at least

play10:15

as of today things change

play10:18

quickly now I'll spend a little bit of

play10:20

time illustrating how Q learning works

play10:24

even though I think it may be familiar

play10:26

uh this may be familiar to many to many

play10:27

people and basically have this Q

play10:29

function which tries to estimate for a

play10:31

given State and a given action how good

play10:34

or bad the future is going to

play10:36

be and you have this trajectory of

play10:40

States because your Pol your agent is

play10:41

taking many actions in the world it's

play10:44

relentlessly pursuing a

play10:46

goal well the Q function is this

play10:49

recursive Rec Rec recursive property

play10:52

where the Q function of (s, a) is basically

play10:54

just the Q function of (s', a')

play10:58

plus the reward you got to so you got

play11:00

this recursivity and you can use this

play11:02

recursivity to estimate the Q function

play11:03

and that gives you the Q learning

play11:05

algorithm and I want explain why it's

play11:07

off policy all you need is to um for for

play11:11

the purposes of this presentation just

play11:13

take my word for

play11:15

it and now what's the potential here why

play11:18

is this exciting so yes the

play11:21

reinforcement learning algorithms that

play11:22

we have right now they are very sample

play11:23

inefficient they're really bad at

play11:25

exploration yet although progress is

play11:27

being made

play11:30

but you can kind of see that if you had

play11:32

a really great reinforcement learning

play11:35

algorithm that would be just really data

play11:38

efficient and explore really well and

play11:40

make really good use

play11:42

of lots of sources of information then

play11:46

we'd be in good shape in terms of the go

play11:48

in terms of building intelligent

play11:51

agents but we still have work to do

play11:54

there still all be data

play11:57

inefficient so now I want to talk a

play11:59

little bit about meta learning which

play12:00

will

play12:01

be um an important part of this

play12:04

talk and I want to explain what it is so

play12:08

so there is the abstract the the dream

play12:11

of meta-learning the abstract idea that

play12:12

meta-learning is the idea that you can

play12:14

learn to

play12:15

learn kind of in the same way in which

play12:18

biological evolution has learned the

play12:21

learning algorithm of the

play12:25

brain and spiritually the way you'd

play12:27

approach this problem is by training a

play12:31

system not on one task but on many

play12:34

tasks and if you do that then suddenly

play12:37

you've trained your system to solve new

play12:40

tasks really

play12:42

quickly so that would be a nice thing if

play12:44

you could do that be great if you could

play12:46

learn to learn we wouldn't need to

play12:47

design the algorithms

play12:49

ourselves use the learning algorithm

play12:51

that we have right now to do the the

play12:52

rest of the thinking for

play12:54

us you're not quite there yet but meta

play12:58

learning has had

play13:00

a fair bit of success and I just want to

play13:02

show um explain the dominant

play13:07

the most common way of doing meta

play13:09

learning the most common way of doing

play13:11

meta learning is the most attractive one

play13:13

where you basically say that you you

play13:15

want to reduce the problem of meta

play13:17

learning to traditional deep learning

play13:20

where you basically take

play13:23

your familiar supervised learning

play13:25

framework and you replace each data

play13:27

point with a

play13:29

task from your training set of

play13:31

tasks and so what you do is that all all

play13:34

these algorithms have the same kind of

play13:35

high level shape where you have a model

play13:38

which receives information about the

play13:41

task plus an task instance and it needs

play13:44

to make the

play13:46

prediction and it's pretty easy to see

play13:48

that if you do that then you will train

play13:49

a model which can receive a new

play13:51

description of a task and make good

play13:54

predictions

play13:56

there and there have been some very um

play13:59

some pretty successful comp compelling

play14:02

success stories and I'll mention some of

play14:04

them a lot of a lot of metal learning

play14:06

work was done in Berkeley as well but

play14:09

I'll mention some of the visual ones the

play14:11

early ones that I think are notable

play14:14

because you see this task right here

play14:16

it's I took this figure from a paper by

play14:19

Brenden Lake

play14:21

et al. and this but I think the data set

play14:25

came earlier so this is the right

play14:27

citation but

play14:32

one of the criticisms of one of the ways

play14:35

in which neural Nets were criticized is

play14:40

that they can't learn quickly which is

play14:42

kind of

play14:44

true

play14:46

and a team in Josh Tenenbaum's lab have

play14:49

developed

play14:50

this data set which has a very large

play14:54

number of different characters and a

play14:57

very small number of examples for each

play14:58

character

play14:59

specifically as a challenge for neural

play15:02

networks and it turns out that the

play15:04

simple meta-learning approach where you

play15:06

just say that I want to train a neural

play15:08

network that can learn to recognize any

play15:10

character really quickly that approach

play15:12

Works super well and it's been able to

play15:14

get um superhuman performance and as far

play15:18

as I know the best performance is

play15:19

achieved by Mishra et al. and I believe it's

play15:22

the work done with

play15:24

Peter and they it's basically

play15:27

superhuman and it's just a neural net

play15:29

So Meta learning sometimes work really

play15:33

well there is also a very different take

play15:36

on meta learning which is a lot more uh

play15:38

which is a lot closer to the approach of

play15:40

instead of learning the parameters of a

play15:42

big model let's learn something Compact

play15:44

and small like the architecture or even

play15:47

the algorithm which is what evolution

play15:49

did and here you just say why don't you

play15:52

search in architecture space and find

play15:54

the best architecture this is also a

play15:56

form of meta learning it also

play15:58

generalizes really well because this

play15:59

work if you do if you learn an

play16:02

architecture on a small image data set

play16:04

it will work really well on a large

play16:05

image data set as well and the reason it

play16:07

generalizes well is because the amount

play16:09

of information

play16:12

in an architecture is small and this is

play16:15

work from Google by Zoph and

play16:19

Le so meta-learning

play16:21

works sometimes there are signs of Life

play16:24

The Promise is very strong it's just so

play16:26

compelling yeah I just just set

play16:28

everything right and then your existing

play16:30

learning algorithm you learn the

play16:32

learning algorithm of the future that

play16:34

would be

play16:35

nice so now I want to dive

play16:41

into a detailed description of

play16:45

uh one algorithm that we've done it's

play16:48

called hindsight experience replay and

play16:51

it's been a large collaboration with

play16:52

many people driven primarily by

play16:55

Andrychowicz and this is not exactly meta

play16:58

learning this is almost meta-

play17:00

learning

play17:04

and basically what happened there is

play17:14

that the way to think about what this

play17:16

algorithm does is that you try to solve

play17:19

a a hard

play17:22

Problem by making it harder and as a

play17:25

result it becomes

play17:27

easier and so you frame one problem into

play17:30

the framework into the context of many

play17:32

problems you have very many problems

play17:35

that you're will to solve simultaneously

play17:37

and that makes it

play17:38

easy and the problem here is basically a

play17:40

combination of exploration where in

play17:44

reinforcement learning we need to take

play17:46

the right action if you don't take the

play17:48

right action you don't learn if you

play17:49

don't get rewards how can you improve

play17:52

all your effort that doesn't lead to

play17:54

reward will be wasted would be nice if

play17:56

you didn't have that

play17:59

and

play17:59

so if our rewards are

play18:02

sparse and if we try to achieve our goal

play18:05

and to

play18:05

fail the model doesn't learn so how do

play18:07

we fix that so it's a really simple idea

play18:10

it's super intuitive you basically say

play18:12

you have the starting

play18:14

point you try to reach the state a but

play18:17

you reach the state b

play18:20

instead

play18:24

and so what can we learn something from

play18:27

this well we have a data we have a

play18:30

trajectory of how to reach the state B

play18:32

so maybe we can use this flawed attempt

play18:35

at reaching a as an opportunity to learn

play18:37

the state

play18:39

B and so this is

play18:42

very correct

play18:46

directionally means that you don't waste

play18:48

experience, but you need an off-

play18:50

policy algorithm in order to learn it

play18:52

and that's why I've emphasized the off-

play18:53

policy stuff earlier because your policy

play18:56

tries to reach a but you going to use

play18:59

this data to teach a different policy

play19:00

which is B so you have this big

play19:04

parameterized

play19:08

function and you just simply tell it

play19:12

which state you reach it's it's super

play19:14

it's super super straightforward and

play19:16

it's intuitive and it works really well

play19:20

too hindsight experience replay I'm

play19:23

going to show you the video it's

play19:26

a it's pretty cool

play19:31

and so in this case the reward is very

play19:33

sparse and

play19:34

binary

play19:37

and so so I I should just say because

play19:40

the reward is sparse in binary this

play19:42

makes it very hard for traditional

play19:44

reinforcement learning

play19:45

algorithms because you never get to see

play19:47

the reward if you were to shape your

play19:50

reward perhaps you could solve these

play19:52

problems a little bit better although we

play19:53

still found it um you know when the

play19:55

people that that were working on this

play19:57

have tried it they still found it

play20:00

difficult but this algorithm just works

play20:04

on these cool tasks and just the videos

play20:06

look

play20:08

cool so let's keep

play20:11

watching you get these very nice

play20:13

confident looking movements from The

play20:16

hindsight experience replay

play20:18

algorithm and it just makes sense like

play20:21

anytime something happens we want to

play20:22

learn from it and so we want this to be

play20:24

the basis of all future algorithms

play20:33

now again this is in the uh absolutely

play20:35

sparse binary reward setting which means

play20:38

that the standard reinforcement learning

play20:40

algorithms are very

play20:41

disadvantaged but even if you try to

play20:43

shape a reward what's one thing that you

play20:45

discover is that shaping rewards is

play20:47

sometimes easy but sometimes quite

play20:50

challenging and here is the same thing

play20:52

working

play20:53

on uh real physical blocks

play21:00

okay so this is

play21:03

the this this basically sums sums up the

play21:06

hindsight experience replay results can

play21:08

you tell us what acronym is represented

play21:10

by HER: hindsight experience

play21:13

replay and like what you can see is like

play21:17

if you want to one of the limitations of

play21:19

all these results is that

play21:21

they the state is very low

play21:25

dimensional and if you have a general

play21:27

environment which is very high

play21:28

dimensional inputs and very long

play21:30

histories you got a question of how do

play21:32

you represent your goals and so what it

play21:35

means is that representation learning is

play21:36

going to be very

play21:38

important and unsupervised learning is

play21:42

probably doesn't work yet but I think

play21:44

it's pretty

play21:45

close and we should keep thinking about

play21:48

how to really fuse unsupervised learning

play21:50

with reinforcement learning I think this

play21:52

is a fruitful area for the

play21:55

future now I want to talk about a

play21:57

different project on using

play21:59

on on doing transfer from SIM to real

play22:02

with meta learning and this work is by

play22:06

Peng et al. and multiple people who did this

play22:08

work are from Berkeley unfortunately I

play22:10

don't have the full list here

play22:15

so it would be nice if you could train

play22:18

our robots in simulation and then deploy

play22:20

them on physical robots simulation is

play22:23

easy to work

play22:26

with but it's also very clear that you

play22:28

can't simulate most

play22:30

things so then can anything be done

play22:34

here and I just want to explain one very

play22:37

simple idea of how you could do that and

play22:40

answer is basically you train a policy

play22:45

that doesn't just solve the task in one

play22:47

simulated setting but it solves the task

play22:50

in a family of simulated settings so

play22:52

what does it mean you say okay I'm going

play22:54

to randomize the friction coefficient

play22:56

and gravity and pretty much anything you

play22:59

can think of the length of your robotic

play23:02

arms and their masses and the

play23:03

frictions and

play23:06

sizes and your policy isn't told what

play23:09

you've done you just need to figure it

play23:10

it needs to figure it out by interacting

play23:12

with the

play23:15

environment well if you do that then

play23:17

you'll develop a robust policy that's

play23:19

pretty good at figuring out what's going

play23:20

on at least in the

play23:23

simulations and if this is done

play23:26

then the resulting system will be much

play23:29

more

play23:32

likely to generalize its Knowledge from

play23:36

the simulation to the real world and

play23:38

this is an instance of meta learning

play23:39

because in effect you're learning a

play23:42

policy which is very

play23:45

quick at identifying the precise physics

play23:49

you using so I would say this is a

play23:50

little bit, I mean calling it meta-

play23:51

learning is a bit of a stretch it's more

play23:52

of a kind of a robust adaptive Dynamic

play23:57

thing but it also has a meta-learning feel to

play24:01

it I want to show this video of the

play24:06

Baseline so this is what happens when

play24:08

you

play24:13

don't this is what happens when you

play24:15

don't uh do this uh robustification of

play24:17

the policy so you try to get the hockey puck

play24:20

into the

play24:22

red dot and it just fails really

play24:27

dramatically and and

play24:32

um doesn't look very good and if you add

play24:35

these robustifications then the result

play24:38

is a lot

play24:39

better than it's like you know even when

play24:41

it pushes it around and it overshoots

play24:43

it's just no

play24:54

problem so it looks pretty good

play24:59

so I think this toy example illustrates

play25:03

that the approach of training a policy

play25:05

in

play25:06

simulation and then making sure that the

play25:09

policy doesn't solve just one instance

play25:10

of the simulation but many different

play25:12

instances of it and figures out which

play25:15

one it

play25:21

is

play25:24

then it could succeed to generalizing to

play25:26

the real to the real physical robot

play25:29

so that's encouraging now I want to talk

play25:32

about another project by Frans et

play25:38

al.

play25:40

and it's about doing hierarchical

play25:43

reinforcement

play25:44

learning so hierarchical reinforcement

play25:46

learning is one of those ideas that

play25:47

would be

play25:49

nice if we

play25:52

could get it to work because one of the

play25:55

problems with reinforce with

play25:56

reinforcement learning as it's currently

play25:58

done today is that you have very long

play26:03

Horizons which you have trouble dealing

play26:05

with and you have trouble dealing with

play26:06

that exploration is not very directed so

play26:10

it's not as fast as you would like and

play26:13

the credit assignment is challenging as

play26:18

well and so we can do a very simple

play26:21

meta-learning approach

play26:24

where you basically say that you want to

play26:27

learn lowlevel actions which make

play26:31

learning fast so you have a distribution

play26:33

over

play26:34

tasks

play26:39

and you have a distribution of a tasks

play26:43

and you want to find a set of low-level

play26:47

policies such

play26:49

that if you use them inside the

play26:51

reinforcement learning algorithm you

play26:53

learn as quickly as

play26:55

possible and so if you do that

play26:59

you can learn pretty sensible Locomotion

play27:01

strategies that go in a persistent

play27:13

Direction and

play27:17

so here it is we got three policies the

play27:20

high level and the the system has been

play27:23

learned to find to find the policies

play27:26

that will solve problems like this and

play27:28

there is a specific distribution over

play27:30

this kind of problem that solves it as

play27:31

quickly as

play27:32

possible so that's pretty

play27:37

nice now one thing I want to mention

play27:40

here

play27:42

is the one important limitation of high

play27:45

capacity meta-learning so there are two

play27:48

kinds of there are two ways to do meta-

play27:50

learning one is

play27:51

by learning a big neural network that

play27:55

can quickly solve problems in a

play27:57

distribution of

play27:59

tasks

play28:01

and the other one is by learning an

play28:04

architecture or an algorithm so you

play28:06

learn a small object so if you learn an

play28:09

architecture if you learn an algorithm

play28:11

in a metal learning setting it will

play28:13

likely generalize to many other tasks

play28:16

but this is not the case or at least it

play28:18

is much less the case for high-capacity

play28:20

meta learning where if you just want to

play28:22

for example train a very large recurrent

play28:24

neural network

play28:28

you want to learn a very large recurrent

play28:30

neural network that

play28:33

solves many tasks it will be very

play28:37

committed to the distribution of task

play28:40

that you've train it on and if you give

play28:42

it a task that's meaningfully outside of

play28:44

the distribution it will not succeed so

play28:47

as a kind of a

play28:50

slightly the kind of example I have in

play28:52

mind is well let's say you take your

play28:53

system and you train it to do math you

play28:55

know a little bit of math and teach a

play28:56

little bit of programming and you teach

play28:58

it how to read could it do

play29:00

chemistry well not according to this

play29:02

Paradigm at least not obviously because

play29:05

it really needs to have the task to come

play29:07

from the same distribution in the

play29:09

training and in and in test

play29:12

time so I think for this to work we will

play29:14

need to improve our the generalization

play29:16

of our algorithms

play29:18

further and now I want to finish by

play29:20

talking about selfplay

play29:25

s self play is a really cool topic

play29:28

it's been around for a long

play29:31

time

play29:35

and I think it's really interesting and

play29:38

intriguing and

play29:40

mysterious and I want to start by

play29:43

talking about the uh

play29:46

very earliest work on selfplay that I

play29:49

know of and that's TD-Gammon it was done

play29:53

back in

play29:54

1992 it was by Tesauro

play29:58

single author

play29:59

work and in this work they've used Q

play30:03

learning with selfplay to train a neural

play30:07

network that beats the world champion in

play30:09

backgamon so I think this may sound

play30:11

familiar in 2017 and 2018 but that's in

play30:14

1992 that's back when your CPUs were of

play30:17

like I don't know 33 MHz or something

play30:20

and if you look at this plot you see it

play30:22

shows the performance as a function of

play30:24

time with different numbers of hidden

play30:26

neurons you see okay you have 10 hidden

play30:28

units versus that's the red that's the

play30:30

red red curve and 20 hidden units is the

play30:33

green curve all the way to the purple

play30:36

curve and yeah it's basically nothing

play30:38

changed in 25 years just the number of

play30:40

zeros and the number of hidden

play30:44

units and in fact they've even

play30:46

discovered unconventional strategies

play30:49

that surprised experts in

play30:51

backgammon so that's just

play30:54

amazing that it's that this work was

play30:56

done so long ago and it had so it was

play30:58

looking forward into the future so much

play31:01

and this approach basically remained

play31:03

dormant people were trying out a little

play31:05

bit but it really was revived by the

play31:07

Atari results of Deep-

play31:11

Mind

play31:17

and you know we've also had very

play31:19

compelling self-play results in Alpha go

play31:21

zero where they could train a very

play31:23

strong go player from no knowledge at

play31:26

all to beating all humans

play31:28

same is true about our Dota 2 results it

play31:32

again started from zero and just did

play31:34

lots and lots of self play and I want to

play31:37

talk a little bit about why I think self

play31:39

play is really

play31:40

exciting because you get things like

play31:43

this like you

play31:46

can self-play makes it possible to

play31:50

create very simple

play31:51

environments that support potentially

play31:54

unbounded

play31:56

complexity, unbounded sophistication in your agents

play32:03

unbounded scheming in social

play32:06

skills

play32:08

and it seems relevant towards building

play32:12

for building intelligent agents and

play32:14

there is work on artificial life by uh

play32:18

Carl Sims from 94 and you can see that

play32:22

already there it looks very very

play32:23

familiar you see these little evolved

play32:25

creatures whose morphologies are evolved

play32:27

as well and here they are competing for

play32:30

the possession of a little green

play32:33

Cube and again this was done in 1994 on

play32:36

Tiny

play32:37

computers and just like many and just

play32:40

like other uh promising ideas that we

play32:43

may that we are familiar with didn't

play32:45

have enough computer to really push them

play32:48

forward but I think that this is the

play32:51

kind of thing that we could get with

play32:53

large scale selfplay and I want to show

play32:56

some work that we've done just trying to

play32:58

revive this concept a little bit and I'm

play33:00

going to show this video this was work

play33:01

by Bansal et al., it was a productive summer

play33:05

internship there is a bit of music here

play33:07

let me turn it

play33:09

off actually maybe I can keep it

play33:15

on no I can't I

play33:19

can't but the point is what's the point

play33:22

you got the super simple environment

play33:24

which in this case is just the sumo

play33:26

ring

play33:28

and you just tell the agents you get a

play33:31

plus one when the other agents get gets

play33:33

outside the

play33:34

ring and the reason I find is so well I

play33:38

personally like it because these things

play33:39

look alive, like they have this breadth

play33:42

of complicated

play33:44

behaviors that they learn just in order

play33:47

to stay in the

play33:50

game and so you can kind of see that if

play33:52

you let your imagination run

play33:54

wild then yeah so this this selfplay is

play33:58

not

play34:00

symmetric and also the human these

play34:02

humanoids are a bit unnatural because

play34:04

they they they don't feel pain and they

play34:08

don't get tired and they don't have you

play34:10

know a whole lot of energy

play34:18

constraints oh it blocked it that was

play34:21

good so that's pretty good too so here

play34:24

the goal you can guess what the goal is

play34:28

that that was that was a nice

play34:37

Dodge and now this so this is example so

play34:40

one of the things that would be

play34:42

nice is that if you could take these

play34:45

selfplay environments train our agents

play34:47

to do some kind

play34:49

of tasks from the selfplay and then take

play34:53

the agent outside and get it to do

play34:55

something useful for us I think if that

play34:57

possible that would be amazing and here

play34:59

there is like a tiniest the tiniest of

play35:01

tests where we take the sumo wrestling

play35:03

agent and we just apply we put it we put

play35:06

it isolated and alone inside the ring it

play35:08

doesn't have a friend and we just apply

play35:11

big forces on it and see if it can

play35:12

balance itself and of course it can

play35:14

balance itself because it's been

play35:17

trained because it's been trained

play35:19

against an opponent that tried to push

play35:21

it so it's really good at resisting

play35:23

force in

play35:24

general and so kind of the mental image

play35:27

here is that imagine you take a ninja

play35:30

and then you ask it to to learn to

play35:32

become a chef because the ninja is

play35:34

already so dexterous it should have a

play35:36

really fairly easy time to be a very

play35:38

good good cook that's the kind of high

play35:41

level idea here it hasn't happened yet

play35:43

but one thing I'd like to ask yeah and

play35:46

so and so I think one of the key

play35:48

questions in this line of work is how

play35:52

can you set

play35:54

up a type of self-play environment which

play35:58

once you

play35:59

succeed it can solve useful tasks for us

play36:02

which are different just from the

play36:03

environment itself and that's the big

play36:05

difference between games in games the

play36:07

goal is to actually win the environment

play36:09

but that's not what we want we want it

play36:10

to just be generally good at being

play36:13

clever and then Sol solve a problems you

play36:15

know do my homework type

play36:18

agent I want to um yeah I want to show

play36:23

one one slide which I think is

play36:25

interesting so one of the one of the

play36:28

reasons like if you like I would like to

play36:30

ask you to let your imaginations run

play36:32

wild and imagine

play36:34

that neural net the hardware designers

play36:38

of neural Nets have built enormous giant

play36:40

computers and this selfplay has been

play36:42

scaled up massively one thing that's

play36:46

notable that we know about biological

play36:50

evolution is that social species tend to

play36:54

be tend to have larger brains they tend

play36:56

to be smarter

play36:58

we know that this is true for any it is

play37:01

very often the case that whenever you

play37:02

have two species which are related but

play37:04

one is social and one isn't then a

play37:07

social one tends to be smarter we know

play37:10

that human biological evolution really

play37:12

accelerated over the past few million

play37:15

years probably because at that

play37:19

point well this is a bit

play37:22

speculative but the theory here my

play37:25

theory at least is that humans became

play37:28

sufficiently competent with respect to

play37:30

their environment so you're not they

play37:32

stop being afraid of The Lion and the

play37:34

biggest concern became the other human

play37:36

what the other humans think of you what

play37:39

are they gossiping about you where you

play37:41

stand in the pecking

play37:42

order and so I think this kind of

play37:48

environment created an incentive for the

play37:50

large brains and I was able you know as

play37:54

is often the case in science it's very

play37:55

easy to find some scientific support for

play37:58

your hypothesis which we did so there

play38:02

exists a paper in

play38:07

science which supports the claim

play38:09

that social

play38:13

environments stimulate the development

play38:15

of larger clever brains and the specific

play38:18

evidence they present there is the

play38:20

convergent evolution in smart social

play38:22

apes and smart birds like crows

play38:28

who apparently they have similar

play38:29

cognitive functions even though they

play38:30

have very different brain structures now

play38:33

I'm only 75% confident in this claim but

play38:36

I'm pretty sure that birds don't have

play38:38

the same kind of Cortex as we

play38:40

do because the evolutionary split

play38:42

occurred a long time back in the

play38:46

past

play38:51

so I think it's interesting I think it's

play38:53

I like I like I think this is intriguing

play38:56

at the very least but yeah you could

play38:59

create a society of agents and just keep

play39:00

scaling it up and perhaps you're going

play39:02

to

play39:04

get agents that are going to be smart

play39:07

now I want to finish one with one

play39:09

observation

play39:11

about environments that are trained with

play39:13

selfplay and this is and this is um a

play39:17

plot from our from the the strength of

play39:19

our DOTA bot as a function of

play39:21

time going from April all the way to

play39:25

August and basically you just fix the

play39:27

bugs and you scale up your selfplay

play39:30

environment and you scale up the amount

play39:32

of compute and you get a very rapid

play39:34

increase in the strength of the

play39:37

system and it makes sense in selfplay

play39:41

environments the computer is the data so

play39:45

you can generate more of

play39:46

it so I guess I want to finish with the

play39:49

provocative question which

play39:51

is if you have a self a sufficiently

play39:54

open-ended self-play

play39:56

environment will get extremely rapid

play40:00

increase in the cognitive ability of

play40:01

your

play40:02

agents all the way to superhuman and on

play40:06

this note I will finish the presentation

play40:09

thank you so much for your

play40:16

attention yeah before before before I

play40:19

before I start the question answering um

play40:21

session I want to say that one one

play40:22

important thing I want to say is that

play40:24

many of these Works were done in

play40:25

collaboration with many people from

play40:27

Berkeley and especially Pieter Abbeel and I

play40:29

want to I want to highlight that okay

play40:32

great uh I wonder if you can show the

play40:34

last slide cuz you it seemed like it was

play40:36

a very important conclusion but you went

play40:38

over it very quickly yeah so this is a

play40:40

very this is a it's

play40:43

a it is a bit

play40:47

speculative and it really is a question

play40:50

of the specific statement here is that

play40:54

if you

play40:55

believe that you going to get truly

play40:58

smart human level agents as a result of

play41:00

some kind of massive scale selfplay

play41:04

will you also experience the same kind

play41:08

of Rapid increase in the capability of

play41:10

the agent that you see that that we we

play41:14

saw in our experience with DOTA and in

play41:17

general because you can convert compute

play41:19

into Data so you put more compute this

play41:22

thing gets better

play41:27

yeah so I mean that's that's sort of a

play41:30

general remark obviously you do you

play41:32

compute more you get you get better

play41:33

results but I didn't quite grasp the um

play41:38

uh the difference between these two

play41:39

panels well

play41:46

so so it's really a question of

play41:54

uh so let's say it really boils down to

play41:57

this

play41:57

it's a question of where do the the what

play42:00

are the limits to progress in the fields

play42:03

and in capabilities are do the limits

play42:05

come

play42:06

from like in other

play42:09

words given the right algorithms which

play42:12

currently don't yet

play42:13

exist once you have them how will the

play42:17

increase in the in the actual capability

play42:20

of the system look like I think there is

play42:22

definitely a possibility that it will be

play42:24

like on the right side that once you

play42:26

have you know you figure out your

play42:28

hierarchical reinforcement learning you

play42:29

figured out concept

play42:31

learning you got your supervis learning

play42:33

is in good

play42:34

shape and then the massive neural net

play42:38

Hardware arrives and you have a huge

play42:40

neural net much bigger than the human

play42:42

brain this will happen like how how how

play42:46

will the plot look like over

play42:49

time so you're you're projecting that

play42:52

we've only seen the very beginning okay

play42:54

so let's uh throw it up to questions and

play42:56

I see you already have your hand

play43:00

up thank you for that um you mentioned

play43:03

hierarchy and I'm wondering if you have

play43:05

an example of a hierarchical selfplay

play43:08

that would uh you know increase the

play43:10

slope of this curve yeah so we don't

play43:12

have, we have not tried hierarchical self-play

play43:16

this is more a statement from our

play43:18

experience with our DOTA bot where you

play43:21

start at basically losing to everyone

play43:24

and then your true skill metric which is

play43:26

like an Elo rating just increases pretty

play43:29

much linearly all the way to the best

play43:31

humans so that's and I think this is a

play43:35

gen it seems like it could be a general

play43:38

property of self-play

play43:42

systems which game was this DOTA DOTA

play43:47

yeah okay more

play43:50

questions hey IIA hey very nice talk

play43:53

thank you I had a question on

play43:55

environments do you have any thoughts on

play43:57

going Beyond like sumo wrestling

play44:00

environments like what what are good

play44:01

environments to to

play44:04

study well these are the question of

play44:07

what

play44:08

makes a good

play44:12

environment

play44:15

so I think there are two ways of getting

play44:18

good

play44:19

environments one of them is from trying

play44:22

to solve problems that we care about and

play44:24

they naturally generate environments

play44:29

I think another one is to think of

play44:32

open-ended environments where you can

play44:34

build lot so one of the one of the

play44:37

slightly

play44:38

unsatisfying features of most of the

play44:40

environments that we have today is that

play44:42

there are a little bit not open-ended

play44:44

you got a very kind of narrow domain and

play44:47

you want to perform a task in this

play44:48

narrow domain but one but some

play44:50

environments which are very interesting

play44:52

to think about are one where there is no

play44:53

limit to the depth of these environments

play44:56

and some of these examp examples include

play44:58

programming math even Minecraft in

play45:01

Minecraft you could build structures of

play45:03

greater and greater complexity and you

play45:04

know at first people build little homes

play45:06

in Minecraft then they build big castles

play45:09

and now people you can find people who

play45:11

are building entire cities and even

play45:12

computers inside Minecraft now obviously

play45:14

Minecraft has an obvious challenge which

play45:16

is problem which is what do we want the

play45:20

agents to do there so it needs to be

play45:22

addressed but kind of uh directionally

play45:27

these would be nice environments to

play45:29

think about

play45:31

more

play45:33

okay someone up

play45:41

here uh this is this is sort of similar

play45:44

to that last

play45:45

question but I was wondering uh what the

play45:48

effect if you know of complicated uh

play45:52

non-agent objects and non-agent entities

play45:55

in the environment is on how well

play45:57

self-play works for instance in the Sumo

play46:00

environment the reason that the

play46:03

self-play agents can become very complex

play46:04

in use very complex strategies is

play46:06

because that's necessary in order to

play46:08

compete against this other agent which

play46:10

is also using very complex strategies If

play46:13

instead you were uh working maybe not

play46:17

against another agent but against a very

play46:18

simple agent that doesn't train but

play46:20

through some very complicated system you

play46:22

had to operate a lot of machines in this

play46:25

environment or something like that how

play46:26

how does that affect the the

play46:28

effectiveness of this yeah I mean I

play46:31

think I think it depends a little bit on

play46:32

the specifics like for sure that you

play46:34

know if you have a complicated

play46:35

environment or complicated problem was

play46:37

produced

play46:38

somehow then you will also need to

play46:41

develop a pretty competent agent I think

play46:43

the thing that's interesting about the

play46:45

self-play approach is that you generate

play46:48

the challenge yourself so the question

play46:51

of where does The Challenge come from is

play46:55

answered for you there's a mic problem

play46:56

oh there's a mic problem might be a mic

play46:58

problem uh oh I know it doesn't seem to

play47:01

be muted let me check okay anyway let's

play47:04

let's continue any more

play47:08

questions okay uh so uh oh Bo we have

play47:12

quite a

play47:16

Um, going back a bit to the hindsight experience replay you talked about: you give the example of trying to reach the red spot A and instead reaching some spot B, and you're going to use that to train. I was wondering if you could elaborate on that a little bit more. I'm not very familiar with DDPG, so perhaps that's critical to understanding this, but what I'm wondering is: how do you turn every experience into "hitting the ball this way translates into this motion" without doing it in a reward-based way?

Yeah, so basically you have a policy which is parameterized by a goal state. So, in effect, you have a family of policies, one for every possible goal. Then you say: okay, I'm going to run the policy that tries to reach state A, and it reached state B instead. So I'm going to say, well, this is great training data for the policy which reaches state B. That's how you do it, in effect. If you want more details, we could talk about it offline.
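To make the goal-relabeling idea above concrete, here is a minimal sketch of hindsight relabeling in a goal-conditioned replay buffer. It is not the speakers' actual implementation: the transition layout, the scalar states, and the `reward_fn` are assumptions for illustration only.

```python
import random
from collections import deque

# Minimal sketch of hindsight experience replay (HER)-style goal relabeling.
# Assumptions: transitions are (state, action, next_state, goal) tuples with
# scalar states for simplicity, and reward_fn gives a sparse reward of 1.0
# when the achieved state matches the goal. Illustrative only.

def reward_fn(state, goal, tol=1e-3):
    """Sparse reward: 1.0 if the achieved state matches the goal."""
    return 1.0 if abs(state - goal) < tol else 0.0

class HindsightReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add_episode(self, episode, original_goal):
        """Store the episode twice: once with the goal we were trying to
        reach, and once relabeled with the goal we actually reached."""
        achieved_goal = episode[-1][2]  # the final next_state is what we reached
        for (s, a, s_next, _) in episode:
            # Original goal A (reward is usually 0: we failed to reach it).
            self.buffer.append((s, a, s_next, original_goal,
                                reward_fn(s_next, original_goal)))
            # Hindsight goal B (reward is 1 at the end: we did reach it).
            self.buffer.append((s, a, s_next, achieved_goal,
                                reward_fn(s_next, achieved_goal)))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Any off-policy learner (DDPG was the one mentioned in the question) can then be trained on the relabeled transitions exactly as if reaching state B had been the intention all along, so no experience is wasted.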

Okay, so, two questions. One is a very simple question about HER again. If a task is difficult, for example hitting a fastball in baseball, even the best humans can do it, you know, 38% of the time or something like that. So the danger is that if you miss, you're going to say, "oh, I was trying to miss, so now I take this as a training example of how to miss," which is not right: you were actually doing the optimal action, but your perceptual apparatus just can't track the ball fast enough, so that's the best you can do. It seems like you would run into trouble on tasks like that.

Okay, should I answer the first question before you ask the second? Let's do that. So, the method is still not absolutely perfect. But on the question of what happens when you miss while you're trying to actually succeed: then, yes, we have a lot of data on how to not reach the state. You're trying to reach a certain desired state which is hard to reach, you try to do that, and you reach a different state. So you say, okay, I will train my system to reach this state, but next time I'm still going to try for the one I want. What it means is that for that specific problem this approach will be less beneficial than for problems where the tasks are a little bit more continuous, where you can have more of a hill-climbing effect. You gradually, let's say in the context of programming, learn to write simple programs, you learn to write different subroutines, and you gradually increase your competence, the set of states you know how to reach. So I agree that when there is a very narrow state which is very hard to reach, this will not help, but whenever there is a kind of continuity to the states, this approach will help.

Okay, so the second question is about self-play. When I saw your title, what I thought you were going to say was this: if you think about AlphaGo, if we tried to train AlphaGo by playing it against the existing world champion, it would never win a single game for the first 50 million games, so it would learn nothing at all. But because we play it against itself, it always has a 50% chance of winning, so you're always going to get a gradient signal no matter how poorly you play.

Yeah, that's very important.

So the question is: is there some magic trick there that you can then apply to tasks that are intrinsically difficult to get any reward signal on? If you take spider solitaire, for example, and you watch an ordinary human play spider solitaire, they lose the first 100 games and then they give up; they say, "this is impossible, I hate this game." There's no reward signal there, because you're just not good enough to ever win. So is there a way you can convert spider solitaire into a two-player game and somehow guarantee that you always get a gradient signal for that game?

That's a very good question, and what you said is a very good point. Before I elaborate on your question, I just want to also talk about the fact that one of the key things about self-play is that you always have an evenly matched opponent, which means you also have a potentially indefinite incentive for improvement: even if you are really, really competent, if you have a super-competent agent, the opponent will be just as competent, and so, if done right, the system will be incentivized to keep improving. I think that's an important thing to emphasize. That's also, by the way, why the exploration problem is much easier: you explore the strategy space together with your opponent. And it's actually important not to have just one opponent but a whole little family of them, for stability. That's basically crucial.
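As an aside, here is a minimal sketch of what "a whole little family of opponents" can look like in practice: self-play against a pool of past policy snapshots rather than only the latest one. The `Policy`, `play_match`, and `update` callables are hypothetical placeholders, not anything from the systems described in the talk.

```python
import copy
import random

# Minimal sketch of self-play with an opponent pool. Assumes a Policy class
# with trainable state, play_match(a, b) -> +1/-1 from a's perspective, and
# update(policy, result) performing one training step. All placeholders.

def self_play_training(Policy, play_match, update,
                       iterations=10_000, snapshot_every=200, pool_size=20):
    learner = Policy()
    opponent_pool = [copy.deepcopy(learner)]  # start with a frozen copy

    for step in range(iterations):
        # Sample an opponent from past snapshots for stability, so the learner
        # cannot overfit to (and then forget how to beat) a single opponent.
        opponent = random.choice(opponent_pool)
        result = play_match(learner, opponent)   # roughly 50% wins by construction
        update(learner, result)                  # so there is always a gradient signal

        if (step + 1) % snapshot_every == 0:
            opponent_pool.append(copy.deepcopy(learner))
            if len(opponent_pool) > pool_size:
                opponent_pool.pop(0)             # keep a bounded family of opponents

    return learner
```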

Now, on your second question of what to do when you just can't get the reward: very often, if the problem is hard enough, I think there isn't much you can do without having some kind of deep domain knowledge, some side information about the task. But one approach that is popular, and has been pursued by multiple groups, is to use asymmetric self-play for exploration. You've got a predictor which tries to predict what's going to happen, and you've got a policy which tries to take actions that surprise the predictor. The predictor basically has opinions about what the consequences of the different actions will be, and the actor tries to find regions of the space which surprise the predictor. So you have this kind of, not exactly self-play, more of a competitive, adversarial scenario, where the agent is incentivized to cover the entire space. It doesn't answer the question of how to solve a hard task like spider solitaire, because if you actually need to be super good, I think that's tough. But at least you can see how this can give you a general guide of how to move forward in general.
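The predictor-versus-actor idea above is closely related to prediction-error ("curiosity") exploration bonuses. Below is a minimal sketch of that flavor: a forward model predicts the next state, and the policy receives an intrinsic reward equal to the model's prediction error. It illustrates the general mechanism only; the network sizes and the surrounding training loop are assumptions, not the specific systems mentioned in the talk.

```python
import torch
import torch.nn as nn

# Minimal sketch of surprise-driven exploration: the predictor tries to
# anticipate transitions; the policy is rewarded for transitions the
# predictor gets wrong, pushing the agent to cover unexplored regions.

class ForwardModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state):
    """Surprise bonus: how badly the predictor anticipated the transition."""
    with torch.no_grad():
        pred = model(state, action)
        return ((pred - next_state) ** 2).mean(dim=-1)  # per-sample error

def predictor_update(model, optimizer, state, action, next_state):
    """The predictor side of the game: minimize its own surprise."""
    loss = ((model(state, action) - next_state) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The policy can then be trained with any RL algorithm on `extrinsic_reward + beta * intrinsic_reward`, while the predictor keeps shrinking the bonus in regions the agent has already covered.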

I think we had a question back here.

What do you think is exciting in terms of new architectures? For example, people have been adding memory structures to neural nets, like the DNC paper. What do you see as the role of new architectures in actually achieving what we want for generalization and meta-learning?

Yeah, I think this is a very good question, the question of architectures. I'd say that it's very rare to find a genuinely good new architecture; genuine innovation in architecture space is uncommon. I'd say the biggest innovation in architecture space over the past many years has been soft attention. Soft attention is legitimately a major advance in architectures. But it's also very hard to innovate in architecture space, because the basic architecture is so good. I think that better generalization will be achieved, and this is my opinion, it's not backed by data yet, not by means of just improving the architecture, but by means of changing the learning algorithm, and possibly even the paradigm of the way we think about our models. I think things like minimum description length and compression will become a lot more popular. These are non-obvious questions, but basically I think architecture is important whenever you can actually find good new architectures for the hard problems.
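Since soft attention is singled out here as the biggest architectural advance, here is a minimal sketch of its standard scaled dot-product form; this is a generic textbook formulation added for reference, not code from the talk.

```python
import math
import torch

def soft_attention(query, keys, values):
    """Scaled dot-product soft attention.

    query:  (batch, d)      - what we are looking for
    keys:   (batch, n, d)   - what each memory slot advertises
    values: (batch, n, d_v) - what each memory slot contains
    Returns a weighted mix of the values. The softmax makes the read
    differentiable end to end, which is what makes "soft" attention
    trainable by backpropagation.
    """
    d = query.shape[-1]
    scores = torch.einsum('bd,bnd->bn', query, keys) / math.sqrt(d)
    weights = torch.softmax(scores, dim=-1)              # attention distribution
    return torch.einsum('bn,bnv->bv', weights, values)   # weighted read
```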

How about curriculum learning? To learn to hit a fastball, start with a slow ball.

Yeah, for sure, curriculum learning is a very important idea. It's how humans learn, and it's, I guess, a pleasant surprise that our neural networks also benefit from curricula. One nice thing about self-play is that the curriculum is built in; it's intrinsic. What you lose in self-play is the ability to direct the self-play to a specific point.
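To make the fastball example concrete, here is a minimal sketch of an explicit (hand-designed) curriculum that ramps task difficulty as performance improves; `make_env` and `train_and_evaluate` are hypothetical placeholders for an environment factory and an RL training step.

```python
# Minimal sketch of curriculum learning on the "hit a fastball" example:
# the pitch speed is increased only once the agent reliably succeeds at the
# current speed. make_env and train_and_evaluate are hypothetical placeholders.

def curriculum_training(agent, make_env, train_and_evaluate,
                        start_speed=40.0, max_speed=95.0, step=5.0,
                        success_threshold=0.8, max_rounds=1000):
    speed = start_speed
    for _ in range(max_rounds):
        if speed > max_speed:
            break                                   # curriculum completed
        env = make_env(pitch_speed=speed)
        success_rate = train_and_evaluate(agent, env)  # fraction of balls hit
        if success_rate >= success_threshold:
            speed += step                           # graduate to a faster pitch
        # otherwise keep training at the current difficulty
    return agent
```

In self-play, by contrast, this schedule is implicit: the opponent improves together with the learner, which is the "built-in curriculum" mentioned above.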

So I have a question. You showed us the nice videos, the wrestlers and the robots and so forth, and I assume it's similar to deep learning in the sense that there's a framework of linear algebra underlying the whole thing. So is there anything there other than linear algebra, I mean, other than a neural net?

So it's even simpler than that: you just take two agents and you apply reinforcement learning algorithms, and a reinforcement learning algorithm is a neural net with a slightly different way of updating the parameters. So it's all matrix multiplication all the way down.

Yeah, you just want to multiply big matrices as fast as possible.

Right. Okay, oh, we have one more.

So, you mentioned something about transfer learning and the importance of that. What do you think about concept extraction and transferring that, and is that something you think is possible, or that people are doing right now?

So, I think it really depends on what exactly you mean by concept extraction. I think it's definitely the case that our transfer learning abilities are still rudimentary, and we don't yet have methods that can extract seriously high-level concepts from one domain and then apply them in another domain. I think there are ideas on how to approach that, but nothing that's really convincing on a task that matters. Not yet.

Well, we really had a lot of questions, and the reason is that you gave very short, succinct answers, for which we are very grateful. Thank you very much. Let's give him a great hand.

Thank you.


Related Tags
Deep Learning · Meta-Learning · Self-Play · Artificial Intelligence · Neural Networks · Machine Learning · Algorithmic Innovation · Reinforcement Learning · Model Generalization · Technology Outlook