Ilya Sutskever | OpenAI: Meta-Learning and Self-Play
Summary
TLDR: This talk takes a deep look at recent progress in deep learning and reinforcement learning. The speaker first explains why deep learning works, emphasizing the importance of finding the best short program that explains the data. He then discusses reinforcement learning algorithms, in particular policy gradients and Q-learning, and points out their challenges with exploration and sparse rewards. He also introduces the concepts of meta-learning and self-play, showing their potential for improving learning efficiency and solving complex tasks. Through several project case studies, such as Dota 2 and robot learning, the speaker demonstrates practical applications of these algorithms and offers an outlook on future research directions.
Takeaways
- 🤖 Progress in automation and artificial intelligence owes much to the success of deep learning, even though why deep learning works is not self-evident.
- 🧠 The core of deep learning is finding the shortest program, or the best small circuit, that explains the data, which relates to concept classes in machine learning.
- 🔍 The backpropagation algorithm is the key to deep learning; although the reason for its success remains a mystery, it has driven much of the progress in AI.
- 🎯 Reinforcement learning is a framework for describing agent behavior, where agents learn by interacting with an environment and receiving rewards; although the algorithms have room for improvement, they can already succeed at many tasks.
- 🔄 The goal of meta-learning is to let machines learn how to learn, by training a system on many tasks so that it can learn new tasks quickly.
- 🔧 Framing one hard problem in the context of many problems can make it easier to solve, as in the Hindsight Experience Replay algorithm.
- 🤹‍♂️ Self-play is a training method in which agents learn strategies by competing against each other; it has achieved remarkable results in games such as Go and Dota.
- 🌐 In self-play environments, agents can develop unboundedly complex behaviors, which may be useful for building agents with advanced intelligence.
- 🧬 Combining simulation with self-play and randomization makes it possible to train policies in simulation that perform well in the real world.
- 🧠 Innovations in neural network architecture, such as adding memory structures, matter for improving models' generalization and learning ability.
Q & A
Why does deep learning work?
-Deep learning works because it can find the best small circuit that explains the data. In theory, the best short program is the best way to explain data, but finding the best short program is intractable in practice. Small circuits can still perform non-obvious computation, and the best small circuit explaining the data can be found with backpropagation; this is the key to why deep learning works.
What is meta-learning, and why is it important for AI?
-Meta-learning is learning how to learn, achieved by training a system on many tasks. The goal of meta-learning is to train a model that can quickly solve new tasks. Its importance lies in improving the generalization ability of learning algorithms and reducing dependence on task-specific design, thereby advancing AI.
What role does reinforcement learning play in AI?
-Reinforcement learning provides a framework for describing agent behavior, in which agents learn by interacting with an environment and receiving rewards. Its importance lies in the existence of useful algorithms which, even with much room for improvement, can already succeed at many non-obvious tasks.
What are policy gradients and Q-learning, and what is their role in reinforcement learning?
-Policy gradients is a reinforcement learning algorithm that adds randomness to the policy and adjusts it based on whether outcomes were better or worse than expected. Q-learning is another algorithm, which learns by estimating the future value of a given state and action. Both improve an agent's decision-making so that it can learn more effectively to reach its goals.
What is Hindsight Experience Replay, and how does it help with the exploration problem in reinforcement learning?
-Hindsight Experience Replay is an algorithm that addresses the exploration problem by turning failed attempts into opportunities to learn new goals. For example, if an agent tries to reach state A but reaches state B instead, the algorithm uses that trajectory to learn how to reach state B. This way the agent wastes no experience and learns from every attempt.
What are the applications of self-play in AI?
-Self-play is a training method in which an agent learns by playing against different versions of itself. It has achieved remarkable success in games such as Go, chess, and Dota. Its value lies in generating complex strategies and behaviors while providing a continuous challenge and incentive to learn.
Why can self-play produce unbounded complexity?
-Self-play can produce unbounded complexity because it creates an environment in which agents continually challenge and surpass themselves. As the agents get stronger, the strategies and behaviors they produce become more complex, driving continual growth in the agents' cognitive abilities.
In self-play, how is the agent's incentive to improve guaranteed?
-In self-play, an agent always plays against opponents of its own level, which means it is always challenged. Even when the agent becomes very good, its opponent is equally good, so the agent always has an incentive to improve and learn new strategies.
Why is transfer learning important in AI?
-Transfer learning lets an agent apply skills and knowledge learned in one domain to another. This matters because it can reduce the data and time needed to learn new tasks and improve an agent's adaptability and flexibility.
How can a model's generalization ability be improved in AI?
-Improving generalization usually involves better learning algorithms, training on more diverse datasets, and regularization techniques. In addition, techniques such as meta-learning, self-play, and Hindsight Experience Replay are used to improve performance on new tasks.
Outlines
🤖 Deep Learning and Meta-Learning
This section introduces OpenAI's work over the past year, with particular emphasis on the two themes of meta-learning and self-play. It discusses why deep learning works and how finding the best short program would give the best explanation of, and predictions from, the data. It notes that small circuits are the next best thing, since they can perform non-obvious computation and the best small circuit can be found by backpropagation. It also mentions reinforcement learning algorithms, particularly policy gradients and Q-learning, and their real-world applications and challenges.
🧠 Reinforcement Learning and Algorithms
This section goes deeper into reinforcement learning, including its algorithms and their use in building intelligent agents. The reinforcement learning framework lets an agent learn by interacting with an environment and receiving rewards. It discusses representing the policy with a neural network and improving the model by changing its parameters, and introduces modern reinforcement learning algorithms, including policy gradients and Q-learning, and how they use randomness and recursion to optimize agent behavior.
🔄 Meta-Learning and Fast Learning
This section discusses the concept of meta-learning: achieving fast learning by training a system on many tasks. It presents the two main approaches: training a large neural network to quickly solve problems from a task distribution, and learning an architecture or algorithm to achieve broader generalization. It gives examples of meta-learning in character recognition and image datasets and highlights its potential for improving learning efficiency.
🎯 Goal-Directed Learning and Exploration
This section introduces an algorithm called Hindsight Experience Replay, which tackles the exploration problem by framing one hard problem in the context of many problems. When an attempt to reach one goal fails, the algorithm uses that attempt to learn to reach the goal that was actually achieved. This makes learning more efficient, since no experience is wasted. It also discusses improving the generalization of algorithms through self-play.
🤖 Self-Play and Intelligent Agents
This section explores the importance of self-play in developing intelligent agents. Self-play allows simple environments to produce agents of potentially unbounded complexity, which is crucial for building agents with advanced social skills. It gives examples of self-play in board games and esports and discusses applying agents trained in self-play environments to real-world tasks.
🧠 Neural Networks and Self-Play
This section discusses the use of neural networks in self-play and how self-play can raise an agent's intelligence. It emphasizes that in self-play environments compute serves as the source of data, and poses a hypothesis: if a self-play environment is sufficiently open-ended, agents' cognitive abilities may improve rapidly, perhaps to a superhuman level.
🔧 Transfer Learning and Concept Extraction
This section explores the role of transfer learning in improving agent performance, especially when applying what was learned in one domain to another. It discusses the possibility of and challenges in concept extraction and the current state of research, emphasizing the difficulties that remain in achieving effective transfer learning.
📚 Curriculum Learning and Self-Play
This section discusses the role of curriculum learning in self-play and how self-play builds in its own curriculum. It emphasizes how self-play simplifies the exploration problem and continually incentivizes agents to improve. It also raises key questions about how to set up self-play environments and how to apply agents trained with self-play to useful tasks.
🧐 New Architectures and Learning Algorithms
This section explores the role of new architectures in neural networks and how they affect learning algorithms and model generalization. It discusses soft attention as an example of architectural innovation in recent years and suggests that better generalization may come from changing the learning algorithm and the modeling paradigm.
Keywords
💡 Deep Learning
💡 Meta-Learning
💡 Self-Play
💡 Reinforcement Learning
💡 Backpropagation
💡 Small Circuits
💡 Policy Gradients
💡 Q-Learning
💡 Hindsight Experience Replay
💡 Sim-to-Real Transfer
Highlights
An analysis of why deep learning works, proposing the theory that finding the best short program to explain the data is the best way to generalize.
Small circuits are the next best thing after short programs, because they can perform non-obvious computation.
The success of the backpropagation algorithm is a fortunate mystery; it has powered all the progress in AI over the past six years.
The reinforcement learning framework describes agent behavior: interacting with an environment and receiving rewards for success.
Modern reinforcement learning algorithms explore better behaviors by adding randomness to the policy.
Policy gradients and Q-learning are the two main classes of reinforcement learning algorithms, differing in stability and sample efficiency.
The idea of meta-learning is to train a system on many tasks so that it can quickly solve new ones.
Meta-learning enables fast learning from a small number of examples, as validated on character recognition tasks.
One challenge of meta-learning is designing algorithms that generalize to tasks outside the distribution seen during training.
Introduction of the Hindsight Experience Replay algorithm, which improves learning efficiency by turning failed attempts into opportunities to learn new goals.
Demonstration of Hindsight Experience Replay's effectiveness in sparse-reward settings and its potential on real physical hardware.
Discussion of how to train policies in randomized simulations, a form of meta-learning, so they generalize to the real world.
Introduction of a hierarchical reinforcement learning method that speeds up learning by learning low-level actions.
Self-play is a way to create complex agents; it allows unbounded complexity to emerge in simple environments.
A key advantage of self-play is that it provides an environment that continually incentivizes improvement, because the opponent is always equally good.
An open question: if a self-play environment is sufficiently open-ended, will agents' cognitive abilities increase extremely rapidly?
Emphasis on the importance of collaboration with Berkeley to these research results.
Transcripts
I want to talk about some of the work we've done at OpenAI over the past year. The talk will be a narrow subset of that work, focusing on meta-learning and self-play, which are two topics I like very much. But I've been told that this should be a slightly broader, a little bit more general-interest talk, so I want to begin the presentation by talking a little bit about why deep learning actually works. It's not a self-evident question why deep learning works; it's not self-evident that it should work. And I want to give a perspective on that which I think is not entirely obvious.
So, one thing that you can actually prove mathematically is that the best possible way of generalizing, one that's completely unimprovable, is to find the best short program that explains your data and then use that to make predictions. You can prove that it's impossible to do better than that. If you think about machine learning, you need to think about concept classes, what you are looking for given the data, and if you're looking for the best short program, it's impossible to generalize better than that. It can be proved, and the proof is not even that complicated. The intuition of it, basically, is that any regularity that can possibly exist is expressible as a short program; if you have some piece of data which cannot be compressed via a slightly shorter program, then that piece of data is totally random. So you can take my word that it therefore follows that short programs are the best possible way to generalize, if only we could use them. The problem is that it is impossible to find the best short program describing your data, at least given today's knowledge: the computational problem of finding the best short program is intractable in practice and undecidable in theory. So, no short programs for us.
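To make the claim concrete, here is the standard formalization this argument leans on (a sketch of well-known definitions, not something stated explicitly in the talk): the length of the best short program is the Kolmogorov complexity, and the associated prediction scheme is Solomonoff induction.

```latex
% Standard definitions behind the "best short program" claim.
% U is a fixed universal computer; p ranges over its programs.
\[
  K(x) \;=\; \min\{\, \lvert p \rvert \;:\; U(p) = x \,\}
  \qquad \text{(Kolmogorov complexity of the data } x\text{)}
\]
\[
  M(x) \;=\; \sum_{p \,:\, U(p) \,=\, x} 2^{-\lvert p \rvert}
  \qquad \text{(Solomonoff prior: shorter programs dominate)}
\]
% A string x is incompressible ("totally random") when K(x) >= |x|.
% Predicting with M is the unimprovable generalizer he describes, but
% K and M are uncomputable, hence "no short programs for us".
```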
But what about small circuits? Small circuits are the next best thing after short programs, because a small circuit can also perform non-obvious computation. If you have a really deep, really wide circuit, maybe many thousands of layers deep and many millions of neurons wide, you can run lots of different algorithms on the inside, so it comes close to short programs. And, extremely fortunately, the problem of finding the best small circuit given the data is solvable with backprop. So basically, what it boils down to is that we can find the best small circuit that explains the data. Small circuits are kind of like programs, but not really; they are a little bit worse. It's like finding the best parallel program that runs for 100 steps or less, or 50 steps, that solves your problem. And that's where the generalization comes from.
Now, we don't know exactly why backpropagation is successful at finding the best small circuit given your data. It's a mystery, and it's a very fortunate mystery: it powers all the progress that's been made in artificial intelligence over the past six years. So I think there is an element of luck here; we are lucky that it works. One useful analogy that I like to make when thinking about generalization is that learning models that in some ways have greater computational power generalize better. You could make the case that the deeper your neural network is, the closer it comes to the ultimate best short programs, and so the better it will generalize. That tries to touch on the question of where generalization comes from. I think the full answer is going to be unknown for quite some time, because it also has to do with the specific data that we happen to want to solve; it is very nice indeed that the problems we want to solve happen to be solvable with these classes of models. One other statement I want to make is that I think the backpropagation algorithm is going to stay with us until the very end, because the problem it solves is so fundamental: given data, find the best small circuit that fits it. It seems unlikely that we will not want to solve this problem in the future, and so for this reason I feel like backprop is really important.
Now I want to spend a little bit of time talking about reinforcement learning. Reinforcement learning is a framework for describing the behavior of agents. You've got an agent which takes actions, interacts with an environment, and receives rewards when it succeeds. It's pretty clear that it's a very general framework, but the thing that makes reinforcement learning interesting is that there exist useful algorithms in reinforcement learning. In other words, the algorithms of reinforcement learning make the framework interesting: even though these algorithms still have a lot of room for improvement, they can already succeed in lots of non-obvious tasks. And so it's worth pushing on these algorithms; if you make really good reinforcement learning algorithms, perhaps you'll build very clever agents. The way the reinforcement learning problem is formulated is as follows: you have some policy class, where a policy is just some function which takes inputs and produces actions. For any given policy you can run it and figure out its performance, its cost, and your goal is just to find the best policy, the one that minimizes cost or, equivalently, maximizes reward. Now, one way in which this formulation is different from reality is that in reality the agents generate the rewards for themselves, and the only true cost function that exists is survival.
If you want to build any reinforcement learning algorithm at all, you need to represent the policy somehow. How are you going to represent anything? The answer is always: using a neural network. The neural network is going to take the observations and produce actions. Then, for a given setting of the parameters, you can calculate how good they are, and you can figure out how to change these parameters to improve the model. If you change the parameters of the model many times and make many small improvements, then you may make a big improvement, and very often in practice the improvement ends up being big enough to solve the problem.
So I want to talk a little bit about how reinforcement learning algorithms work, the modern ones, the model-free ones, the ones that everyone uses today. You take your policy and you add a little bit of randomness to your actions somehow, so you deviate from your usual behavior, and then you simply check if the resulting cost was better than expected; if it is, you make it more likely. By the way, I'm actually curious how many people are familiar with the basics; please raise your hand. Okay, so the audience here is informed, so I can skip through the introductory parts. (Don't skip too much.) All right, I'll skip only a little bit. But the point is: you do something randomly, you see if it's better than usual, and if it is, do more of that, and repeat this many times. In reinforcement learning there are two classes of algorithms. One of them is called policy gradients, which is basically what I just described, and there is a beautiful formula which says that if you just take the derivative of your cost function and do a little bit of math, you get something which is exactly as described: you take some random actions with a little bit of randomness, and if the result is better than expected, then you increase the probability of taking these actions in the future.
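The "beautiful formula" he refers to is presumably the standard policy gradient identity, $\nabla_\theta \mathbb{E}[R] = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s)\,(R - b)\right]$. Below is a minimal REINFORCE-style sketch of that recipe in NumPy. It is an illustration under assumed interfaces (a linear softmax policy, episodes as (observation, action, reward) triples), not code from the talk.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, baseline=0.0, lr=0.01):
    """One policy-gradient step. theta: (obs_dim, n_actions) weights of a
    linear softmax policy; episode: list of (obs, action, reward) with
    obs as a NumPy vector."""
    rewards = [r for _, _, r in episode]
    returns = np.cumsum(rewards[::-1])[::-1]      # reward-to-go for each step
    grad = np.zeros_like(theta)
    for (obs, action, _), ret in zip(episode, returns):
        probs = softmax(obs @ theta)
        dlog = -probs                             # grad of log pi(a|s) w.r.t. logits
        dlog[action] += 1.0
        grad += np.outer(obs, dlog) * (ret - baseline)
    # Actions that did better than the baseline become more likely.
    return theta + lr * grad
```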
Then there is also the Q-learning algorithm, which is a little bit less stable and a little bit more sample efficient. I won't explain in too much detail how it works, but it has the property that it is off-policy, which means that it can learn not just from its own actions. Let me explain what that means: on-policy means that you can only learn if you are the one taking the actions, while off-policy means that you can learn from anyone's actions, they don't have to be your own. So it seems like a more useful thing, although it's interesting that the stable algorithms tend to be the policy-gradient-based, on-policy ones, while Q-learning, which is off-policy, is also less stable, at least as of today; things change quickly.
Now I'll spend a little bit of time illustrating how Q-learning works, even though it may be familiar to many people. You have this Q function, which tries to estimate, for a given state and a given action, how good or bad the future is going to be. And you have a trajectory of states, because your agent is taking many actions in the world; it's relentlessly pursuing a goal. Well, the Q function has this recursive property, where $Q(s, a)$ is basically just the reward you got plus $Q(s', a')$ at the next state. You can use this recursivity to estimate the Q function, and that gives you the Q-learning algorithm. As for why it's off-policy, for the purposes of this presentation, just take my word for it.
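A minimal tabular sketch of the update he is describing (the standard form takes a max over next actions; the table layout and hyperparameters are illustrative assumptions):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step on a table Q[state, action]. The transition
    (s, a, r, s_next) may come from any behavior policy, which is
    exactly what makes the algorithm off-policy."""
    target = r + gamma * np.max(Q[s_next])   # recursive Bellman target
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward it
    return Q
```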
Now, what's the potential here? Why is this exciting? The reinforcement learning algorithms that we have right now are very sample inefficient, and they're really bad at exploration, although progress is being made. But you can see that if you had a really great reinforcement learning algorithm, one that was really data efficient, explored really well, and made really good use of lots of sources of information, then we'd be in good shape in terms of building intelligent agents. But we still have work to do; they are still far too data inefficient.
Now I want to talk a little bit about meta-learning, which will be an important part of this talk, and I want to explain what it is. There is the abstract dream of meta-learning: the idea that you can learn to learn, kind of in the same way in which biological evolution has learned the learning algorithm of the brain. In spirit, the way you'd approach this problem is by training a system not on one task but on many tasks, and if you do that, then suddenly you've trained your system to solve new tasks really quickly. That would be a nice thing if you could do it: it would be great if you could learn to learn; we wouldn't need to design the algorithms ourselves, we'd use the learning algorithm that we have right now to do the rest of the thinking for us. We're not quite there yet, but meta-learning has had a fair bit of success, and I just want to explain the most common way of doing meta-learning. The most common way is also the most attractive one, where you basically say that you want to reduce the problem of meta-learning to traditional deep learning: you take your familiar supervised learning framework and you replace each data point with a task from your training set of tasks. All these algorithms have the same kind of high-level shape: you have a model which receives information about the task plus a task instance, and it needs to make the prediction. It's pretty easy to see that if you do that, then you will train a model which can receive a new description of a task and make good predictions there.
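A sketch of that recipe as a training loop (hedged: the model, task, and optimizer interfaces here are invented placeholders; the talk describes only the shape of the algorithm):

```python
def meta_train(model, sample_task, optimizer, n_steps=100_000, k_shot=5):
    """Supervised learning where each 'data point' is a task: the model
    conditions on a few examples of the task, then predicts a fresh
    instance of that same task."""
    for _ in range(n_steps):
        task = sample_task()                    # draw a task, not a data point
        support = task.sample_examples(k_shot)  # information describing the task
        x, y = task.sample_example()            # a new instance of the task
        prediction = model(support, x)          # condition on task, predict instance
        loss = model.loss(prediction, y)
        optimizer.step(loss)                    # ordinary deep learning from here on
    return model
```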
There have been some pretty compelling success stories, and I'll mention some of them. A lot of meta-learning work was done at Berkeley as well, but I'll mention some of the visual ones, the early ones that I think are notable. You see this task right here: I took this figure from a paper by Brenden Lake et al., though I think the dataset came earlier, so this is the right citation. One of the ways in which neural nets were criticized is that they can't learn quickly, which is kind of true. A team in Josh Tenenbaum's lab developed this dataset, which has a very large number of different characters and a very small number of examples for each character, specifically as a challenge for neural networks. It turns out that the simple meta-learning approach, where you just say I want to train a neural network that can learn to recognize any character really quickly, works super well, and it's been able to get superhuman performance. As far as I know, the best performance is achieved by Mishra et al., and I believe it's work done with Pieter Abbeel. It's basically superhuman, and it's just a neural net.
So meta-learning sometimes works really well. There is also a very different take on meta-learning, which is a lot closer to the approach of: instead of learning the parameters of a big model, let's learn something compact and small, like the architecture or even the algorithm, which is what evolution did. Here you just say: why don't you search in architecture space and find the best architecture? This is also a form of meta-learning, and it also generalizes really well, because if you learn an architecture on a small image dataset, it will work really well on a large image dataset as well. The reason it generalizes well is that the amount of information in an architecture is small. This is work from Google by Zoph and Le. So meta-learning works sometimes; there are signs of life. The promise is very strong, it's just so compelling: just set everything up right, and then your existing learning algorithm learns the learning algorithm of the future. That would be nice.
Now I want to dive into a detailed description of one algorithm we've done. It's called Hindsight Experience Replay, and it's been a large collaboration with many people, driven primarily by Andrychowicz. This is not exactly meta-learning; it's almost meta-learning. The way to think about what this algorithm does is that you try to solve a hard problem by making it harder, and as a result it becomes easier: you frame one problem in the context of many problems, you have very many problems that you're willing to solve simultaneously, and that makes it easy. The problem here is basically one of exploration: in reinforcement learning we need to take the right action; if you don't take the right action, you don't learn; if you don't get rewards, how can you improve? All your effort that doesn't lead to reward will be wasted. It would be nice if you didn't have that.
So, if our rewards are sparse, and if we try to achieve our goal and fail, the model doesn't learn. How do we fix that? It's a really simple idea, and it's super intuitive. You have the starting point, and you try to reach state A, but you reach state B instead. So, can we learn something from this? Well, we have a trajectory of how to reach state B, so maybe we can use this flawed attempt at reaching A as an opportunity to learn to reach state B. This is directionally very correct; it means that you don't waste experience. But you need an off-policy algorithm in order to learn it, and that's why I emphasized the off-policy stuff earlier: your policy tries to reach A, but you're going to use this data to teach a different policy, the one which reaches B. So you have this big parameterized function, and you just simply tell it which state you reached. It's super straightforward, it's intuitive, and it works really well too. That's Hindsight Experience Replay. I'm going to show you the video; it's pretty cool.
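A minimal sketch of the relabeling trick he just described (the buffer format and the sparse binary reward are assumptions for illustration, not the paper's exact code; discrete, comparable states are assumed for simplicity):

```python
def hindsight_relabel(episode):
    """episode: list of (obs, action, achieved_state, desired_goal).
    The attempt aimed at goal A but ended in state B; relabel the whole
    trajectory as if B had been the goal, so no experience is wasted."""
    reached = episode[-1][2]                  # state B, actually reached
    relabeled = []
    for obs, action, achieved, _ in episode:
        reward = 1.0 if achieved == reached else 0.0   # sparse binary reward
        relabeled.append((obs, action, reached, reward))
    return relabeled   # train any off-policy, goal-conditioned learner on this
```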
In this case the reward is very sparse and binary. I should just say: because the reward is sparse and binary, this makes it very hard for traditional reinforcement learning algorithms, because you never get to see the reward. If you were to shape your reward, perhaps you could solve these problems a little bit better, although the people who were working on this tried it and still found it difficult. But this algorithm just works on these cool tasks, and the videos look cool, so let's keep watching. You get these very nice, confident-looking movements from the Hindsight Experience Replay algorithm. And it just makes sense: any time something happens, we want to learn from it, and so we want this to be the basis of all future algorithms. Now again, this is in the absolutely sparse, binary reward setting, which means that the standard reinforcement learning algorithms are very disadvantaged. But even if you try to shape a reward, one thing you discover is that shaping rewards is sometimes easy but sometimes quite challenging. And here is the same thing working on real physical blocks.
Okay, so this basically sums up the Hindsight Experience Replay results. (Can you tell us what acronym is represented by HER? Hindsight Experience Replay.) One of the limitations of all these results is that the state is very low dimensional. If you have a general environment with very high dimensional inputs and very long histories, you've got a question of how you represent your goals, and what that means is that representation learning is going to be very important. Unsupervised learning probably doesn't work yet, but I think it's pretty close, and we should keep thinking about how to really fuse unsupervised learning with reinforcement learning. I think this is a fruitful area for the future.
Now I want to talk about a different project, on doing transfer from simulation to the real world with meta-learning. This work is by Peng et al., and multiple people who did this work are from Berkeley; unfortunately I don't have the full list here. It would be nice if you could train your robots in simulation and then deploy them on physical robots. Simulation is easy to work with, but it's also very clear that you can't simulate most things. So, can anything be done here? I just want to explain one very simple idea of how you could do that. The answer is basically that you train a policy that doesn't just solve the task in one simulated setting, but solves the task in a family of simulated settings. What does that mean? You say, okay, I'm going to randomize the friction coefficient, and gravity, and pretty much anything you can think of: the lengths of the robot's limbs, their masses, the frictions, the sizes. And your policy isn't told what you've done; it needs to figure it out by interacting with the environment. Well, if you do that, then you'll develop a robust policy that's pretty good at figuring out what's going on, at least in the simulations. And if this is done, then the resulting system will be much more likely to generalize its knowledge from the simulation to the real world. This is an instance of meta-learning, because in effect you're learning a policy which is very quick at identifying the precise physics it is in. I would say calling it meta-learning is a bit of a stretch; it's more of a kind of robust adaptive dynamics thing, but it also has a meta-learning feel to it.
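A sketch of the randomization idea (parameter names and ranges are invented for illustration; `make_sim` is a hypothetical simulator constructor):

```python
import random

def sample_randomized_sim(make_sim):
    """Each training episode gets different physics. The policy is never
    told which parameters were drawn, so it must identify the dynamics
    by interacting with the environment."""
    return make_sim(
        friction=random.uniform(0.5, 1.5),
        gravity=random.uniform(8.0, 12.0),
        limb_mass_scale=random.uniform(0.7, 1.3),
        limb_length_scale=random.uniform(0.9, 1.1),
    )
```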
I want to show this video of the baseline: this is what happens when you don't do this robustification of the policy. You try to get the hockey puck onto the red dot, and it just fails really dramatically; it doesn't look very good. And if you add these robustifications, then the result is a lot better: even when it pushes the puck around and overshoots, it's just no problem. So it looks pretty good. I think this toy example illustrates that the approach of training a policy in simulation, making sure the policy doesn't solve just one instance of the simulation but many different instances of it and figures out which one it is in, can succeed in generalizing to the real physical robot.
So that's encouraging. Now I want to talk about another project, by Frans et al., and it's about doing hierarchical reinforcement learning. Hierarchical reinforcement learning is one of those ideas that would be nice if we could get it to work, because one of the problems with reinforcement learning as it's currently done today is that you have very long horizons, which you have trouble dealing with; exploration is not very directed, so it's not as fast as you would like; and credit assignment is challenging as well. So we can do a very simple meta-learning approach, where you basically say that you want to learn low-level actions which make learning fast. You have a distribution over tasks, and you want to find a set of low-level policies such that, if you use them inside the reinforcement learning algorithm, you learn as quickly as possible. If you do that, you can learn pretty sensible locomotion strategies that go in a persistent direction. So here it is: we've got three low-level policies, and the system has learned to find the policies that will solve problems like this, for a specific distribution over this kind of problem, as quickly as possible. So that's pretty nice.
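A sketch of the hierarchy at execution time (a generic structure, assumed for illustration; the actual work learns the sub-policies so that a fresh high-level learner solves new tasks quickly):

```python
def run_hierarchy(high_policy, low_policies, env, switch_every=10, horizon=1000):
    """A high-level policy picks one of a few shared low-level policies
    every `switch_every` steps; the chosen low-level policy emits the
    actual actions. env.step is assumed to return (obs, reward, done)."""
    obs = env.reset()
    total_reward, choice = 0.0, 0
    for t in range(horizon):
        if t % switch_every == 0:
            choice = high_policy(obs)          # e.g. pick a locomotion direction
        obs, reward, done = env.step(low_policies[choice](obs))
        total_reward += reward
        if done:
            break
    return total_reward
```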
Now, one thing I want to mention here is one important limitation of high-capacity meta-learning. There are two ways to do meta-learning: one is by learning a big neural network that can quickly solve problems in a distribution of tasks, and the other is by learning an architecture or an algorithm, so you learn a small object. If you learn an architecture or an algorithm in a meta-learning setting, it will likely generalize to many other tasks. But this is not the case, or at least it is much less the case, for high-capacity meta-learning, where you for example train a very large recurrent neural network that solves many tasks: it will be very committed to the distribution of tasks that you've trained it on, and if you give it a task that's meaningfully outside of the distribution, it will not succeed. The kind of example I have in mind is: let's say you take your system and you train it to do a little bit of math, and you teach it a little bit of programming, and you teach it how to read. Could it do chemistry? Well, not according to this paradigm, at least not obviously, because it really needs the task to come from the same distribution at training and at test time. So I think for this to work, we will need to improve the generalization of our algorithms further.
Now I want to finish by talking about self-play. Self-play is a really cool topic; it's been around for a long time, and I think it's really interesting, intriguing, and mysterious. I want to start by talking about the very earliest work on self-play that I know of, and that's TD-Gammon. It was done back in 1992, by Tesauro, single-author work. In this work, they used Q-learning with self-play to train a neural network that beats the world champion in backgammon. This may sound familiar in 2017 and 2018, but that's in 1992, back when your CPUs were, I don't know, 33 MHz or something. If you look at this plot, it shows the performance as a function of time with different numbers of hidden neurons: you have 10 hidden units, that's the red curve, and 20 hidden units is the green curve, all the way to the purple curve. Basically, nothing has changed in 25 years, just the number of zeros and the number of hidden units. In fact, they even discovered unconventional strategies that surprised experts in backgammon. It's just amazing that this work was done so long ago and was looking forward into the future so much. This approach basically remained dormant; people were trying it out a little bit, but it really was revived by the Atari results of DeepMind.
mind
and you know we've also had very
compelling self-play results in Alpha go
zero where they could train a very
strong go player from no knowledge at
all to beating all humans
same is true about our Dota 2 results it
again started from zero and just did
lots and lots of self play and I want to
talk a little bit about why I think self
play is really
exciting because you get things like
this like you
can self-play makes it possible to
create very simple
environments that support potentially
unbounded
complexity Unbound
Ed sophistication in your agents
unbounded scheming in social
skills
and it seems relevant towards building
for building intelligent agents and
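A bare-bones sketch of the self-play recipe behind TD-Gammon, AlphaGo Zero, and the Dota bot (agent and environment interfaces are assumptions; the pool of past opponents reflects a stabilization trick mentioned later in the Q&A):

```python
import copy
import random

def self_play_train(agent, env, n_iters=100_000, pool_size=10):
    """Train an agent purely against versions of itself, so the opponent
    is always evenly matched and the curriculum is built in."""
    opponents = [copy.deepcopy(agent)]
    for i in range(n_iters):
        opponent = random.choice(opponents)   # a whole family, not one opponent
        outcome = env.play(agent, opponent)   # e.g. +1 if agent wins, -1 otherwise
        agent.update(outcome)                 # any RL update can go here
        if i % 1000 == 0:
            opponents.append(copy.deepcopy(agent))
            opponents = opponents[-pool_size:]
    return agent
```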
There is work on artificial life by Karl Sims from '94, and you can see that already there it looks very familiar: you see these little evolved creatures, whose morphologies are evolved as well, and here they are competing for possession of a little green cube. Again, this was done in 1994 on tiny computers, and just like other promising ideas that we are familiar with, it didn't have enough compute to really be pushed forward. But I think this is the kind of thing that we could get with large-scale self-play, and I want to show some work we've done trying to revive this concept a little bit. I'm going to show this video; this was work by Bansal et al., a productive summer internship. There is a bit of music here; let me turn it off. Actually, maybe I can keep it on. No, I can't. But what's the point?
You've got this super simple environment, which in this case is just a sumo ring, and you just tell the agents: you get a plus one when the other agent gets outside the ring. The reason I personally like this is that these things look alive; they have this breadth of complicated behaviors that they learn just in order to stay in the game, and you can kind of see what happens if you let your imagination run wild. This self-play is not symmetric, and these humanoids are a bit unnatural, because they don't feel pain, they don't get tired, and they don't have a whole lot of energy constraints. Oh, it blocked it, that was good; that's pretty good too. Here, you can guess what the goal is. That was a nice dodge.
Now, one of the things that would be nice is if you could take these self-play environments, train your agents to do some kind of task through self-play, and then take the agent outside and get it to do something useful for us. I think if that were possible, it would be amazing. Here there is the tiniest of tests: we take the sumo wrestling agent and we put it, isolated and alone, inside the ring; it doesn't have a friend, and we just apply big forces on it and see if it can balance itself. And of course it can balance itself, because it's been trained against an opponent that tried to push it, so it's really good at resisting force in general. The mental image here is: imagine you take a ninja and then you ask it to learn to become a chef; because the ninja is already so dexterous, it should have a fairly easy time becoming a very good cook. That's the kind of high-level idea here; it hasn't happened yet.
But I think one of the key questions in this line of work is: how can you set up a type of self-play environment which, once you succeed, can solve useful tasks for us, tasks which are different from the environment itself? That's the big difference from games: in games, the goal is to actually win in the environment, but that's not what we want; we want the agent to just be generally good at being clever and then solve problems for us, a "do my homework" type of agent. I want to show one slide which I think is interesting. I would like to ask you to let your imaginations run wild and imagine
that the hardware designers of neural nets have built enormous, giant computers, and this self-play has been scaled up massively. One thing that's notable, that we know about biological evolution, is that social species tend to have larger brains; they tend to be smarter. It is very often the case that whenever you have two species which are related, but one is social and one isn't, the social one tends to be smarter. We know that human biological evolution really accelerated over the past few million years, probably because at that point, well, this is a bit speculative, but the theory here, my theory at least, is that humans became sufficiently competent with respect to their environment: they stopped being afraid of the lion, and the biggest concern became the other humans, what the other humans think of you, what they are gossiping about you, where you stand in the pecking order. So I think this kind of environment created an incentive for the large brains. And, as is often the case in science, it's very easy to find some scientific support for your hypothesis, which we did: there exists a paper in Science which supports the claim that social environments stimulate the development of larger, cleverer brains. The specific evidence they present is the convergent evolution of smart social apes and smart birds like crows, which apparently have similar cognitive functions even though they have very different brain structures. Now, I'm only 75% confident in this claim, but I'm pretty sure that birds don't have the same kind of cortex as we do, because the evolutionary split occurred a long time back in the past.
I think this is intriguing at the very least. You could create a society of agents and just keep scaling it up, and perhaps you're going to get agents that are going to be smart. Now I want to finish with one observation about environments that are trained with self-play. This is a plot of the strength of our Dota bot as a function of time, going from April all the way to August. Basically, you just fix the bugs, you scale up your self-play environment, and you scale up the amount of compute, and you get a very rapid increase in the strength of the system. It makes sense: in self-play environments, compute is the data, so you can generate more of it. So I want to finish with the provocative question, which is: if you have a sufficiently open-ended self-play environment, will you get an extremely rapid increase in the cognitive ability of your agents, all the way to superhuman? On this note, I will finish the presentation. Thank you so much for your attention.
attention yeah before before before I
before I start the question answering um
session I want to say that one one
important thing I want to say is that
many of these Works were done in
collaboration with many people from
Berkeley and especially Peter ril and I
want to I want to highlight that okay
great uh I wonder if you can show the
last slide cuz you it seemed like it was
a very important conclusion but you went
over it very quickly yeah so this is a
very this is a it's
a it is a bit
speculative and it really is a question
of the specific statement here is that
if you
believe that you going to get truly
smart human level agents as a result of
some kind of massive scale selfplay
will you also experience the same kind
of Rapid increase in the capability of
the agent that you see that that we we
saw in our experience with DOTA and in
general because you can convert Compu
into Data so you put more compute this
thing gets better
yeah so I mean that's that's sort of a
general remark obviously you do you
compute more you get you get better
results but I didn't quite grasp the um
uh the difference between these two
panels well
So it really boils down to this: it's a question of where the limits to progress in the field and in capabilities come from. In other words, given the right algorithms, which currently don't yet exist, once you have them, how will the increase in the actual capability of the system look? I think there is definitely a possibility that it will be like the right-hand side: once you figure out your hierarchical reinforcement learning, you figure out concept learning, your supervised learning is in good shape, and then the massive neural net hardware arrives and you have a huge neural net, much bigger than the human brain, how will the plot look over time?

So you're projecting that we've only seen the very beginning. Okay, so let's throw it open to questions, and I see you already have your hand up.

Thank you for that. You mentioned hierarchy, and I'm wondering if you have an example of hierarchical self-play that would increase the slope of this curve. So, we have not tried hierarchical self-play. This is more a statement from our experience with our Dota bot, where you start out basically losing to everyone, and then your TrueSkill metric, which is like an Elo rating, just increases pretty much linearly all the way to the best humans. I think it could be a general property of self-play systems. Which game was this? Dota, yeah. Okay, more questions.
Hey Ilya, very nice talk. I had a question on environments: do you have any thoughts on going beyond, like, sumo wrestling environments? What are good environments to study?

Well, there is the question of what makes a good environment. I think there are two ways of getting good environments. One of them is from trying to solve problems that we care about, which naturally generate environments. Another one is to think of open-ended environments where you can build a lot. One of the slightly unsatisfying features of most of the environments we have today is that they are a little bit not open-ended: you've got a very narrow domain, and you want to perform a task in this narrow domain. But some environments which are very interesting to think about are ones where there is no limit to the depth of the environment, and some examples include programming, math, even Minecraft. In Minecraft you can build structures of greater and greater complexity: at first people build little homes, then they build big castles, and now you can find people building entire cities and even computers inside Minecraft. Now, obviously Minecraft has an obvious challenge, which is: what do we want the agents to do there? That needs to be addressed, but directionally these would be nice environments to think about more.
Okay, someone up here. This is sort of similar to that last question, but I was wondering what the effect of complicated non-agent objects and non-agent entities in the environment is on how well self-play works. For instance, in the sumo environment, the reason the self-play agents can become very complex and use very complex strategies is that that's necessary in order to compete against the other agent, which is also using very complex strategies. If instead you were working not against another agent but against a very simple agent that doesn't train, but through some very complicated system you had to operate a lot of machines in the environment or something like that, how does that affect the effectiveness of this?

I think it depends a little bit on the specifics. For sure, if you have a complicated environment, or a complicated problem that was produced somehow, then you will also need to develop a pretty competent agent. I think the thing that's interesting about the self-play approach is that you generate the challenge yourself, so the question of where the challenge comes from is answered for you.

(There's a mic problem. It doesn't seem to be muted; let me check. Okay, anyway, let's continue.) Any more questions? Okay, oh boy, we have quite a few.
Going back a bit to Hindsight Experience Replay: you give the example of trying to reach the red spot A and instead reaching some spot B, and you're going to use that to train. I was wondering if you could elaborate on that a little bit more. I'm not very familiar with DDPG, so perhaps that's critical to understanding this, but what I'm wondering is: how do you turn every experience into, you know, "hitting the ball this way translates into this motion", without doing it in a reward-based way?

So basically, you have a policy which is parameterized by a goal state, so in effect you have a family of policies, one for every possible goal. Then you say: okay, I'm going to run the policy that tries to reach state A, and it reached state B instead. So I'm going to say, well, this is great training data for the policy which reaches state B. That's how you do it, in effect. If you want more details, we can talk about it offline.
Okay, two questions. The first is a very simple question about HER again. If a task is difficult, for example hitting a fastball in baseball, even the best humans can do it, you know, 38% of the time or something like that. So the danger is that if you miss, you're going to say, oh, I was trying to miss, so now I take this as a training example of how to miss, which is not right: you're actually taking the optimal action, but your perceptual apparatus just can't track the ball fast enough, so that's the best you can do. It seems like you would run into trouble on tasks like that.

Should I answer the first question before you ask the second? Let's do that. So, the method is still not absolutely perfect, but on the question of what happens when you miss while trying to actually succeed: yes, we get a lot of data on how to not reach the state. You're trying to reach a certain desired state which is hard to reach; you try to do that and reach a different state, so you say, okay, I will train my system to reach this state, but next time I'm going to say I still want the original one. What it means is that, for that specific problem, this approach will be less beneficial than for problems where the tasks are a little bit more continuous, where you can have more of a hill-climbing effect and you gradually improve. Let's say, in the context of programming, you learn to write simple programs, you learn to write different subroutines, and you gradually increase your competence, the set of states you know how to reach. So I agree that when there is a very narrow state which is very hard to reach, it will not help; but whenever there is a kind of continuity to the states, this approach will help.
Okay, so the second question is about self-play. When I saw your title, what I thought you were going to say was this: if you think about AlphaGo, if we tried to train AlphaGo by playing it against the existing world champion, it would never win a single game for the first 50 million games, so it would learn nothing at all. But because we play it against itself, it always has a 50% chance of winning, so you're always going to get a gradient signal no matter how poorly you play. Yeah, that's very important. Now, the question is: is there some magic trick there that you can then apply to tasks that are intrinsically difficult to get any reward signal on? If you take spider solitaire, for example: if you watch an ordinary human play spider solitaire, they lose the first 100 games and then they give up; they say, this is impossible, I hate this game. There's no reward signal there, because you're just not good enough to ever win. So, is there a way you can convert spider solitaire into a two-player game and somehow guarantee that you always get a gradient signal for that game?
for that game so that's a very good
question that's a very good what what
you said is a very good point I just
want to before before I um elaborate on
your question I just want to
also talk about the fact that one of the
key things of self plays that you always
have an ni will evenly match the point
and what it means that you also have
potentially an indefinite incentive for
improvement
like even if you are really really
competent if you have a super competent
agent the opponent will be just as
competent and so if Done Right the
system will be incentivized to improve
and
so I think yeah I I think it's it's an
important thing to emphasize and that's
also by the way why the exploration
problem is much easier because you
explore the strategy space together with
your opponent and it's actually
important not to have just one opponent
but to have a whole little family of
them for
stability but that's that that's that's
basically crucial now on your second
question of what to do when you just
can't get the reward so very often if
the problem is hard enough I think there
isn't much you can do without having
some kind of deep domain you know side
information about the task but one
approach that is popular and it's been
pursued by multiple uh groups is to use
like asymmetric selfplay for exploration
you've got a predictor which Tri to
predict what's going going to
happen and you've got a policy which
tries to take action which surprise the
predictor so the predictor is going to
say okay well if you're going to I I
basically have opinions about what will
be the consequences of the different
actions and the actor tries to find
regions of space which surprise the
predictor so you have this kind of a
self plates not exactly self plates more
of a kind of a competitive adversarial
scenario where the agent is incentivized
to
cover the entire space it doesn't answer
the question of how to solve a hard task
like so like spider soliter because if
if you actually need to be super good I
think I think that's tough but at least
you can see how this can give you a
general guide of how to move forward in
general I think we had a question back
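A sketch of that predictor-versus-actor setup (all interfaces are assumptions; observations are taken to be NumPy arrays, and the actor would maximize this surprise signal as its reward):

```python
import numpy as np

def curiosity_reward(predictor, obs, action, next_obs):
    """Asymmetric-self-play-style exploration bonus: the actor is paid
    for outcomes the predictor gets wrong; the predictor then updates,
    so the actor must keep finding new regions of the space."""
    predicted = predictor.predict(obs, action)   # predictor's guessed outcome
    surprise = float(np.mean((predicted - next_obs) ** 2))
    predictor.update(obs, action, next_obs)      # predictor learns, raising the bar
    return surprise
```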
I think we had a question back here. What do you think is exciting in terms of new architectures? For example, they've been adding memory structures to neural nets, like the DNC paper. What do you see as the role of new architectures in actually achieving what we want for generalization and meta-learning?

I think this is a very good question, the question of architectures, and I'd say that it's very rare to find a genuinely good new architecture; genuine innovation in architecture space is uncommon. I'd say the biggest innovation in architecture space over the past many years has been soft attention. Soft attention is legitimately a major advance in architectures, but it's also very hard to innovate in architecture space, because the basic architecture is so good. I think that better generalization, and this is my opinion, it's not backed by data yet, will not be achieved by means of just improving the architecture, but by means of changing the learning algorithm and possibly even the paradigm of the way we think about our models. I think things like minimum description length and compression will be a lot more popular. These are non-obvious questions, but basically I think architecture is important whenever you can actually find good new architectures for the hard problems.
How about curriculum learning? To learn to hit a fastball, start with a slow ball. Yeah, for sure: curriculum learning is a very important idea; it's how humans learn. And it's, I guess, a pleasant surprise that our neural networks also benefit from curriculums. One nice thing about self-play is that the curriculum is built in; it's intrinsic. What you lose in self-play is the ability to direct the self-play to a specific point.

So, I have a question. You showed us the nice videos, the wrestlers and the robots and so forth, and I assume it's similar to deep learning in the sense that there's a framework of linear algebra underlying the whole thing. So is there anything there other than linear algebra, I mean, other than the neural net? Well, you just take two agents and you apply reinforcement learning algorithms, and a reinforcement learning algorithm is a neural net with a slightly different way of updating the parameters. So it's all matrix multiplication all the way down. Yeah, you just want to multiply big matrices as fast as possible.
Right. Okay, we have one more. You mentioned something about transfer learning and the importance of that. What do you think about concept extraction and transferring that, and is that something you think is possible, or that people are doing right now?

I think it really depends on what exactly you mean by concept extraction. I think it's definitely the case that our transfer learning abilities are still rudimentary, and we don't yet have methods that can extract seriously high-level concepts from one domain and then apply them in another domain. I think there are ideas on how to approach that, but nothing that's really convincing on a task that matters, not yet.

Well, we really had a lot of questions, and the reason is that you gave very short, succinct answers, for which we are very grateful. Thank you very much; let's give a great hand. Thank you.