Ilya Sutskever: OpenAI Meta-Learning and Self-Play | MIT Artificial General Intelligence (AGI)
Summary
TLDR: In this talk, Ilya Sutskever surveys recent progress in deep learning and artificial intelligence, including meta-learning, reinforcement learning, and self-play. He emphasizes why deep learning matters and explains how it works, in particular how we are able to find good neural networks. He also discusses the exploration problem in reinforcement learning and how to learn from failure. Sutskever argues that self-play can drive rapid increases in agent competence, and looks ahead to possible directions for AI, including the creation of societies of agents and the importance of conveying goals.
Takeaways
- 🤖 Deep learning and meta-learning are making remarkable progress, but many challenges remain.
- 🧠 Deep learning's success rests in part on the idea that the shortest program capturing the regularities in the data generalizes best, even though finding that program is computationally intractable.
- 🔄 Backpropagation is the key algorithm in deep learning, even though it differs from how the brain works.
- 🧠 Neural networks are trained by iteratively making small changes to the underlying network until its predictions satisfy the constraints imposed by the data.
- 🤔 Reinforcement learning is a framework for evaluating an agent's ability to achieve goals in complex, stochastic environments.
- 🎯 Reinforcement learning algorithms aim to maximize expected reward, though in practice the variance of the reward may also need to be considered.
- 🤖 Meta-learning is a promising area, even though it is not yet fully mature.
- 📈 Through simulation and meta-learning, AI can learn in simulated environments and transfer that knowledge to physical robots.
- 🔄 Self-play is an emerging AI research approach that can drive rapid improvement in agents.
- 🌐 Language understanding and generative language models remain a key challenge, with substantial room for improvement.
- 🚀 Future AI development will have a profound impact on society; ensuring that AI goals align with human values is an important political question.
Q & A
What are Ilya Sutskever's major contributions to AI?
-Ilya Sutskever is a co-founder and research director of OpenAI and has had a major influence on deep learning and artificial intelligence. His work from the past five years has been cited more than 46,000 times, and he has been the key creative and driving force behind some of the biggest breakthrough ideas in deep learning and AI.
Why does deep learning work?
-Deep learning works because of a mathematical idea: if you could find the shortest program that performs very well on your data, you would achieve the best possible generalization. In other words, if you can extract all the regularity in the data and encode it into a program, you can make the best possible predictions. Although such a program exists in theory, today's tools and understanding cannot find this best short program, because the problem is computationally intractable.
What is meta-learning, and what are its potential and challenges?
-Meta-learning means training algorithms to learn how to learn. Its promise is systems that can adapt quickly to new tasks, which is very exciting. Its main challenge is that the training and test distributions must match, whereas in the real world new test tasks often differ from the training tasks, so meta-learning can struggle in those situations.
How does reinforcement learning work?
-Reinforcement learning is a framework for evaluating an agent's ability to achieve goals in a complex, stochastic environment. The agent interacts with the environment, tries new behaviors, and adjusts its policy based on the results: if an outcome exceeds its expectations, the agent takes more of those actions in the future.
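As a rough illustration of that try-and-reinforce loop, here is a minimal REINFORCE-style sketch on a two-armed bandit; the environment, reward values, and learning rates are invented for the example and are not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)          # policy parameters for a 2-action bandit
baseline = 0.0                # running estimate of "expected" reward
lr, baseline_lr = 0.1, 0.05

def true_reward(action):
    # hypothetical environment: action 1 pays off more on average
    return rng.normal(1.0 if action == 1 else 0.2, 0.1)

for step in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax policy
    action = rng.choice(2, p=probs)                 # try something (with randomness)
    reward = true_reward(action)
    surprise = reward - baseline                    # did the result exceed expectation?
    grad = -probs
    grad[action] += 1.0                             # d log pi(action) / d logits
    logits += lr * surprise * grad                  # do more of what surprised you positively
    baseline += baseline_lr * (reward - baseline)

print(probs)  # ends up heavily weighted toward action 1
```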
What role does self-play play in AI?
-Self-play is a way of training AI systems by having them compete against and learn from themselves, without external data, improving their performance in the process. It has produced notable successes such as AlphaGo Zero in Go and OpenAI's Dota 2 bot.
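A schematic of a self-play training loop in the spirit described here, where the agent's opponents are frozen snapshots of itself; `Policy`, `play_match`, and the update rule are hypothetical stand-ins, not the actual AlphaGo Zero or Dota 2 training code.

```python
import copy, random

class Policy:
    """Stub policy; in practice this would be a neural network."""
    def __init__(self, skill=0.0):
        self.skill = skill
    def improve(self, won):
        # placeholder for a gradient update from the match outcome
        self.skill += 0.01 if won else 0.001

def play_match(a, b):
    # hypothetical stochastic match: higher skill wins more often (Elo-like)
    return random.random() < 1.0 / (1.0 + 10 ** (b.skill - a.skill))

agent = Policy()
opponent_pool = [copy.deepcopy(agent)]   # past snapshots of the agent itself

for step in range(10_000):
    opponent = random.choice(opponent_pool)
    won = play_match(agent, opponent)
    agent.improve(won)                   # learn from wins and losses alike
    if step % 500 == 0:
        opponent_pool.append(copy.deepcopy(agent))  # the opponent stays about as good as you

print(round(agent.skill, 2), len(opponent_pool))
```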
How can a policy trained in simulation be applied to a physical robot?
-One way is to introduce a large amount of variability into the simulator so that the policy becomes adaptive. When the policy is then deployed in the physical world, it adapts to the new environment's physics through trial and error.
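A toy sketch of that domain-randomization recipe, assuming a made-up one-dimensional "push the puck" simulator; the parameter ranges and the crude adaptive policy are illustrative assumptions only.

```python
import random

def sample_dynamics():
    # hypothetical ranges; the real work randomizes friction, masses, object sizes, etc.
    return {"gain": random.uniform(0.2, 1.2),
            "puck_mass": random.uniform(0.05, 0.30)}

def simulate_episode(policy, dynamics, steps=20):
    """Toy stand-in for a physics simulator: the policy pushes a puck toward x = 1.0."""
    x, history = 0.0, []
    for _ in range(steps):
        force = policy(history)
        x += force * dynamics["gain"] / dynamics["puck_mass"] * 0.01
        history.append((force, x))           # feedback the policy can adapt to
    return abs(x - 1.0)                      # final distance to the goal (lower is better)

def adaptive_policy(history):
    # crude online adaptation: push harder if past pushes moved the puck too little
    if not history:
        return 1.0
    last_force, last_x = history[-1]
    return min(5.0, last_force * (1.5 if last_x < 0.5 else 0.8))

# Evaluate across many randomized "worlds": the policy never sees the dynamics directly,
# so it has to infer them from feedback, which is what lets it cope with unknown real physics.
errors = [simulate_episode(adaptive_policy, sample_dynamics()) for _ in range(1000)]
print(sum(errors) / len(errors))
```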
How should we think about the training process of a deep neural network?
-Training a deep neural network can be viewed as solving a circuit search problem: you iteratively make small adjustments to the network until its predictions satisfy the data. This process is quite profound, because gradient descent pushes the information from the equations into the parameters until all of them are satisfied.
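A minimal worked example of "pushing the information from the equations into the parameters": gradient descent on a set of equations f(x_i, θ) = y_i, here with a tiny linear model and synthetic data chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # inputs x_i
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta                       # targets y_i, so f(x_i, theta) = y_i is satisfiable

theta = np.zeros(3)                      # the degrees of freedom
lr = 0.1
for _ in range(500):
    residual = X @ theta - y             # how far each equation is from being satisfied
    grad = X.T @ residual / len(y)       # gradient of 0.5 * mean squared residual
    theta -= lr * grad                   # push the equations' information into theta

print(theta)                             # converges to [2.0, -1.0, 0.5]
```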
What is the exploration problem in reinforcement learning?
-Exploration means the agent tries new behaviors when it does not yet know how to act. It matters because the agent can only learn by trying things and occasionally receiving reward. Designing reward functions that provide gradual increments is therefore essential, so that the system receives reward, and learns, even when it is not yet performing well.
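A small sketch of the contrast between a sparse reward and a shaped reward for a reach-the-goal task; the distance-based shaping below is one common, illustrative choice rather than a recommendation from the talk.

```python
import math

GOAL = (1.0, 1.0)

def sparse_reward(pos):
    # reward only when the goal is actually reached: almost no learning signal early on
    return 1.0 if math.dist(pos, GOAL) < 0.05 else 0.0

def shaped_reward(pos):
    # gradual increments: getting closer is already worth something
    return -math.dist(pos, GOAL)

for pos in [(0.0, 0.0), (0.5, 0.5), (0.95, 0.98)]:
    print(pos, sparse_reward(pos), round(shaped_reward(pos), 3))
```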
How can we infer other agents' goals and strategies by observing them?
-By observing other agents' behavior, we can infer their goals and strategies. This is one respect in which humans differ enormously in scale and scope from other animals. In non-competitive settings, observing and imitating others' behavior can be an effective learning strategy.
How can we ensure that AI systems' goals align with human expectations?
-Aligning an AI system's goals with human expectations is a technical problem, but also a major political one. On the technical side, it requires algorithms that can understand and carry out goals specified by humans; at the broader societal level, it requires working out what the right goals are and how to ensure that systems act according to them.
Outlines
🤖 Breakthroughs in AI and deep learning
This section introduces Ilya Sutskever, co-founder and research director of OpenAI. His work has had an enormous impact on deep learning and AI, with his papers from the past five years cited more than 46,000 times. He is regarded as the key creative force behind several major breakthroughs in deep learning.
🧠 How and why deep learning works
This part discusses why deep learning works at all. It cites the mathematical idea that if you could find the shortest program that accounts for the data, you would achieve the best generalization. Finding such a program is computationally intractable; nevertheless, by using backpropagation on small circuits we can find solutions that work in practice.
🔄 Reinforcement learning and goal achievement
This part covers reinforcement learning, a framework for evaluating an agent's ability to achieve goals in complex, stochastic environments. Reinforcement learning algorithms work by running the agent many times and computing its average reward. It also discusses a key gap in the framework: in the real world, rewards are not handed out by the environment but must be determined from observations.
🧠 Meta-learning: concept and applications
Meta-learning aims to develop algorithms that learn how to learn. The approach trains a system on many tasks so that it can solve new tasks quickly. One success story is rapid character recognition on a handwritten-character dataset produced at MIT; another is Google's neural architecture search, which finds network architectures on small problems that then solve larger ones. The core "task as training case" construction is sketched below.
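A sketch of that "training task becomes a training case" construction; the toy threshold task and episode sizes are hypothetical, and a real meta-learner would replace the final print with a supervised training step.

```python
import random

class ToyTask:
    """A hypothetical task: decide whether a number is above a task-specific threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def sample_labeled(self):
        x = random.uniform(0, 1)
        return x, int(x > self.threshold)

def make_meta_example(task_pool, n_support=10):
    """Turn one *task* into one training *case* for the meta-learner."""
    task = random.choice(task_pool)
    support = [task.sample_labeled() for _ in range(n_support)]   # goes into the input
    query_x, query_y = task.sample_labeled()                       # what must be predicted
    return {"support": support, "query": query_x}, query_y

train_tasks = [ToyTask(random.uniform(0.2, 0.8)) for _ in range(1000)]
meta_input, meta_target = make_meta_example(train_tasks)
print(len(meta_input["support"]), meta_input["query"], meta_target)
# An ordinary supervised learner trained on many such bundled cases is, in effect,
# being trained to implement a learning algorithm over the support set.
```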
🤹♂️ Self-play and the evolution of agents
Self-play lets agents improve by competing against each other. A classic example is TD-Gammon, which learned through self-play and ultimately beat the world backgammon champion. A key advantage of self-play is that it creates a constantly changing environment that keeps presenting agents with new challenges.
🎮 Reinforcement learning in games
This part discusses reinforcement learning in games, in particular OpenAI's Dota 2 bots. Through self-play, the bots progressed from random play to world-championship level within a few months. That rapid progress suggests self-play is a powerful learning mechanism.
🤖 Conveying goals to AI and the alignment problem
This part explores how to convey goals to agents and how to ensure their behavior matches our expectations. One proposed approach trains agents from human feedback: humans watch pairs of behaviors and pick the better one, and the agent is trained from those choices. Both the technical problem and the political problem of choosing the right goals are hard. A sketch of fitting a reward function to such comparisons follows.
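A sketch of fitting a numerical reward function to pairwise human choices using a logistic (Bradley-Terry-style) preference loss, which matches the described recipe in spirit; the linear features and synthetic "clicks" are placeholder assumptions, not the actual OpenAI pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(trajectory):
    # placeholder: in practice this would be a neural network over observations/actions
    return trajectory

# Synthetic "clicks": the human prefers trajectories with larger hidden score w_true . x
w_true = np.array([1.0, -2.0, 0.5])
pairs = []
for _ in range(500):
    a, b = rng.normal(size=3), rng.normal(size=3)
    preferred, other = (a, b) if a @ w_true > b @ w_true else (b, a)
    pairs.append((preferred, other))

w = np.zeros(3)                       # parameters of the learned reward function
lr = 0.5
for _ in range(200):
    grad = np.zeros(3)
    for preferred, other in pairs:
        # probability the model assigns to the human's choice
        p = 1.0 / (1.0 + np.exp(-(w @ features(preferred) - w @ features(other))))
        grad += (1.0 - p) * (features(preferred) - features(other))
    w += lr * grad / len(pairs)       # maximize the log-likelihood of the clicks

print(w / np.linalg.norm(w))          # roughly parallel to w_true; this reward is then optimized with RL
```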
🌟 The outlook for AI
The final part looks ahead at the future of AI, in particular how agents might learn in software environments and then apply what they learn to real-world tasks. It discusses the potential of self-play environments and how continued training and adaptation to new environments can improve agents' capabilities.
Keywords
💡Deep learning
💡Meta-learning
💡Reinforcement learning
💡Self-play
💡Neural network
💡Backpropagation
💡Policy gradient
💡Q-learning
💡Goal setting
💡Cooperation
Highlights
Ilya Sutskever, co-founder and research director of OpenAI, discusses the impact of deep learning and AI.
Sutskever's work has been cited over 46,000 times, showcasing his influence in the field.
The concept of finding the shortest program to achieve the best generalization in machine learning is introduced.
The computational intractability of finding the best short program is discussed, highlighting the limitations of current AI tools.
The discovery that small circuits can be optimized using backpropagation is highlighted as a foundational AI principle.
Neural networks are likened to parallel computers capable of complex computations and reasoning.
The potential of reinforcement learning to achieve goals in complex environments is explored.
Meta-learning, or learning to learn, is introduced as a promising but not fully realized concept.
The importance of representation learning and unsupervised learning for identifying high-level states in meta-learning is emphasized.
Self-play is presented as a powerful method for training AI agents, leading to rapid increases in competence.
The potential societal implications and alignment issues of superintelligent AI are discussed.
The use of human feedback for training AI, such as through reinforcement learning, is highlighted.
The limitations of meta-learning, particularly the requirement for training and test distributions to match, are noted.
The potential for AI to develop social skills, language, and other human-like traits through multi-agent interaction is speculated.
The importance of continuous learning and adapting to new environments is emphasized for AI agents.
The challenges of conveying goals to AI agents and aligning their objectives with human values are discussed.
The potential for AI to develop strategies and behaviors through self-organization and interaction with other agents is explored.
The role of complexity theory in understanding the problems that AI can solve and the limitations thereof is examined.
The future of generative language models and the importance of scaling up current models is discussed.
The potential use of evolutionary strategies in reinforcement learning for small, compact objects is mentioned.
The necessity of accurate physical world modeling and simulation for training AI agents is questioned.
Transcripts
welcome back to 6.S099 Artificial General Intelligence. Today we have Ilya Sutskever, co-founder and research director of OpenAI. He started in the machine learning group in Toronto with Geoffrey Hinton, then at Stanford with Andrew Ng, co-founded DNNresearch, spent three years as a research scientist at Google Brain, and finally co-founded OpenAI. Citations aren't everything, but they do indicate impact, and his recent work in the past five years has been cited over forty-six thousand times. He has been the key creative intellect and driver behind some of the biggest breakthrough ideas in deep learning and artificial intelligence ever. So please welcome Ilya. Alright, thanks for the introduction, Lex.
alright thanks for coming to my talk I
will tell you about some work we've done
over the past year on on meta learning
and software open AI and before I dive
into some of the more technical details
of the work I want to spend a little bit
of time talking about deep learning and
why it works at all in the first place
which I think is actually not a self-evident thing, that it should work at all. One fact, and it's actually a fact, a mathematical theorem that you can prove,
is that if you could find the shortest
program that does very very well on your
data then you will achieve the best
generalization possible with a little
bit of modification you can turn it into
a precise theorem
and on a very intuitive level it's easy
to see what it should be the case if you
have some data and you're able to find a
shorter program which generates this
data then you've essentially extracted
all the all conceivable regularity from
this data into your program and then you
can use these objects to make the best
predictions possible like if if you have
data which is so complex but there is no
way to express it as a shorter program
then it means that your data is totally
random there is no way to extract any
regularity from it whatsoever now there
is a little-known mathematical theory behind this, and the proofs of these statements are actually not even that hard,
but the one minor slight disappointment
is that it's actually not possible at
least given today's tools and
understanding to find the best short
program that explains or generates or
solves your problem given your data this
problem is computationally intractable
the space of all programs is a very
nasty space small changes to your
program result in massive changes in the
behavior of the program as it should be
it makes sense you have a loop you
change the inside of the loop of course
you get something totally different so
the space of programs is so hard at
least given what we know today search
there seems to be completely off the
table well if we give up on shorts on
short programs what about small circuits
well it turns out that we are lucky it
turns out that when it comes to small
circuits you can just find the best
small circuits circuits that solves the
problem using back propagation and this
is the miraculous fact on which the rest
of AI stands: it is the fact that when you have a circuit and you impose constraints on your circuit using data, you can find a way to satisfy these constraints using backpropagation, by iteratively making small changes to the weights of your neural network until its predictions satisfy the data. What
this means is that the computational
problem that is solved by backpropagation is extremely profound: it is circuit search. Now, it's not that you can solve it always, but you can solve it sometimes, and you can solve it at those times when you have a practical data set. It is
easy to design artificial data sets for
which you cannot find the best neural
network but in practice that seems to be
not a problem you can think of training
a neural network as solving a system of equations, where you have a large number of equations, each of the form f(x_i, θ) = y_i. So you've got
your parameters and they represent all
your degrees of freedom and you use
gradient descent to push the information
from these equations into the parameters
satisfy them all and you can see that
the neural network let's say one with 50
layers is basically a parallel computer
that is given 50 time steps to run and
you can do quite a lot with 50
time steps of a very very powerful
massively parallel computer so for
example I do I think it is not widely
known that you can learn to sort n n-bit numbers using a modestly sized neural network with just two hidden layers, which is not bad. It's not self-evident, especially since we've been taught that sorting requires log n parallel steps; with a neural network you can sort successfully using only two parallel steps, so something slightly mysterious is going on. Now, these are parallel steps of threshold neurons, so they're doing a little bit more work, and that's the answer to the mystery.
but if you've got 50 such layers you can
do quite a bit of logic quite a bit of
reasoning all inside the neural network
and that's why it works
given the data we are able to find the
best neural network and because the
neural network is deep because it can
run computation inside of its act inside
of its layers the best neural network is
worth finding because that's really what
you need you need something you need the
model class which is worth optimizing
but it also needs to be optimizable and
deep neural networks satisfy both of
these constraints and this is why
everything works this is the basis on
which everything else resides now I want
to talk a little bit about reinforcement
learning so reinforcement learning is a
framework it's a framework of evaluating
agents in their ability to achieve goals
and complicated stochastic environments
you've got an agent which is plugged
into an environment as shown in the
figure right here and for any given
agent you can simply run it many times
and compute its average reward now the
thing that's interesting about the
reinforcement learning framework is that
there exist interesting useful
reinforcement learning algorithms the
framework existed for a long time it
became interesting once we realized that
good algorithms exist. Now, these are not perfect algorithms, but they are good enough to do interesting things, and the mathematical problem is one where you need to
maximize the expected reward now one
important way in which the reinforcement
learning framework is not quite complete
is that it assumes that the reward is
given by the environment you see this
picture: the agent sends an action, while the environment sends it back an observation and a reward; both the observation and the reward are what the environment communicates back.
the way in which this is not the case in
the real world is that we figure out
what the reward is from the observation
we reward ourselves, we are not told; the environment doesn't say hey, here's some negative reward. It's our interpretation of our senses that lets us determine what
the reward is and there is only one real
true reward in life and this is
existence or nonexistence and everything
else is a corollary of that so well what
should our agent be you already know the
answer should be a neural network
because whenever you want to do
something dense it's going to be a
neural network and you want the agent to
map observations to actions so you let
it be parametrized with a neural net and
you apply learning algorithm so I want
to explain to you how reinforcement
learning works this is model free
reinforcement learning the reinforcement
learning has actually been used in
practice everywhere but it's also deeply
it's very robust it's very simple it's
also not very efficient so the way it
works is the following this is literally
the one sentence description of what
happens in short try something new add
randomness directions and compare the
result to your expectation if the result
surprises you if you find that the
results exceeded your expectation then
change your parameters to take those
actions in the future that's it this is
the whole idea of reinforcement learning
try it out see if you like it and if you
do do more of that in the future and
that's it that's literally it this is
the core idea now it turns out it's not
difficult to formalize mathematically
but this is really what's going on
if in a neural network in a regular
neural network like this you might say
okay what's the goal
you run the neural network you get an
answer you compare it to the desired
answer and whatever difference you have
between those two you send it back
to change the neural network. That's supervised learning. In reinforcement learning, you run your neural network, you add a bit of randomness to your actions, and then, if you like the result, your randomness turns into the desired target, in effect. So that's it, it's trivial. Now, the math exists; without explaining what these
equations mean the point is not really
to derive them but just to show that
they exist there are two classes of
reinforcement learning algorithms one of
them is the policy gradient where
basically what you do is that you take
this expression right there, the sum of expected rewards, and you just crunch through the derivatives: you expand the terms, you do some algebra, and you get a derivative, and miraculously the derivative has exactly the form that I told you, which is: try some actions, and if you like them, increase the log probability of those actions. That truly follows from the math. It's very nice when the intuitive explanation has a one-to-one correspondence to what you get in the equation, even though you have to take my word for it if you are not familiar with it. That's the equation at the top.
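The derivative being referred to is the standard policy-gradient identity; in its textbook form, which may differ in notation from the slide, it reads:

\[
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
  \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
\]

which is exactly the "increase the log probability of the actions you liked, weighted by how much you liked the outcome" rule.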
there is a different class of
reinforcement learning algorithms which
is a little bit more difficult to
explain it's called the Q learning based
algorithms they are a bit less stable a
bit more sample efficient and it has the
property that it can learn not only from
the data generated by the actor but from
any other data as well, so it has a different robustness profile, which will be a little bit important later, but it's only going to be a technicality. So yeah, this is the on-policy versus off-policy distinction, but
it's a little bit technical so if you
find this hard to understand don't worry
about it if you already know this then
you already know it so now what's the
potential for enforcement learning
wasn't it promised what is it actually
why should we be excited about it
now there are two reasons the
reinforcement learning algorithms of
today already useful and interesting and
especially if you have a really good
simulation of your world you could train
agents to do lots of interesting things
but what's really exciting is if you can
build a super amazing, sample-efficient reinforcement learning algorithm: you just give it a tiny amount of data
and the algorithm just crunches through
it and extracts every bit of entropy out
of it in order to learn in the fastest
way possible now today our algorithms
are not particularly efficient they are
data inefficient but as our field keeps
making progress this will change next I
want to dive into the topic of meta
learning the goal of meta learning so
meta learning is a beautiful idea that
doesn't really work but it kind of works
and it's really promising too it's
another promising idea so what's the
dream we have some learning algorithms
perhaps you could use those learning
algorithms in order to learn to learn
It would be nice if we could learn to learn. So how would you do that? You take a system which you train not on one task but on many tasks, and you ask that it learn to solve these tasks quickly, and that may actually be enough. So here's how most traditional meta-learning works: you have a
model which is a big neural network. What you do is that instead of training cases you have training tasks, and instead of test cases you have test tasks. So your input, instead of just being your current test case, would be all the information about the test task plus the test case, and you try to output the prediction, the action, for that test case.
so basically you say yeah I'm going to
give you your ten examples as part of
your input to your model figure out how
to make the best use of them it's a
really
straightforward idea: you turn the neural network into the learning algorithm by turning a training task into a training case. A training task becomes a training case; this is meta-learning in just one sentence. And so there have been several
success stories which I I think are very
interesting one of the success stories
of meta learning is learning to
recognize characters quickly. There is a dataset produced at MIT by Lake et al., a data set with a large number of different handwritten characters, and people have been able to train extremely strong meta-learning systems for this task. Another very
successful example of meta learning is
that of neural architecture search from Google, where they found a neural architecture that solved one small problem well, and it then generalized and successfully solved large problems as well. So this is the small-number-of-bits kind of meta-learning, where you learn the architecture, or
maybe even learn a program small program
or learning algorithm which you apply to
new tasks so this is the other way of
doing meta learning so anyway but the
point is what's happening what's really
happening in meta learning in most cases
is that you turn a training task into a
training case, and pretend this is totally normal deep learning. That's it, this is the entirety of meta-learning; everything else is just minor details. Next I want to
dive in so now that I've finished the
introduction section I want to start
discussing different work by different people from OpenAI, and I want to start by talking about hindsight experience replay. It's been a large effort led by Andrychowicz et al. to develop
a learning algorithm for reinforcement
learning
that doesn't solve just one task but it
solves many tasks and it learns to make
use of its experience in a much more
efficient way and I want to discuss one
problem in reinforcement learning it's
actually, I guess, a set of problems which are all related to each other. One really important thing you need to learn to do is to explore: you start out in an environment, you don't know what to do, so what do you do? One very important thing that has to happen is that you must get rewards from time to time; if you try something and you don't get rewards, then how can you learn? That's kind of the crux of the problem: how do you learn? And relatedly, is there any way to meaningfully benefit from the experience of your attempts, from your failures? If you try to achieve a goal and you fail, can you still learn from it? The idea is that instead of asking your
algorithm to achieve a single goal you
want to learn a policy that can achieve
a very large family of goals for example
instead of reaching one state you want
to learn a policy that reaches every
state of your system and what's the
implication anytime you do something you
achieve some state so let's suppose you
say I want to achieve state a I try my
best and I end up achieving state B I
can either conclude well that was
disappointing, I haven't learned almost anything, I still have no idea how to achieve state A. But alternatively I can say, well, wait a second, I've just reached a perfectly good state, which is B; can I learn how to achieve state B from my attempt to achieve state A? The answer is yes, you can, and it just works. And I just want to
point out that there is a small subtlety here which may be interesting to those of you who are very familiar with the distinction between on-policy and off-policy learning: when you try to achieve state A, you are doing on-policy learning for reaching state A, but you're doing off-policy learning for reaching state B, because you would take different actions if you were actually trying to reach B. So that's why it's very important that the algorithm you use here can support off-policy learning,
that's a minor technicality at the crux
the crux of the idea is you make the
problem easier by ostensibly making it
harder by training a system which can
which aspires to reach to learn to reach
every state to learn to achieve every
goal to learn to master its environment
in general you build a system which
always learn something it learns from
success as well as from failure because
if it tries to do one thing one thing
and it does something else
it now has training data for how to
achieve that something else I want to
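A minimal sketch of this hindsight relabeling idea with a toy transition format; it is a simplified illustration, not the actual hindsight experience replay implementation.

```python
def hindsight_relabel(episode):
    """episode: list of (state, action, next_state, intended_goal) transitions."""
    achieved_goal = episode[-1][2]                 # the state we actually ended up in
    relabeled = []
    for state, action, next_state, _ in episode:
        reward = 1.0 if next_state == achieved_goal else 0.0
        # pretend the achieved state was the goal all along
        relabeled.append((state, action, next_state, achieved_goal, reward))
    return relabeled

# toy usage: we tried to reach goal "A" but ended up in state "s2"
episode = [("s0", "right", "s1", "A"), ("s1", "up", "s2", "A")]
print(hindsight_relabel(episode))
# Store both the original transitions (goal A, no reward) and the relabeled ones
# (goal s2, reward at the end), then train an off-policy, goal-conditioned learner on both.
```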
I want to show you a video of how this thing works in practice. So one challenge in
reinforcement learning systems is the
need to shape the reward so what does it
mean
it means that at the start of learning, the system doesn't know much, and it will probably not achieve your goal, so it's important that you design your reward function to give it gradual increments, to make it smooth and continuous, so that even when the system is not very good it still gets some reward. Now, if you give your system a
very sparse reward where the reward is
achieved only when you reach a final
state then it becomes very hard for
normal reinforcement learning algorithms
to solve a problem because naturally you
never get the reward so you never learn
no reward means no learning but here
because you learn from failure as well
as from success
this problem simply doesn't occur, and so this is nice, I think. Let's look at the videos a little bit more; it's nice how it confidently and energetically moves the little green puck to its target, and here's another
one
you
okay, so we can skip this; it works on the physical robot as well, but we can skip it. So I think the point is that the
hindsight experience replay algorithm is
directionally correct because you want
to make use of all your data and not
only a small fraction of it now one huge
question is where do you get the high
level states where do the high level
states come from? Because in the work I've been showing you so far, the system is asked to achieve low-level states. So I think one thing that will become very important for this kind of approach is representation learning and unsupervised learning: figure out what the right states are, what the state space of goals worth achieving is. Now I want to go
through some real meta learning results
and I'll show you a very simple way of doing sim-to-real, from simulation to the physical robot, with meta-learning. This was work by Peng et al., a really nice intern project in 2017. So I think we can agree
that in the domain of robotics it would
be nice if you could train your policy
in simulation and then somehow this
knowledge would carry over to the
physical robot now we can build we can
build simulators that are okay but they
can never perfectly match the real world
unless you want to have an insanely slow
simulator and the reason for that is
that it turns out that simulating contacts is super hard, and I heard somewhere, correct me if I'm wrong, that simulating friction is NP-complete. I'm not sure, but it's stuff like
that so your simulation is just not
going to match reality there will be
some resemblance but that's it
how can we
address this problem and I want to show
you one simple idea. One thing that would be nice is if you could learn a policy that would quickly adapt itself to the real world. Well, if you want to learn a policy that can quickly
adapt we need to make sure it has
opportunities to adapt during training
time so what do we do instead of solving
a problem in just one simulator we add a
huge amount of variability to the
simulator: we say we will randomize the friction, we will randomize the masses, the lengths of the different objects and their dimensions. So you randomize the physics of the simulator in lots of different ways, and then, importantly, you don't tell the policy how you randomized it. So what is it going to do?
then you take your policy and you put it
in an environment, and it says, well, this is really really tough: I don't know what the masses are and I don't know what the frictions are, I need to try things out and figure them out as I get responses from the environment. So you learn a certain degree of adaptability into the policy, and it actually works.
Let me show you what happens when you just train a policy in
simulation and deploy it on the physical
robot and here the goal is to bring the
hockey puck towards the red dot and you
will see that it will struggle and the
reason it struggles is because of the
systematic differences between the
simulator and the real physical robot so
even the basic movement is
difficult for the policy because the
assumptions are violated so much so if
you do the training as I discussed we
train a recurrent neural network policy
which learns to quickly infer properties
of the simulator in order to accomplish
the task you can then give it the real
thing
the real physics and it will do much
better so now this is not a perfect
technique but it's definitely very
promising it's promising whenever you
are able to sufficiently randomize the
simulator so it's definitely very nice
to see the closed-loop nature of the
policy you consider it would push the
hockey puck and would correct it
very very gently to bring it to the goal
yeah so that that was cool so that was
very that was a cool application of meta
learning I want to discuss one more
application of meta learning which is
learning a hierarchy of actions and this
was work done by Frans et al.; actually Kevin Frans, the intern who did it, was in high school when he wrote this paper. So one thing that would be nice is
if reinforcement learning was
hierarchical if instead of simply taking
micro actions you've had some kind of
little subroutines that you could deploy
maybe the term subroutine is a little
bit too crude but if you had some idea
of which action primitives are worth
starting with. Now, no one has been able to get real value add from hierarchical reinforcement learning yet; so far, all the really cool, really convincing results in reinforcement learning do not use it. That's because we haven't quite figured out what's the right way to do hierarchical reinforcement learning.
I just want to show you one very simple
approach where you use meta-learning to
learn to learn a hierarchy of actions so
here's what you do you have in this
specific work, you have a certain number of low-level primitives, let's say you have ten of them, and you have a
distribution of tasks and your goal is
to learn low level primitives such that
when they are used inside a very brief
run of some reinforcement learning
algorithm you will make as much progress
as possible. So the idea is that you want to learn primitives that result in the greatest amount of progress possible when used inside learning. This is a meta-learning setup because you have a distribution of tasks; here we have a little maze, a distribution over mazes, and in this case the little bug learned three policies which move it in a fixed direction, and as a result of
having this hierarchy you're able to
solve problems really fast but only when
the hierarchy is correct
So hierarchical reinforcement learning is still a work in progress, and this work is an interesting proof point of what hierarchical reinforcement learning could be like if it worked. Now I want to just spend one slide
addressing the limitations of high-capacity meta-learning. The specific limitation is that the training task distribution has to be equal to the test task distribution, and I think this is a real limitation, because in reality the new tasks that you want to learn will in some ways be fundamentally different from anything you've seen so far. For example, if you go to school
you learn lots of useful things but then
when you go to work, only a fraction of the things that you've learned carries over; you need to learn quite a few more things from scratch. So meta-learning would struggle with that, because it really assumes that the distribution over the training tasks has to be equal to the distribution over the test tasks. That's the limitation. I think that as we develop better algorithms for being robust when the test tasks are outside the distribution of the training tasks, meta-learning will work much better. Now I want to talk about self-play. Self-play is a very cool
topic that's starting to get attention
only now and I want to start by
reviewing very old work called TD gammon
it's from all the way back in 1992, so it's 26 years old now. It was done by Gerald Tesauro. This work is really incredible because it has so much relevance today. What they did, basically, is they said okay, let's take two neural networks, let them play backgammon against each other, and let them be trained as they play. So it's a super-modern
approach and you would think this was a
paper from 2017 except that then you
look at this plot: it shows ten hidden units, twenty hidden units, forty, and eighty for the different colors, and you notice that the largest neural network works best. So in some ways not much has changed, and this is the evidence.
and in fact they were able to beat the
world champion in backgammon and they
were able to discover new strategies that the best human backgammon players had never noticed, and it was determined that the strategies discovered by TD-Gammon were actually better. So that's pure self-play with Q-learning, which remained dormant until the DQN work on Atari by DeepMind. So now, other examples of self-play
include alphago zero which was able to
learn to beat the world champion and go
without using any external data
whatsoever. Another result in this vein is from OpenAI, our Dota 2 bot, which was able to beat the world champion on the 1v1 version of the game.
and so I want to spend a little bit of
time talking about the allure of self
play and why I think it's exciting so
one important problem that's a that
that's that we must face as we try to
build truly intelligent systems is what
is the task what are we actually
teaching the systems to do and one very
attractive attribute of self play is
that the agents create the environment
by virtue of the agent acting in the
environment the environment becomes
difficult for the other agents and you
can see here an example of an iguana
interacting with snakes that try to eat
it
unsuccessfully this time, so we can see what will happen in a moment as the iguana tries its best. And so the fact that you have this arms race between the snakes and the iguana motivates their development, potentially without bound, and this is what happens, in effect, in biological evolution.
now interesting work in this direction
was done in 1994 by Karl Sims; there is a really cool video on YouTube by Karl Sims, you should check it out, which really shows all the work that he's done. And here you have a little competition between agents where you evolve both the behavior and the morphology, and the agents are trying to gain possession of a green
cube and so you can see that the agents
create the challenge for each other and
that's why they need to develop so one
thing that we did, and this is work by Bansal et al. from OpenAI, is we said okay, can we demonstrate some unusual results in self-play that would really convince us that there is something there? So what we did here is that we created a small ring, and
you have these two humanoid figures and
their goal is just to push each other
outside the ring and they don't know
anything about wrestling they don't know
anything about standing your balance in
each other they don't know anything
about centers of gravity all they know
is that if you don't do a good job then
your competition is going to do a better
job now one of the really attractive
things about self play is that you
always have an opponent that's roughly
as good as you are in order to learn you
need to sometimes win and sometimes lose
but you can't always win sometimes you
must fail sometimes you must succeed so
let's see what will happen here. Yeah, so the green humanoid was able to block the ball. In a well-balanced self-play environment, the competition is always level: no matter how good you are or how bad you are, you have a competitor that poses exactly the right challenge for you. One more thing here: this video
shows transfer learning it takes a
little wrestling humanoid and you take
its friend away, and you start applying large random forces on it and see if it can maintain its balance. The answer turns out to be yes, it can, because it's been trained against an opponent that pushes it, and so that's why, even if it doesn't understand where the force is being applied on it, it's still able to balance itself. So this is one potentially attractive feature of self-play environments: you could learn a certain broad set of skills, although it's really hard to control exactly what those skills will be. And so the biggest open
question with this research is how do
you learn agents in a software
environment such that they do whatever
they do but then they are able to solve
a battery of tasks that is useful for us
that is explicitly specified externally
yeah I also want to want to highlight
one attribute of self play environments
that we've observed in our dota BOTS and
that is that we've seen a very rapid
increase in the competence of the bots
so over the period over the course of
maybe five months we've seen the bots go
from playing totally randomly all the
way to the world champion and the reason
for that is that once you have a self-play environment, if you put compute into it, you turn it into data. Self-play allows you to turn compute into data, and I think you will see a lot more of that; being able to turn compute into, essentially, data and generalization will be extremely important, simply because the
speed of neural net processors will
increase very dramatically over the next
few years so neural net cycles will be
cheap and it will be important to make
use of this new of newly-found
overabundance of cycles
I also want to talk a little bit about
the endgame of the self-play approach. One thing that we know about the human brain is that it has increased in size fairly rapidly over the past two million years. My theory, the reason I think it happened, is that our ancestors got to a point where the thing that's most important for your survival is your standing in the tribe, and less so the tiger and the lion. Once the most important thing is how you deal with those other beings which have a large brain, then it really
helps to have a slightly larger brain
and I think that's what happened and
there exists at least one paper from
science which supports this point of
view: apparently there has been convergent evolution between social apes and social birds in terms of various behaviors, even though the divergence on the evolutionary timescale between apes and birds occurred a very long time ago, and apes and birds have very different brain structures. So I think
what should happen if we succeed if we
successfully follow the path of this
approach is that you should create a
society of agents which will have
language and theory of mind negotiation
social skills trade economy politics
justice system all these things should
happen inside the multi-agent
environment and it will also be some
alignment issue of how do you make sure
that the agents we learn behave in a way
that we want now I want to make a
speculative digression here which is I
want to make the following observation
if you believe that this kind of society
of agents is a plausible place where truly, fully general intelligence will emerge, and if you accept that our experience with the Dota bots, where we've seen a very rapid increase in competence, will carry over once all the details are right; if you assume both of
these conditions then it should follow
that we should see a very rapid increase
in the competence of our agents as they
live in the Society of agents so now
that we've talked about a potentially interesting way of increasing the competence of agents and teaching them social skills, language, and a lot of things that actually exist in humans as well, we want to talk a little bit about how you convey goals to agents. The question of how to convey goals to agents is just a technical problem, but it will be important, because it is a lot more likely than not that the agents we will train will eventually be dramatically smarter than us. This is work by the OpenAI safety team, by Paul Christiano et al. and others. So I'm
just going to show you this video which
basically explains how the whole thing
works: there is some behavior you're looking for, and the human gets to see pairs of behaviors and simply clicks on the one that looks better. After a very modest number of clicks, you can get this little simulated leg to do backflips, and there it goes, doing the backflips. To get this specific behavior, it took about 500 clicks by human
annotators. The way it works is the following: this is a very data-efficient reinforcement learning algorithm, but it is efficient in terms of rewards and not in terms of environment interactions. What you do here is take all the clicks, where each click says this behavior is better than that one, and you fit a numerical reward function to those clicks; you want to fit a reward function which satisfies those clicks, and then you optimize this reward function with reinforcement learning, and it actually works. So this requires 500
bits of information you've also been
able to train lots of Atari games using
several thousand bits of information so
in all these cases you had human and
human annotators or human judges just
like in the previous slide looking at
the pairs of trajectories and clicking
on the one that they thought was better
and here's an example of an unusual goal
where this is a car racing game but the
goal was to train the white car to drive right behind the
orange car so it's a different goal and
it was very straightforward to
communicate this goal using this
approach so then to finish off alignment
is a technical problem it has to be
solved, but of course the determination of the correct goals we want our AI systems to have will be a
very challenging political problem and
on this note I want to thank you so much
for your attention and I just want to
say that will be a happy hour at
Cambridge Brewing Company at 8:45 if you
want to chat more about AI and other
topics please come by I think that
deserves an applause
so neural networks are bio-inspired, but back
propagation doesn't look as though it's
what's going on in the brain because
signals in the brain go one direction
down the axons whereas back propagation
requires the errors to be propagated
back up the the wires so can you just
talk a little bit about that whole
situation where it looks as the brain is
doing something a bit different than our
highly successful algorithms our
algorithm is going to be improved once
we figure out what the brain is doing or
is the brain really sending signals back
even though it's got no obvious way of
doing that what's what's happening in
that area so that's a great question so
first of all I'll say that the true
answer is that the honest answer is that
I don't know but I have opinions and so
I'll say two things
But first of all, if we agree that it is a true fact that backpropagation solves the problem of circuit search, then this
problem feels like an extremely
fundamental problem and for this reason
I think that it's unlikely to go away
Now, you are also right that the brain doesn't obviously do backpropagation, although there have been multiple proposals of how it could be doing it. For example, there's been work by Tim Lillicrap and others where they've shown that it's possible to learn a different set of connections that can be used for the backward pass, and that can result in successful
learning now the reason this hasn't been
like really pushed to the limit by
practitioners is because they say, well, I've got TensorFlow to give me the gradients, I'm just not
going to worry about it but you are
right this is an important issue and you
know one of two things is going to
happen so my personal opinion is that
back propagation is just going to stay
with us till the very end and will
actually build fully human level and
beyond systems before we understand how
the brain does what it does so that's
what I believe but of course it is a
difference that has to be acknowledged
okay thank you do you think it was a
fair matchup for the dota bot and that
person given the constraints of the
system so I'd say that like the biggest
advantage computers have in games like
this like one of the big advantages is
that they obviously have a better
reaction time although in DotA in
particular the number of clicks per
second over the top players is fairly
small which is different from Starcraft
so in Starcraft stuff up is a very
compact mechanically heavy game because
of a large number of units and so the
top players that is click all the time
in DotA every player controls just one
hero and so that greatly reduces the
total number of actions they need to
make now still precision matters I think
that will discover that but what I think
it'll really happen is if you'll
discover that computers have the
advantage in any domain or rather every
domain not yet so do you think that the
emergent behaviors from the agent were
actually kind of directed because the
constraints were already kind of in place, so it was kind of forced to discover those, or do you think that it was actually something quite novel, like wow, it actually discovered these on its own, and you didn't actually bias it by constraining it? So it definitely discovered new strategies, and I
can share an anecdote: we have a pro tester who would test the bots, and he played against them for a long time, and the bots would do all kinds of things against the human player which were effective. Then at some point that pro decided to play against a better pro, and he decided to imitate one of the things that the bot was doing, and by imitating it he was able to defeat the better pro. So I think the strategies it discovers are real, and because the strategies discovered by the bot transfer to humans, it means they are deeply related to the fundamental gameplay. For a long time now I've heard that the
objective of reinforcement learning is
to determine a policy that chooses an
action to maximize the expected reward
which is what you said earlier would you
ever want to look at the standard
deviation of possible rewards does that
even make sense yeah I mean I think for
sure, I think it's really application dependent. One of the reasons to maximize the expected reward is that it's
easier to design algorithms for it
so you write down this equation the
formula you do a little bit of
derivation you get something which
amounts to a nice-looking algorithm now
I think there exist like really there
exist applications where you'd never
want to make mistakes and you want to
work on the standard deviation as well
but in practice it seems that just looking at the expected reward covers a large fraction of the situations you'd like to apply this to. Thanks.
we talked last week about motivations
and that has a lot to do with the
reinforcement and some of the ideas is
that the our motivations are actually
connection with others and cooperation
and I'm wondering if they're thrown off
and I understand it's very popular to
have the computers play these
competitive games but is there any use
in having agents self-play collaborative games? Yeah,
right that's an extremely good question
I think one place from which we can get some inspiration is the evolution of cooperation: we cooperate ultimately because it's much better for you, the person, to be cooperative than not, and so I think what
should happen
if you have a sufficiently open-ended
game then cooperation will be the
winning strategy and so I think we will
get cooperation whether we like it or
not Hey
you mentioned the complexity of this
simulation of friction I was wondering
if you feel that there exists open
complexity theoretic problems relevant
to AI, or whether it's just a matter of finding good approximations to the types of problems that humans tend to solve. Yeah, so
complexity theory well like at a very
basic level we know that whatever
algorithm we're going to run is going to run fairly efficiently on some hardware, so that puts a pretty strict upper bound on the true complexity of the problems
we're solving but by definition we are
solving problems which aren't too hard
in a complexity theoretic sense now it
is also the case that many of the
problems so while the overall thing that
we do is not hard from a complexity
theory makes sense and indeed humans
cannot solve np-complete problems in
general it is true that many of the like
optimization problems that we pose to
our algorithms are intractable in the
general case starting from a neural net
optimization itself it is easy to create
a family of data sets for a neural
network with a very small number of
neurons such that finding the global optimum
is np-complete and so how do we avoid it
well we just try gradient descent anyway
and somehow it works but without
question, we do not solve problems which are truly intractable. So, I mean, I hope this answers the question.
hello
it seems like an important sub-problem
on the path towards AGI will be
understanding language and the state of
generative language modeling right now
is pretty abysmal what do you think are
the most productive research
trajectories towards generative language
models so
I'll first say that you are completely
correct that the situation with language
is still far from great although
progress has been made even without any
particular innovations beyond models
that exist today simply scaling up
models that exist today on larger
datasets is going to go surprisingly far
not even larger datasets, just larger and deeper models. For example, if you trained a language model with a thousand layers, even with the same layer, I think it's going to be a pretty amazing language model. We don't have the cycles for it yet, but I think that will change very soon.
now I also agree with you that there are
some fundamental things missing in a
current understanding of deep learning
which prevent us from really solving the
problem that we want so I think one of
these problems, one of the things that's missing, or that seems patently wrong, is the fact that we train a model, then we stop training the model and freeze it, even though it's the training process where the magic really happens. The magic is that, if you
think about it like the training process
is the truly general part of the whole story, because your TensorFlow code doesn't care which data set it optimizes; it just says, whatever, give me the data set, I don't care which one, I'll solve them all.
so like the ability to do that feels
really special and I think we are not
using it at test time like it's hard to
speculate about like things which you
don't know the answer to, but all I'll say is that simply training bigger, deeper language models will go surprisingly far; scaling up, but also doing things like training at test time and inference at test time, I think would give another important boost to performance. Hi,
thank you for the talk so it seems like
right now another interesting approach
to solving reinforcement learning
problems could be to go down the evolutionary route, using evolutionary strategies, and although they have their caveats, I wanted to know if at OpenAI in particular you're working on something related, and what your general opinion on them is.
So at present I believe that something like evolutionary strategies is not great for reinforcement learning; I think
that normal reinforcement learning
algorithms especially with big policies
are better but I think if you want to
evolve a small compact object like like
a piece of code, for example, I think that would be a place where this would be seriously worth considering. Evolving a beautiful piece of code is a cool idea that hasn't been done yet, so there's still a lot of work to be done before
we get there hi thank you so much for
coming. My question is: you mentioned that determining the right goal is a political problem, so I'm wondering if you can elaborate a bit on that, and also what you think would be the approach for us to maybe get there. Well, I can't really
comment too much; we now have a few people who are thinking about this full-time at OpenAI, and I don't have enough of a super strong opinion to say anything too definitive. All I can say at a very high level is: at some point in the future, sooner or later, whenever it's going to happen, we will build a
computer which can do anything better
than a human it will happen because the
brain is physical the impact on society
is going to be completely massive and
overwhelming it's it's very difficult to
imagine even if you try really hard and
I think what it means is that people who
care a lot and that's what I was
alluding to the fact that this will be
something that many people who care
about strongly and like as the impact
increases gradually with self-driving
cars more automation I think we will see
a lot more people care do we need to
have a very accurate model of the
physical world and then simulate that in
order to have these agents that can
eventually come out into the real world
and do something approaching you know
human level intelligence tasks that's a
very good question. I think if that were the case, we'd be in trouble, but I am very certain that it can be avoided. Specifically, the
real answer has to be that, look, you learn in the simulation: you learn to negotiate, you learn to persist, you learn lots of different useful life lessons in the simulation, and yes, you learn some physics too. But then you go outside to the real world and you have to start over to some extent, because many of your deeply held assumptions will be false. That's one of the reasons
I care so much about never stopping
training you've accumulated your
knowledge, now you go into an environment where some of your assumptions are invalid,
you continue training you try to connect
the new data to your old data and this
is an important requirement from our
algorithms which is already met to some
extent but it will have to be met a lot
more, so that you can take the partial knowledge you've acquired, then go into a new situation and learn some more. It's literally the example of: you go to school and learn useful things, then you go to work. It's not a perfect analogy; your four years of CS undergrad are not going to fully prepare you for whatever it is you need to know at work. It will help somewhat, you'll be able to get off the ground, but there will be lots of new things you need to learn. So
that's the spirit of it, I think, of the school analogy. One of the things you mentioned pretty early on in