Reinforcement Learning Series: Overview of Methods
Summary
TLDR: This video lecture reviews the fundamentals of reinforcement learning and its applications, then digs into the concrete algorithms used to implement reinforcement learning in practice. Reinforcement learning spans fields ranging from neuroscience and behavioral science to optimal control theory and the Bellman equation, all the way to modern deep reinforcement learning. The lecturer provides information useful both for understanding the theory and for solving real problems, and also touches on recent advances in deep reinforcement learning.
Takeaways
- Reinforcement learning sits at the intersection of control and machine learning and has a history of more than 100 years.
- The reinforcement learning problem is for an agent to learn, by interacting with an environment, how to maximize its current or future rewards.
- The reward structure is sometimes sparse, with no feedback arriving until the final outcome.
- A policy is a (possibly probabilistic) set of rules that determines which action to take from the current state, and the value function gives the expected future reward for each state of the system.
- The goal of reinforcement learning is to learn the optimal policy through trial and error.
- There are two broad categories: model-based and model-free reinforcement learning. When a model is available, powerful dynamic programming techniques such as policy iteration and value iteration can be used.
- When no model is available, there are two main families of methods: gradient-free and gradient-based.
- There are on-policy and off-policy methods. On-policy methods always play with the current best policy, while off-policy methods sometimes try sub-optimal actions (see the sketch after this list).
- Q-learning is an off-policy method that can learn even without a model, and deep neural networks can be used to learn the optimal policy more quickly.
- Over the last decade, advances such as DeepMind and AlphaGo have driven an explosion in deep reinforcement learning, enabling machines to play Atari games at human-level performance and to beat professional Go players.
- Deep learning can be used to represent value functions and policies, and policy networks can be trained with gradient-based optimization.
- Actor-critic methods can be trained with deep neural networks and have generated renewed interest in reinforcement learning.
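As a rough illustration of the off-policy exploration point above, here is a minimal epsilon-greedy action-selection sketch in Python; the Q-table shape, `epsilon` value, and function name are illustrative assumptions, not code from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon=0.1):
    """Pick the greedy action most of the time, but explore with probability epsilon."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # occasional exploratory (possibly sub-optimal) move
    return int(np.argmax(Q[state]))           # greedy move under the current Q estimate

# Usage with a made-up 5-state, 3-action Q-table:
Q = np.zeros((5, 3))
action = epsilon_greedy(Q, state=2)
```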
Q & A
What is reinforcement learning?
-Reinforcement learning is a machine learning approach in which an agent learns, by interacting with an environment, how to act so as to maximize its rewards.
What are some applications of reinforcement learning?
-Reinforcement learning is applied in many areas, including board games such as chess and Go, Atari video games, robot control, and self-driving cars.
What does "reward" refer to in reinforcement learning?
-A reward is the feedback or payoff the agent receives from interacting with the environment, for example winning a game of chess or scoring points in Go.
What is a "value function"?
-The value function gives, for each state, the expected sum of future rewards, with future rewards discounted back to the present by a discount factor.
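For reference, the value function described above is typically written as follows; the notation is a standard convention, assumed here rather than quoted from the video:

V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_{0} = s,\ \pi \right]

where \gamma \in [0, 1) is the discount factor and r_t is the reward received at step t while following policy \pi.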
What does "policy" refer to?
-A policy is a (possibly probabilistic) rule by which the agent decides which action to take from its current state. The policy that chooses actions to maximize expected future reward is called the optimal policy.
What is the difference between "model-based" and "model-free" reinforcement learning?
-Model-based reinforcement learning assumes a model of the environment (such as a Markov decision process or a differential equation) and uses it to derive an optimal policy. Model-free reinforcement learning applies when no model of the environment is available and learns an optimal policy through trial and error.
What are "policy iteration" and "value iteration"?
-Policy iteration and value iteration are methods used in model-based reinforcement learning that iteratively update the policy and the value function until they converge to an optimal policy.
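Below is a minimal sketch of tabular value iteration on a toy Markov decision process; the transition probabilities, rewards, and discount factor are made-up assumptions for illustration, not an example from the video.

```python
import numpy as np

# Toy MDP with 3 states and 2 actions (all numbers are illustrative).
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.6, 0.4], [0.0, 0.1, 0.9]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # R[s, a] = expected immediate reward
gamma = 0.9

V = np.zeros(3)
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    Q = R + gamma * (P @ V)          # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy with respect to the converged values
print("values:", V, "policy:", policy)
```

Policy iteration alternates the same kind of Bellman backup with an explicit policy-evaluation step, but converges to the same optimal policy.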
What is the difference between "SARSA" and "Q-learning"?
-SARSA is an on-policy method that updates its estimates using the action actually taken under the current policy. Q-learning is an off-policy method that learns the quality (Q) function for state-action pairs and tends to converge to the optimal policy more quickly.
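For reference, the standard tabular update rules (standard notation, assumed here rather than quoted from the video) differ only in which next action appears in the target:

SARSA (on-policy):
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

Q-learning (off-policy):
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

SARSA bootstraps from the action a_{t+1} its own (possibly exploratory) policy actually takes next, while Q-learning bootstraps from the greedy action, which is what makes it off-policy.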
What is deep reinforcement learning?
-Deep reinforcement learning uses deep neural networks to represent the policy, the value function, or a model of the environment, and learns optimal control strategies with them. This makes it possible to achieve strong performance even in complex environments.
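As a minimal sketch of what "representing a policy with a deep neural network" can look like, here is a small PyTorch example; the library choice, layer sizes, and state/action dimensions are assumptions for illustration, not the architecture of any system mentioned in the video.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 4   # illustrative sizes

# A small policy network: maps an observed state to a distribution over actions.
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
    nn.Softmax(dim=-1),
)

state = torch.randn(1, STATE_DIM)                         # dummy observation
action_probs = policy_net(state)                          # pi(a | s), shape (1, N_ACTIONS)
action = torch.multinomial(action_probs, num_samples=1)   # sample an action to play
```

The network parameters can then be trained with gradient-based optimization (backpropagation) against a reinforcement learning objective such as a policy gradient or a Q-learning loss.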
What is the "actor-critic" method?
-Actor-critic methods use two networks: an "actor" that decides the agent's actions and a "critic" that evaluates how good the current state (or action) is. Training the two together provides both flexibility and faster learning.
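One common way to write the one-step actor-critic updates (a standard textbook formulation, assumed here rather than quoted from the video) uses the critic's temporal-difference error to scale the actor's policy-gradient step:

\delta_t = r_{t+1} + \gamma\, V_w(s_{t+1}) - V_w(s_t)

w \leftarrow w + \beta\, \delta_t\, \nabla_w V_w(s_t), \qquad \theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

Here V_w is the critic (value network) with parameters w, \pi_\theta is the actor (policy network) with parameters \theta, and \alpha, \beta are learning rates.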
What does "delayed reward structure" mean in reinforcement learning?
-A delayed (or sparse) reward structure means the agent may receive no feedback on its actions until much later, often only at the end of a task, for example winning or losing a game of chess. This delay is one of the things that makes reinforcement learning hard, because it is difficult to tell which intermediate actions were actually good.
Outlines
📚 Fundamentals of reinforcement learning and its applications
In this paragraph, the lecturer opens the first lecture of the reinforcement learning video series and describes the fundamentals of reinforcement learning and its applications. Reinforcement learning is a family of learning algorithms that draws on neuroscience, behavioral science, optimization theory, optimal control, and the Bellman equation, and it sits at the intersection of machine learning and control theory. In this series the lecturer plans to dig into practical reinforcement learning algorithms in more detail. The topic is also covered as a chapter in the new second edition of the book "Data-Driven Science and Engineering", where readers can learn more of the details.
🤖 Value functions and optimal policies in reinforcement learning
In this paragraph, the lecturer explains value functions and optimal policies in reinforcement learning. The reinforcement learning problem is for an agent to act by interacting with an environment and try to maximize its current or future rewards. When the reward structure is sparse, feedback about whether an action was good may be delayed, which makes the learning process challenging; animal learning faces the same difficulty. The agent's control strategy, or policy, is a set of rules that determines the action to take in a given state, and the value function gives the expected future reward for each state.
🧠 Model-based and model-free reinforcement learning
In this paragraph, the lecturer explains the difference between model-based and model-free reinforcement learning and the approaches within each. Model-based reinforcement learning assumes a model of the environment (for example a Markov decision process) and uses techniques such as policy iteration and value iteration to learn an optimal policy. Model-free reinforcement learning applies when no model of the environment is available and splits into gradient-free and gradient-based methods. Gradient-free methods improve the policy with algorithms such as SARSA and Q-learning, while gradient-based methods parameterize the policy or value function and update it using gradient-based optimization.
🚀 The rise of deep reinforcement learning
In this paragraph, the lecturer describes the rise of deep reinforcement learning and its impact. Over the last decade, advances from DeepMind, including AlphaGo, have enabled computers to play Atari games at human-level performance and to beat professional Go players. These are remarkable reinforcement learning achievements, in which deep neural networks are used either to learn a model for model-based reinforcement learning or to represent model-free quantities. For example, the value function or the policy can be represented by a deep neural network, and gradient-based optimization is performed by differentiating with respect to the network parameters.
📈 Where the series goes next
In the final paragraph, the lecturer describes what the upcoming videos will cover: policy iteration and value iteration for Markov decision processes, gradient-free methods such as SARSA and Q-learning, the Hamilton-Jacobi-Bellman equation for optimal nonlinear control, policy gradient optimization, and actor-critic methods using deep learning. These videos should help viewers understand the different categories of reinforcement learning and choose the best algorithm for a given problem.
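As a preview of the policy gradient optimization mentioned above, the basic REINFORCE-style gradient estimate is usually written as follows (standard notation, assumed here rather than quoted from the video):

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right], \qquad G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_{k+1}

so the policy parameters \theta can be updated by stochastic gradient ascent using returns G_t sampled from played episodes.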
Keywords
💡reinforcement learning
💡agent
💡environment
💡policy
💡value function
💡model-based
💡model-free
💡dynamic programming
💡deep reinforcement learning
💡policy iteration
💡value iteration
Highlights
Reinforcement learning is a field that merges neuroscience, behavioral science, optimization theory, and control theory.
The intersection of machine learning and control theory is where reinforcement learning lies, focusing on learning effective control strategies.
The reinforcement learning problem involves an agent interacting with an environment through actions and observations to maximize rewards.
The delayed reward structure in reinforcement learning makes it challenging, as feedback might not be received until the end of a task.
A policy in reinforcement learning is a probability of taking an action given a current state, determining the agent's control strategy.
The value function estimates the expected future rewards from a given state under a specific policy, with a discount factor for future rewards.
Model-based reinforcement learning uses a known model of the environment, such as a Markov decision process, to optimize policy and value functions.
Policy iteration and value iteration are powerful techniques for optimizing policy and value functions when a model of the environment is available.
Model-free reinforcement learning approximates dynamic programming without having an explicit model of the environment.
Gradient-based methods in reinforcement learning can speed up optimization when the gradient of the reward or value function is known.
Q-learning is an off-policy learning algorithm that learns the quality function without needing a model of the next state.
Deep reinforcement learning has seen significant advancements in the last decade, with applications such as AlphaGo demonstrating impressive achievements.
Deep neural networks can be used in reinforcement learning for learning models or representing policy and value functions for model-free approaches.
Actor-critic methods have gained renewed interest with the ability to train them using deep neural networks.
The Hamilton Jacobi Bellman equation is used in optimal nonlinear control problems, which can be solved using dynamic programming ideas.
Reinforcement learning algorithms can be categorized into model-based, model-free, on-policy, and off-policy variants, each with their unique applications and considerations.
Transcripts
welcome back so i've started this video lecture series on reinforcement learning and the last
three videos were at a very high level kind of what is reinforcement learning how does it work uh
what are some of the applications but we really didn't dig into too many details on the actual
algorithms of how you implement reinforcement learning in practice and so that's what i'm
actually going to do today and in this next part of this series is something i hope is going to be
really really useful kind of for all of you.
The first thing is i'm going to kind of organize the different approaches of reinforcement learning
This is a massive field that's about 100 years old
This merges neuroscience, behavioral science like Pavlov's dog, optimization theory, optimal control
think Bellman's equation and the Hamilton Jacobi Bellman equation
all the way to modern day deep reinforcement learning
which is kind of how to use powerful machine learning techniques to solve these optimization problems
and you'll remember that in my view of reinforcement learning
this is really at the intersection of machine learning and control theory
so we're essentially machine learning good effective control strategies to interact with an environment
So in this first lecture what i'm gonna do and i think i'm hoping that this is actually super useful for some of you
is i'm going to talk through the organization of these different decisions you have to make
and kind of how you can think about the landscape of reinforcement learning.
Before going on I want to mention this is actually a chapter in the new second edition of our book data driven science
and engineering with myself and Nathan Kutz and reinforcement learning was one of the new chapters
I decided to write so this is a great excuse for me to get to learn more about reinforcement learning
and it's also a nice opportunity for me to kind of get to communicate more details to you
so if you want to download this chapter the link is here, and I'll also put it in the comments below
and i'll have a link to the second edition of the book uh up soon as well probably in the comments
good so um a new chapter you can follow along with all of the videos
and each video kind of um you know follows the chapter
good so before i get into that organizational chart of how you know all of these different
types of reinforcement learning can be thought of i want to just do a really really quick recap
of what is the reinforcement learning problem so in reinforcement learning you have an agent
that gets to interact with the world or the environment through a set of actions sometimes
these are discrete actions sometimes these are continuous actions if i have a robot i might have
a continuous action space whereas if i'm playing a game if i'm the you know the white pieces on a
chess board then i have a discrete set of actions even though it might be kind of high dimensional
and i observe the state of the system at each time step i get to observe the state of the system
and use that information uh to change my actions to try to maximize my current or future rewards
uh through through playing and i'll mention that in lots of applications for example in chess
the reward structure might be quite sparse i might not get any feedback on whether or not i'm making
good moves until the very end when i either win or lose tic-tac-toe backgammon checkers
go are all kind of the same way and that delayed reward structure is one of the things that makes
this reinforcement learning problem really really challenging it's what makes uh you know learning
in animal systems also challenging if you want to teach your dog a trick you know they have to know
kind of step by step what you want them to do and so you actually sometimes have to give them
rewards at intermediate steps to train a behavior and so the agent their control strategy or their
their policy is typically called pi and it basically is a probability of taking action a
given a state s a current state s so this could be a deterministic policy it could be a probabilistic
policy but essentially it's a set of rules that determines what actions i as the agent take given
uh what i sense uh in the environment to maximize my future rewards so that's the policy and again
usually this is written in a probabilistic framework because typically the environment
is written as a probabilistic model and there is something called a value function so given
um you know some policy pi that i take then i can associate a value with being in each of the states
of this system essentially by what is my expected future reward add up all of my future rewards
what's the expectation of that and we'll put in this little discount factor because future rewards
might be less uh advantageous to me than current rewards this is just something that people
do in economic theory it's kind of like a you know utility function and so for every policy pi
there is a value associated with being in each of the given states s now again i'll point out
for for even reasonably sophisticated problems you can do this for tic-tac-toe you can enumerate all
possible states and all possible actions and you can compute this value function kind of through
brute force but even for moderately complicated games even like checkers let alone backgammon
or chess or go this state space the space of all possible states you could observe
your system in is astronomically large i think it's estimated that there's 10 to the 80 plus um
maybe it's 10 to the 180 actually it's a huge number of possible chess boards even more possible go boards
and so you can't really actually enumerate this value function but it's a good um kind
of abstract function that we can think about are these policy functions and these value functions
and at least in simplistic dynamic programming we often assume that we know a model of of our
environment and i'll get to that in a minute so the goal here the entire goal of reinforcement
learning is to learn through trial and error through experience what is the optimal policy
to maximize your future rewards okay so notice that this value function is a function of policy
pi i want to learn the best possible policy that always gives me the most value out of every board
position out of every state and that's a really hard problem it's easy to state the goal and it
is really really hard to solve this problem that's why it's been uh you know this growing field for
100 years that's why it's still a growing field because we have more powerful emerging techniques
in machine learning to start to solve this problem so that's the framework i need you to know what
the policy is that is the set of kind of rules or uh you know controllers that i as an agent
get to take to manipulate my environment the value function tells me how valuable it is to be in a
particular state so i might want to move myself into that state so i need you to know kind of this
nomenclature so now i can kind of show you how all of these techniques are organized okay so that's
what i'm going to do for the rest of the video is we're going to talk through kind of the key
organization of all of the different uh like like mainstream types of reinforcement learning
okay so the first biggest dichotomy is between model-based and model-free reinforcement learning
so if you actually have a good model of your environment you have some you know markov decision
process or some differential equation if you have a good model to start with then you can work in
this model based reinforcement learning world now some people don't actually consider what i'm about
to talk about reinforcement learning but i do okay so for example if my environment is a
markov decision process which means that there is a probability kind of a deterministic probability
that sounds like an oxymoron but if there is a specified probability of moving from state s
to the next state s prime given action a and this probability function is known so it's a
it doesn't depend on the history of your actions and states it only depends on the current state
and the current action determines a probability of going to a next state s prime then uh two really
really powerful techniques that i'm going to tell you about to optimize the the policy function pi
is policy iteration and value iteration these allow you to essentially iteratively walk through
um the game or or the markov decision process taking actions that you think are going to
be the best and then assessing what the value of that action and state actually
are and then kind of refining and iterating the policy function and the value function
so that is a really really powerful approach if you have a model of your system you can
run this kind of on a computer and kind of determine or learn what the best policy and value is
and this is kind of a special case of dynamic programming that relies on the bellman optimality
condition for the value function so i'm going to do a whole lecture
on this blue part right here we're going to talk about policy iteration and value iteration
and how they are essentially using dynamic programming on the value function which satisfies
bellman's optimality condition now that was for probabilistic uh processes things where maybe
you know like backgammon where there's a dice roll at every turn for deterministic systems like
a robot system or a self-driving car if i think about a human you know my reinforcement learning
problem this is much more of a continuous control problem in which case i might have some nonlinear
differential equation x dot equals f of x comma u and so the linear optimal control that we studied
i guess in chapter 8 of the book so linear quadratic regulators kalman filters things like
that optimal linear control problems are special cases of this optimal non-linear control problem
with the hamilton jacobi bellman equation again this relies on bellman optimality and
you can use kind of dynamic programming ideas to solve optimal nonlinear control problems
like this now i'll point out mathematically this is a beautiful theory it's powerful
it's been around for decades and it's you know kind of the textbook way of thinking about how to
design optimal policies and optimal controllers you know for markov decision processes and for
non-linear control systems in practice actually solving these things with dynamic programming
ends up usually amounting to a brute force search and it's usually not scalable to high dimensional
systems so typically it's hard to do this optimal hamilton jacobi bellman type non-linear control
for an even moderately high dimensional system you know you can do this for a three-dimensional
system sometimes a five-dimensional system maybe you know i've heard special cases with
machine learning you can do this maybe for a 10 or 100 dimensional system but you can't do this for
the nonlinear fluid flow equations which might have you know a hundred thousand or a million
dimensional differential equation when you write it down on your computer so important caveat there
but that's model based control and a lot of what we're going to do in model free control uses ideas
that we learned from model based control so even though you know i don't actually do a lot of this
in my daily life with reinforcement learning most of the time we don't have a good model
of our system for example in chess i don't have a model of my opponent for example or at least i
can't write it down mathematically as a markov decision process so i can't really use these
techniques but a lot of what model free control reinforcement learning is going to do is kind of
approximate dynamic programming where you're simultaneously learning kind of the dynamics
or learning to update these these functions through trial and error without actually having
a model and so in model-free reinforcement learning kind of the major dichotomy here
is between gradient-free and gradient-based methods and i'll tell you what this means in a
little bit but for example if i can parameterize my policy pi by some variables theta and i know
kind of what the dependency with those variables theta are i might be able to take the gradient
of my reward function or my value function with respect to those parameters directly
and speed up the optimization okay so gradient based if you if you can use it is usually going
to be the fastest most efficient way to do things but oftentimes again we don't have
gradient information we're just playing games we're playing chess we're playing go and i can't
compute the derivative of one game with respect to another that's hard for me at least to do
and so within gradient free okay there's a lot of dichotomies here there's a dichotomy of a
dichotomy of a dichotomy within gradient free control there is this idea of sometimes you can
be off policy or on policy and it's a really important uh distinction what on policy means
is that let's say i'm playing a bunch of games of chess i'm trying to learn an optimal policy
function or an optimal value function or both by playing games of chess and iteratively kind of
refining my estimate of pi or of v what on policy means is that i always play my best game possible
whatever i think the value function is and whatever i think my best policy possible is
i'm always going to use that best policy as i play my game and i'm going to always try to kind of get
the most reward out of my system every game i play that's what it means to be on policy
off policy means well maybe i'll try some things maybe maybe i know that my policy is suboptimal
and so i'm just going to do some like random moves occasionally that is called off policy because i
think they're sub-optimal but they might be really valuable for learning information about the system
uh so on policy methods include this sarsa state action reward state action
and there's all of these variants of the sarsa algorithm this on policy reinforcement learning
and these tds mean temporal difference and mc is monte carlo and so there's this whole family
of kind of gradient-free optimization techniques that use different kind of amounts of history i'll
talk all about that that's going to be a whole other lecture is this this red box gradient free
model free reinforcement learning and so the off policy version of sarsa kind of this on policy
set of algorithms there is an off policy variant called q learning and so this quality function
q is kind of the joint value if you like of being in a particular state
and taking a particular action a so this quality function contains all of the information of my
my optimal policy and the value function and both of these can be derived from the quality function
but the really important distinction is that when we learn based on the quality function we don't
need a model for what my next state is going to be this quality function kind of implicitly defines
the value of you know based on where you're going to go in the future and so q learning is a really
nice way of learning when you have no model and you can take off policy information and learn
from that you can take a sub-optimal controller just to see what happens and still learn and get
better policies and better value functions in the future and that's also really important if you
want to do imitation learning if i want to just watch other people play games of chess even though
i don't know what their value function is or what their policy is with these off policy learning
algorithms you can accumulate that information into your estimate of the world and every bit of
information you get improves your quality function and it improves the next game you're going to play
so really powerful and i would say most of what we do nowadays you know is kind of in this q learning
world a lot of machine learning reinforcement learning is q-learning
and then the gradient-based algorithms i'm not going to talk about it too much here but it's
essentially where you would actually update the parameters of your policy or your value function
or your q function directly using some kind of a gradient optimization so if i can sum up all of my
future rewards and it's a function of the current parameters theta that parameterize my policy
then i might be able to use gradient optimization things like newton's steps and steepest descent
things like that to get a good estimate and this when i have the ability to do that
is going to be way way faster uh than any of these gradient free methods and
even in turn uh will be faster than dynamic programming and so the last piece of this
is kind of in the last 10 years we've had this massive explosion of deep reinforcement learning
a lot of this has been because of deep mind and alphago you know demonstrating that machines
computers can play atari games at human level performance they can beat grandmasters at go
just incredibly impressive demonstrations of reinforcement learning that now use deep
neural networks either to learn a model where you can then use model-based reinforcement learning
or to represent these kind of model-free concepts so you can have like a deep neural network for
the quality function you can have a deep neural network for the policy and then differentiate
with respect to those network parameters uh using kind of you know auto diff and back propagation
uh to do gradient based optimization on your policy network i would say that deep model
predictive control this doesn't exactly fit into the reinforcement learning world but i would say
you know it's it's morally very closely related deep model predictive control allows you to solve
these kind of hard optimal nonlinear problems and then you can actually learn a policy based
on what your model predictive controller actually does you can essentially kind of codify that model
predictive controller into a control policy and finally uh actor critic methods um actor critic
methods existed long before deep reinforcement learning but nowadays they have kind of a renewed
interest uh because you can train these with deep neural networks
okay so that is the mile-high view as i see it of the different categories of reinforcement learning
is this comprehensive absolutely not is it a hundred percent
factually correct definitely not this is you know a rough sketch of the main divides and
things you need to think about when you're choosing a reinforcement learning algorithm
if you have a model of your system you can use dynamic programming based on bellman optimality
if you don't have a model of the system you can either use gradient free or gradient based methods
and then there's on policy and off policy variants depending on you know your specific needs it turns
out to be that sarsa methods are more conservative and q learning will tend to converge faster
and then for all of these methods there are ways of kind of making them more powerful
and more flexible representations using uh deep neural networks in kind of different focused ways
okay so in the next few videos we'll zoom into you know this part here for markov decision processes
how we do policy iteration and value iteration we'll actually derive the quality function uh here
we'll talk about model free control these kind of gradient free methods on policy and off policy
q learning is one of the most important ones and temporal difference learning actually has a lot of
neuroscience analogs so how we learn in our animal brains people think you know is very
closely related to these td learning policies we'll talk about how you do optimal nonlinear
control with the hamilton jacobi bellman equation uh we'll talk very briefly about policy gradient
optimization and then you know all of these there are kind of deep learning things uh
we'll pepper it throughout with deep learning or maybe i'll have a whole lecture on on these
deep learning methods so that's all coming up really excited to walk you through this thank you