Reinforcement Learning Series: Overview of Methods

Steve Brunton
3 Jan 2022 · 21:37

Summary

TLDR: This video lecture explains the foundations of reinforcement learning and its applications, and then digs into the algorithms for implementing reinforcement learning in practice. Reinforcement learning spans a wide range of fields, from neuroscience and behavioral science to optimal control theory, the Bellman equation, and modern deep reinforcement learning. The lecturer provides material useful both for the theory of reinforcement learning and for solving real problems, and also touches on recent advances in deep reinforcement learning.

Takeaways

  • Reinforcement learning sits at the intersection of control and machine learning and has roughly 100 years of history.
  • The reinforcement learning problem is for an agent to learn, by interacting with an environment, to maximize its current or future rewards.
  • The reward structure is sometimes sparse, with no feedback until the final outcome.
  • A policy is a (possibly probabilistic) set of rules that determines an action from the current state, and the value function gives the expected future reward for each state of the system.
  • The goal of reinforcement learning is to learn an optimal policy through trial and error.
  • There are two broad categories: model-based and model-free reinforcement learning. With a model, powerful dynamic-programming techniques such as policy iteration and value iteration can be used.
  • Without a model, there are two main sub-categories: gradient-free and gradient-based methods.
  • There are on-policy and off-policy methods. On-policy methods always play with the best policy currently available, while off-policy methods may deliberately try sub-optimal actions.
  • Q-learning is an off-policy method that can learn even without a model, and deep neural networks can be used to learn the optimal policy more quickly.
  • Over the last decade, advances such as DeepMind and AlphaGo have driven an explosion in deep reinforcement learning, with machines playing Atari games at human-level performance and defeating professional Go players.
  • Deep learning is used to represent value functions and policies, and policy networks can be trained with gradient-based optimization.
  • Actor-critic methods can be trained with deep neural networks and have generated renewed interest in reinforcement learning.

Q & A

  • What is reinforcement learning?

    -Reinforcement learning is a machine learning approach in which an agent learns by interacting with an environment so as to maximize the rewards it receives.

  • What are some applications of reinforcement learning?

    -Reinforcement learning is applied across many domains, including chess and Go, Atari and other game-playing AI, robotics, and the control of self-driving cars.

  • What does "reward" mean in reinforcement learning?

    -A reward is the feedback or payoff the agent receives through its interaction with the environment, for example winning a game of chess or scoring points in Go.

  • What is the "value function"?

    -The value function gives the expected future reward obtainable from a given state, with future rewards discounted back to the present by a discount factor.

  • What does "policy" mean?

    -A policy is a (possibly probabilistic) rule that determines which action the agent should take from its current state. The policy that selects actions to maximize future rewards is called the optimal policy.

  • What is the difference between model-based and model-free reinforcement learning?

    -Model-based reinforcement learning assumes you have a model of the environment (a Markov decision process or a differential equation) and uses it to learn the optimal policy. Model-free reinforcement learning applies when no model of the environment is available and learns the optimal policy through trial and error.

  • What are "policy iteration" and "value iteration"?

    -Policy iteration and value iteration are methods used in model-based reinforcement learning. They iteratively update the policy and the value function until they converge to an optimal policy.

  • What is the difference between SARSA and Q-learning?

    -SARSA is an on-policy method that learns from the state-action pairs actually taken under the current policy. Q-learning is an off-policy method that learns a quality function over state-action pairs and tends to converge to the optimal policy more quickly. (A minimal sketch of both update rules appears after this Q&A list.)

  • What is deep reinforcement learning?

    -Deep reinforcement learning uses deep neural networks to represent the policy, the value function, or a model, and learns optimal control strategies with them. This makes strong learning performance possible even in complex environments.

  • What are "actor-critic" methods?

    -Actor-critic methods learn an optimal policy using two components: an "actor" that decides the agent's actions and a "critic" that evaluates states. This combination provides both flexibility and learning speed.

  • What does "delayed reward structure" mean in reinforcement learning?

    -A delayed reward structure means the agent may not receive feedback until long after its actions, often only at the end of a task (for example, winning or losing a game). This delay, together with sparse rewards, is one of the things that makes the reinforcement learning problem hard.
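
A minimal sketch of the two tabular update rules contrasted above (illustrative only; the learning rate, discount factor, and `Q` table layout are assumptions, not values from the video):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap with the action a_next actually chosen by the current policy."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap with the greedy (max) action, regardless of what is played next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```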

Outlines

00:00

📚 Foundations of reinforcement learning and its applications

In this paragraph, the lecturer opens the first lecture of this reinforcement learning video series and explains the foundations of reinforcement learning and its applications. Reinforcement learning is a family of learning algorithms that draws on neuroscience, behavioral science, optimization theory, optimal control, and the Bellman equation, and it sits at the intersection of machine learning and control theory. The lecturer plans to dig deeply into practical reinforcement learning algorithms over the course of the series. The topic is also covered in a chapter of the new second edition of the book "Data-Driven Science and Engineering," where readers can find further details.

05:04

🤖 Value functions and optimal policies in reinforcement learning

In this paragraph, the lecturer explains value functions and optimal policies. The reinforcement learning problem is for an agent to interact with an environment through actions so as to maximize its current or future rewards. When the reward structure is sparse, feedback on whether an action was good may be delayed, which makes the learning process challenging; animal learning faces the same difficulty. The agent's control strategy, or policy, is a set of rules that determines the action to take in a given state, and the value function gives the expected future reward for each state.
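
As a compact reference for the two objects described above (standard notation; the reward and discount factor are as discussed in the lecture), the policy and the value function can be written as:

```latex
\pi(a \mid s) = \Pr(a_t = a \mid s_t = s), \qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s\right],
\qquad 0 \le \gamma \le 1 .
```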

10:10

🧠 Model-based and model-free reinforcement learning

In this paragraph, the lecturer explains the difference between model-based and model-free reinforcement learning and the approaches within each. Model-based reinforcement learning assumes a model of the environment (for example a Markov decision process) and learns an optimal policy using techniques such as policy iteration and value iteration. Model-free reinforcement learning applies when no model of the environment is available and splits into gradient-free and gradient-based methods. Gradient-free methods improve the policy with techniques such as SARSA and Q-learning, while gradient-based methods parameterize the policy and update it by taking gradients of the reward or value function with respect to those parameters.
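
As a minimal sketch of the model-based branch described above (not code from the video or the book; the transition array `P` and reward array `R` are assumed inputs for illustration), tabular value iteration might look like:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration for a finite MDP.

    P: transition probabilities, shape (n_states, n_actions, n_states)
    R: expected immediate reward, shape (n_states, n_actions)
    Returns the optimal value function and a greedy (deterministic) policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)        # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)          # greedy policy with respect to the converged values
    return V, policy
```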

15:14

🚀 The rise of deep reinforcement learning

In this paragraph, the lecturer explains the rise of deep reinforcement learning and its impact. Over the last decade, advances such as DeepMind and AlphaGo have enabled computers to play Atari games at human-level performance and to defeat professional Go players. These are remarkable achievements for reinforcement learning. Deep neural networks can be used either to learn a model and then apply model-based reinforcement learning, or to represent model-free quantities: for example, the value function or the policy can be represented by a deep neural network, and the network parameters can be differentiated to perform gradient-based optimization.
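
As a minimal sketch of what representing the Q-function with a deep neural network can look like (an illustrative assumption, not the architecture used by DeepMind or in the lecture):

```python
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def epsilon_greedy(q_net: QNetwork, state: torch.Tensor, n_actions: int, eps: float = 0.1) -> int:
    """Exploration rule: act greedily most of the time, randomly with probability eps."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state).argmax().item())
```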

20:17

📈 Where the series goes next

In the final paragraph, the lecturer describes what the upcoming videos will cover: policy iteration and value iteration for Markov decision processes; gradient-free methods such as SARSA and Q-learning; the Hamilton-Jacobi-Bellman equation for optimal nonlinear control; policy gradient optimization; and actor-critic methods trained with deep learning. Working through these methods will help viewers understand the different categories of reinforcement learning and choose the most suitable algorithm.
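
For reference, one common infinite-horizon form of the Hamilton-Jacobi-Bellman equation mentioned above is shown below; the running cost and the dynamics follow the x-dot = f(x, u) setup from the lecture, but this exact statement is a standard textbook form rather than one written out in the video:

```latex
0 \;=\; \min_{u}\,\Bigl[\, \mathcal{L}(x,u) \;+\; \nabla_x V(x)^{\!\top} f(x,u) \,\Bigr],
\qquad \dot{x} = f(x,u),
```

where V(x) is the optimal cost-to-go and the minimizing u defines the optimal feedback control.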

Keywords

💡reinforcement learning

Reinforcement learning is the process by which an agent learns effective control strategies that maximize reward through interaction with an environment. This video explains both the theory of reinforcement learning and practical algorithms.

💡agent

The agent is the entity that interacts with the environment. The video explains how the agent learns to act so as to maximize its reward.

💡environment

The environment is the external world or setting in which the agent acts. The video explains how the agent learns through its interactions with the environment.

💡policy

A policy is the set of rules that determines which action the agent takes in a given state of the environment. Learning an optimal policy is presented as the goal of reinforcement learning.

💡value function

The value function gives the expected (discounted) future reward from a given state and is used to judge which states are good for the agent to be in. Maximizing value is presented as a central goal of reinforcement learning.

💡model-based

Model-based reinforcement learning refers to methods used when a model of the environment is known in advance; the video explains the theory behind these methods and where they apply.

💡model-free

Model-free reinforcement learning refers to methods used when no model of the environment is available; the video explains this approach and the techniques that go with it.

💡dynamic programming

Dynamic programming is the family of algorithms used to find optimal policies and value functions; the video explains its theory and its role in reinforcement learning.
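
For reference, the Bellman optimality condition that these dynamic programming methods rest on can be stated (for a discounted, finite MDP; notation is standard rather than taken verbatim from the video) as:

```latex
V^{*}(s) \;=\; \max_{a}\,\Bigl[\, R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \,\Bigr].
```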

💡deep reinforcement learning

Deep reinforcement learning applies deep learning techniques within reinforcement learning; the video discusses its recent development and applications.

💡policy iteration

Policy iteration is an algorithm used in model-based reinforcement learning that finds an optimal policy by alternately evaluating and improving the policy; the video explains the process and why it matters.

💡value iteration

Value iteration is an algorithm used in model-based reinforcement learning that repeatedly updates the value function in order to find an optimal policy; the video explains the process and its applications.

Highlights

Reinforcement learning is a field that merges neuroscience, behavioral science, optimization theory, and control theory.

The intersection of machine learning and control theory is where reinforcement learning lies, focusing on learning effective control strategies.

The reinforcement learning problem involves an agent interacting with an environment through actions and observations to maximize rewards.

The delayed reward structure in reinforcement learning makes it challenging, as feedback might not be received until the end of a task.

A policy in reinforcement learning is a probability of taking an action given a current state, determining the agent's control strategy.

The value function estimates the expected future rewards from a given state under a specific policy, with a discount factor for future rewards.

Model-based reinforcement learning uses a known model of the environment, such as a Markov decision process, to optimize policy and value functions.

Policy iteration and value iteration are powerful techniques for optimizing policy and value functions when a model of the environment is available.

Model-free reinforcement learning approximates dynamic programming without having an explicit model of the environment.

Gradient-based methods in reinforcement learning can speed up optimization when the gradient of the reward or value function is known.

Q-learning is an off-policy learning algorithm that learns the quality function without needing a model of the next state.

Deep reinforcement learning has seen significant advancements in the last decade, with applications such as AlphaGo demonstrating impressive achievements.

Deep neural networks can be used in reinforcement learning for learning models or representing policy and value functions for model-free approaches.

Actor-critic methods have gained renewed interest with the ability to train them using deep neural networks.

The Hamilton-Jacobi-Bellman equation is used in optimal nonlinear control problems, which can be solved using dynamic programming ideas.

Reinforcement learning algorithms can be categorized into model-based, model-free, on-policy, and off-policy variants, each with their unique applications and considerations.

Transcripts

00:09

Welcome back. I've started this video lecture series on reinforcement learning, and the last three videos were at a very high level: what is reinforcement learning, how does it work, and what are some of the applications. But we didn't really dig into many details on the actual algorithms of how you implement reinforcement learning in practice, and that's what I'm going to do today and in this next part of the series, which I hope is going to be really useful for all of you. The first thing I'm going to do is organize the different approaches to reinforcement learning. This is a massive field that's about 100 years old. It merges neuroscience, behavioral science (think Pavlov's dog), optimization theory, and optimal control (think Bellman's equation and the Hamilton-Jacobi-Bellman equation), all the way to modern-day deep reinforcement learning, which is about using powerful machine learning techniques to solve these optimization problems. You'll remember that in my view reinforcement learning is really at the intersection of machine learning and control theory: we are essentially machine-learning good, effective control strategies to interact with an environment. So in this first lecture, what I'm going to do, and I'm hoping this is actually super useful for some of you, is talk through the organization of these different decisions you have to make and how you can think about the landscape of reinforcement learning.

01:51

Before going on, I want to mention that this is actually a chapter in the new second edition of our book, Data-Driven Science and Engineering, with myself and Nathan Kutz; reinforcement learning was one of the new chapters I decided to write. This was a great excuse for me to learn more about reinforcement learning, and it's also a nice opportunity to communicate more details to you. If you want to download this chapter, the link is here, and I'll also put it in the comments below, along with a link to the second edition of the book soon. So there's a new chapter you can follow along with for all of the videos, and each video follows the chapter.

02:37

Good. Before I get into that organizational chart of how all of these different types of reinforcement learning can be thought of, I want to do a really quick recap of what the reinforcement learning problem is. In reinforcement learning you have an agent that gets to interact with the world, or the environment, through a set of actions. Sometimes these are discrete actions, sometimes they are continuous actions: if I have a robot, I might have a continuous action space, whereas if I'm playing a game, say as the white pieces on a chess board, then I have a discrete set of actions, even though it might be quite high-dimensional. And I observe the state of the system: at each time step I get to observe the state and use that information to change my actions, to try to maximize my current or future rewards through playing.

I'll mention that in lots of applications, for example in chess, the reward structure might be quite sparse: I might not get any feedback on whether or not I'm making good moves until the very end, when I either win or lose. Tic-tac-toe, backgammon, checkers, and Go are all the same way, and that delayed reward structure is one of the things that makes this reinforcement learning problem really challenging. It's also what makes learning in animal systems challenging: if you want to teach your dog a trick, they have to know step by step what you want them to do, and so you sometimes have to give rewards at intermediate steps to train a behavior. The agent's control strategy, or policy, is typically called pi, and it is basically a probability of taking action a given a current state s. This could be a deterministic policy or a probabilistic policy, but essentially it's a set of rules that determines what actions I, as the agent, take, given what I sense in the environment, to maximize my future rewards. So that's the policy.
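
A minimal sketch of this agent-environment loop (the `env` and `policy` interfaces below are assumptions for illustration, not code from the lecture or the book):

```python
def run_episode(env, policy, gamma=0.99):
    """One episode of the agent-environment loop: observe state, act, collect reward."""
    state = env.reset()                         # observe the initial state
    total_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                  # the policy maps the observed state to an action
        state, reward, done = env.step(action)  # the environment returns the next state and a reward
        total_return += discount * reward       # accumulate the discounted reward
        discount *= gamma
    return total_return
```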

04:45

Again, usually this is written in a probabilistic framework, because typically the environment is written as a probabilistic model. And there is something called a value function: given some policy pi that I follow, I can associate a value with being in each state of the system, essentially by asking what my expected future reward is. Add up all of my future rewards and take the expectation, and we'll put in a little discount factor, because future rewards might be less advantageous to me than current rewards; this is just something people do in economic theory, like a utility function. So for every policy pi there is a value associated with being in each of the given states s. Now, I'll point out that even for reasonably sophisticated problems you can't do this exhaustively. For tic-tac-toe you can enumerate all possible states and all possible actions and compute this value function through brute force, but even for moderately complicated games like checkers, let alone backgammon or chess or Go, the state space, the space of all possible states you could observe your system in, is astronomically large. I think it's estimated that there are ten to the eighty plus, maybe ten to the one hundred eighty, possible chess boards, and even more possible Go boards. So you can't actually enumerate this value function, but it is a good abstract object to think about, alongside the policy function. And at least in simplistic dynamic programming we often assume that we know a model of our environment; I'll get to that in a minute.
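
A small worked example of "add up all of my future rewards with a discount factor" (the reward sequence below is made up for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * r_k over a sequence of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Three future rewards of 1.0 each, discounted at gamma = 0.9:
# 1.0 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```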

06:37

So the goal, the entire goal of reinforcement learning, is to learn through trial and error, through experience, what the optimal policy is to maximize your future rewards. Notice that this value function is a function of the policy pi. I want to learn the best possible policy, the one that always gives me the most value out of every board position, out of every state, and that's a really hard problem. It's easy to state the goal, and it is really hard to solve; that's why this has been a growing field for a hundred years, and why it's still growing, because we have more powerful emerging techniques in machine learning to start to solve this problem. So that's the framework. I need you to know what the policy is, the set of rules or controllers that I as an agent use to manipulate my environment, and the value function, which tells me how valuable it is to be in a particular state, so that I might want to move myself into that state. With that nomenclature in hand, I can show you how all of these techniques are organized. That's what I'm going to do for the rest of the video: we're going to talk through the key organization of the mainstream types of reinforcement learning.

07:53

The first, biggest dichotomy is between model-based and model-free reinforcement learning. If you actually have a good model of your environment, some Markov decision process or some differential equation, then you can work in this model-based reinforcement learning world. Now, some people don't actually consider what I'm about to talk about to be reinforcement learning, but I do. For example, if my environment is a Markov decision process, that means there is a specified probability (a "deterministic probability," which sounds like an oxymoron) of moving from state s to the next state s prime given action a, and this probability function is known. It doesn't depend on the history of your actions and states; only the current state and the current action determine the probability of going to a next state s prime.

In that case, two really powerful techniques that I'm going to tell you about for optimizing the policy function pi are policy iteration and value iteration. These allow you to iteratively walk through the game, or the Markov decision process, taking actions that you think are going to be the best, then assessing what the value of that action and state actually are, and then refining and iterating the policy function and the value function. That is a really powerful approach: if you have a model of your system, you can run this on a computer and learn what the best policy and value are. This is a special case of dynamic programming that relies on the Bellman optimality condition for the value function. I'm going to do a whole lecture on this blue part right here: we'll talk about policy iteration and value iteration and how they are essentially dynamic programming on the value function, which satisfies Bellman's optimality condition.

10:10

Now, that was for probabilistic processes, things like backgammon where there's a dice roll at every turn. For deterministic systems like a robot or a self-driving car, my reinforcement learning problem is much more of a continuous control problem, in which case I might have some nonlinear differential equation, x dot equals f of x comma u. The linear optimal control that we studied, I believe in chapter 8 of the book, so linear quadratic regulators, Kalman filters, things like that, those optimal linear control problems are special cases of this optimal nonlinear control problem with the Hamilton-Jacobi-Bellman equation. Again, this relies on Bellman optimality, and you can use dynamic programming ideas to solve optimal nonlinear control problems like this. Now, mathematically this is a beautiful theory. It's powerful, it's been around for decades, and it's the textbook way of thinking about how to design optimal policies and optimal controllers for Markov decision processes and for nonlinear control systems. In practice, actually solving these things with dynamic programming usually amounts to a brute-force search, and it's usually not scalable to high-dimensional systems. So typically it's hard to do this Hamilton-Jacobi-Bellman type nonlinear control for even a moderately high-dimensional system. You can do this for a three-dimensional system, sometimes a five-dimensional system; I've heard of special cases where, with machine learning, you can do it for maybe a 10- or 100-dimensional system. But you can't do this for the nonlinear fluid flow equations, which might be a hundred-thousand- or million-dimensional differential equation when you write it down on your computer. So that's an important caveat.

12:07

But that's model-based control, and a lot of what we're going to do in model-free control uses ideas that we learned from model-based control. Even though I don't actually do a lot of this in my daily life with reinforcement learning, most of the time we don't have a good model of our system. For example, in chess I don't have a model of my opponent, or at least I can't write it down mathematically as a Markov decision process, so I can't really use these techniques. But a lot of what model-free reinforcement learning does is a kind of approximate dynamic programming, where you're simultaneously learning the dynamics, or learning to update these functions, through trial and error without actually having a model. In model-free reinforcement learning, the major dichotomy is between gradient-free and gradient-based methods, and I'll tell you what this means in a little bit. For example, if I can parameterize my policy pi by some variables theta, and I know what the dependency on those variables theta is, I might be able to take the gradient of my reward function or my value function with respect to those parameters directly and speed up the optimization. So gradient-based methods, if you can use them, are usually going to be the fastest, most efficient way to do things. But oftentimes we don't have gradient information: we're just playing games, we're playing chess, we're playing Go, and I can't compute the derivative of one game with respect to another; that's hard for me, at least, to do.

13:46

And so within gradient-free methods, okay, there are a lot of dichotomies here, a dichotomy of a dichotomy of a dichotomy. Within gradient-free control there is this idea that sometimes you can be off-policy or on-policy, and it's a really important distinction. What on-policy means is this: let's say I'm playing a bunch of games of chess, and I'm trying to learn an optimal policy function or an optimal value function, or both, by playing games of chess and iteratively refining my estimate of pi or V. On-policy means that I always play my best game possible: whatever I think the value function is, and whatever I think my best policy is, I always use that best policy as I play, and I always try to get the most reward out of my system in every game I play. That's what it means to be on-policy. Off-policy means, well, maybe I'll try some things; maybe I know that my policy is suboptimal, so I'll just make some random moves occasionally. That is called off-policy, because I think those moves are sub-optimal, but they might be really valuable for learning information about the system. On-policy methods include SARSA (state-action-reward-state-action), and there are all of these variants of the SARSA algorithm for on-policy reinforcement learning, where TD means temporal difference and MC means Monte Carlo. There's a whole family of gradient-free optimization techniques that use different amounts of history. I'll talk all about that; that's going to be a whole other lecture, this red box of gradient-free, model-free reinforcement learning.

15:33

And for SARSA, this on-policy set of algorithms, there is an off-policy variant called Q-learning. This quality function Q is the joint value, if you like, of being in a particular state s and taking a particular action a. The quality function contains all of the information of the optimal policy and the value function, and both of those can be derived from it. But the really important distinction is that when we learn based on the quality function, we don't need a model of what my next state is going to be: the quality function implicitly defines the value based on where you're going to go in the future. So Q-learning is a really nice way of learning when you have no model, and you can take off-policy information and learn from it. You can take a sub-optimal controller, just to see what happens, and still learn and get better policies and better value functions in the future. That's also really important if you want to do imitation learning: if I just want to watch other people play games of chess, even though I don't know what their value function or their policy is, with these off-policy learning algorithms I can accumulate that information into my estimate of the world, and every bit of information I get improves my quality function and improves the next game I'm going to play. So it's really powerful, and I would say most of what we do nowadays is in this Q-learning world; a lot of machine learning, a lot of reinforcement learning, is Q-learning.

17:18

And then the gradient-based algorithms: I'm not going to talk about them too much here, but this is essentially where you update the parameters of your policy, your value function, or your Q function directly using some kind of gradient optimization. If I can sum up all of my future rewards as a function of the current parameters theta that parameterize my policy, then I might be able to use gradient optimization, things like Newton steps and steepest descent, to get a good estimate. When I have the ability to do that, it is going to be way faster than any of these gradient-free methods, and it will even be faster than dynamic programming.
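
A compact statement of that idea, in one standard episodic policy-gradient form (this particular estimator is a common textbook form, not one derived in the lecture):

```latex
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta),
\qquad
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
```

where G_t is the discounted return collected from time t onward.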

18:01

The last piece of this is that in the last ten years we've had a massive explosion of deep reinforcement learning. A lot of this has been because of DeepMind and AlphaGo demonstrating that computers can play Atari games at human-level performance and can beat grandmasters at Go, just incredibly impressive demonstrations of reinforcement learning. These now use deep neural networks either to learn a model, so that you can then use model-based reinforcement learning, or to represent the model-free concepts: you can have a deep neural network for the quality function, you can have a deep neural network for the policy, and then differentiate with respect to those network parameters, using automatic differentiation and backpropagation, to do gradient-based optimization on your policy network. I would say that deep model predictive control doesn't exactly fit into the reinforcement learning world, but it is morally very closely related: deep model predictive control allows you to solve these hard optimal nonlinear problems, and then you can learn a policy based on what your model predictive controller actually does, essentially codifying that controller into a control policy. And finally, actor-critic methods: these existed long before deep reinforcement learning, but nowadays they have renewed interest because you can train them with deep neural networks.

19:38

So that is the mile-high view, as I see it, of the different categories of reinforcement learning. Is this comprehensive? Absolutely not. Is it one hundred percent factually correct? Definitely not. It is a rough sketch of the main divides and the things you need to think about when you're choosing a reinforcement learning algorithm. If you have a model of your system, you can use dynamic programming based on Bellman optimality. If you don't have a model, you can use either gradient-free or gradient-based methods, and then there are on-policy and off-policy variants depending on your specific needs. It turns out that SARSA methods are more conservative, and Q-learning tends to converge faster. And for all of these methods there are ways of making them more powerful, with more flexible representations, using deep neural networks in different, focused ways.

20:38

In the next few videos we'll zoom into this part here for Markov decision processes and how we do policy iteration and value iteration; we'll actually derive the quality function; we'll talk about model-free control, these gradient-free methods, on-policy and off-policy, where Q-learning is one of the most important, and temporal difference learning, which actually has a lot of neuroscience analogs (how we learn in our animal brains is thought to be very closely related to these TD learning policies); we'll talk about how you do optimal nonlinear control with the Hamilton-Jacobi-Bellman equation; and we'll talk briefly about policy gradient optimization. For all of these there are deep learning versions; we'll pepper those throughout, or maybe I'll have a whole lecture on these deep learning methods. So that's all coming up. Really excited to walk you through this. Thank you.


Related Tags
reinforcement learning, theory explained, practical introduction, deep learning, AI, dynamic optimization, learning algorithms, Markov decision process, Q-learning, SARSA