Deep Reinforcement Learning Tutorial for Python in 20 Minutes

Nicholas Renotte
29 Aug 2020 · 20:55

Summary

TLDR: This video covers the basics of reinforcement learning. Unlike typical deep learning or machine learning workflows, reinforcement learning focuses on training in a live environment. The video shows how to create an environment with OpenAI Gym, build a deep learning model with TensorFlow and Keras, and train a reinforcement learning model with Keras-RL2 using policy-based learning. Finally, it demonstrates saving the trained model and reloading it as needed so it can be deployed to production.

Takeaways

  • 🤖 Reinforcement learning is a distinct learning approach, characterized by training in a live environment.
  • 🧠 The mnemonic "AREA 51" is used to remember the core concepts of reinforcement learning: A for Action, R for Reward, E for Environment, and A for Agent.
  • 🚀 The video explains how to create an environment with OpenAI Gym, build a deep learning model with TensorFlow and Keras, and train the reinforcement learning model with KerasRL.
  • 🐍 The work is done in Python inside a Jupyter Notebook, and TensorFlow, Keras, KerasRL, and OpenAI Gym must be installed as dependencies.
  • 🎮 OpenAI Gym ships with many pre-built environments; the CartPole environment is used here to test the model.
  • 📈 In the CartPole environment, you earn a point for each step you keep the pole from falling by moving the cart, aiming for the maximum of 200 points.
  • 🔧 A deep learning model is built with Keras's Sequential API, combining Flatten and Dense layers.
  • 🤹‍♂️ A DQN (Deep Q-Network) agent is trained with KerasRL using policy-based learning.
  • 💾 The trained model's weights can be saved and reloaded later, allowing the model to be deployed to a production environment.
  • 📊 The final result shows reinforcement learning dramatically improving the CartPole score, reaching a near-perfect 200 points.

Q & A

  • What kind of learning approach is reinforcement learning?

    - Reinforcement learning is an approach in which an agent learns to choose optimal actions through interaction with an environment. It is a different approach from supervised and unsupervised learning.

  • What does 'A.R.E.A.' stand for in reinforcement learning?

    - A.R.E.A. stands for the four elements needed in any reinforcement learning model: Action, Reward, Environment, and Agent.

  • What is OpenAI Gym?

    - OpenAI Gym is a standardized toolkit of environments used for developing and testing reinforcement learning algorithms.

  • What is the CartPole environment?

    - CartPole is a basic reinforcement learning test environment: a game in which you move a cart to keep a pole from falling over. You earn one point of reward for each step the pole stays up, and the goal is to reach the maximum of 200 points.

  • How are TensorFlow and Keras used to build the reinforcement learning model?

    - TensorFlow and Keras are used to build a feed-forward neural network (a Sequential model) that decides which actions the reinforcement learning agent should take.

  • What is DQN (Deep Q-Network)?

    - DQN is a value-based reinforcement learning algorithm that uses deep learning to learn an action-value function.

  • What is the Boltzmann Q Policy?

    - The Boltzmann Q Policy is a type of reinforcement learning policy that computes action-selection probabilities from the Q-values. It is used to balance exploration and exploitation.

  • What is Sequential Memory, and why does the DQN agent need it?

    - Sequential Memory is the memory the DQN agent uses to store past experience and draw on it when choosing optimal actions. It is especially important for reinforcement learning tasks that require taking longer-term context into account.

  • How do you save the model's weights and reuse them later?

    - The DQN model's weights can be saved with the save_weights method and reloaded with the load_weights method. This lets you reuse a trained model and apply it to a new environment or task.

  • What is the Jupyter Notebook used in this video?

    - Jupyter Notebook is a web-based interactive development environment for running Python code, analysis, and visualization. It makes data science and machine learning tasks more efficient.

Outlines

00:00

😀 Overview of Reinforcement Learning

In this section, Nicholas introduces the basics of reinforcement learning. Reinforcement learning differs from deep learning and machine learning in that training happens in a live environment. He suggests remembering the four key concepts as "AREA 51": Action, Reward, Environment, and Agent. The video then lays out the process: creating an environment with OpenAI Gym, building a deep learning model with TensorFlow and Keras, and training the reinforcement learning model with KerasRL.

05:02

🔧 Building the Environment with OpenAI Gym

This section explains how to build a reinforcement learning environment with OpenAI Gym. Gym ships with a variety of pre-built environments, and the CartPole environment is chosen here. In CartPole, the goal is to move the cart so the pole does not fall over; the reward is one point per step, up to a maximum of 200 points. The environment is visualized with random steps so you can see how it behaves before the reinforcement learning model is trained.

10:03

🤖 Building the Deep Learning Model with Keras

This section walks through creating a deep learning model with Keras's Sequential API. The model takes the four states as input and outputs two actions (move left or right). It is built from a Flatten layer followed by two Dense layers using the ReLU activation function. This model is then used for reinforcement learning training.

15:03

🛠 Training the Reinforcement Learning Model with Keras RL

Here, a DQN (Deep Q-Network) agent is set up with Keras RL and the reinforcement learning model is trained. The DQN agent is trained with a policy-based approach using the Boltzmann Q Policy. Training is configured with the environment, the number of steps, whether to visualize, and the logging verbosity. Once training finishes, the model reaches close to 200 points.

20:03

🏆 Testing and Deploying the Model

Finally, the performance of the trained reinforcement learning model is tested, and the model weights are saved so they can be reused later. In testing, the DQN model scores close to 200 points in nearly every episode. The weights are saved as .h5f files and can later be reloaded and tested in the same environment. This process is useful when deploying the model to a production environment.

📚 Recap and Links to Resources

The video closes with a recap of the topics covered: installing dependencies, creating a random environment with OpenAI Gym, training a deep learning model with Keras and Keras RL, and reloading the agent from memory. Viewers are asked to like, subscribe, and hit the bell icon if the video helped, and support is offered in the comments for questions. Links to the course materials, GitHub repository, and documentation are provided in the description.


Keywords

💡Reinforcement Learning

Reinforcement learning is a branch of machine learning in which an agent learns to choose optimal actions through interaction with an environment. In this video, reinforcement learning is used to solve the CartPole problem: moving a cart so the attached pole does not fall over.

💡Deep Learning

Deep learning is a subfield of artificial intelligence that uses multi-layer neural networks to learn complex patterns. The video builds a deep learning model and combines it with reinforcement learning to solve the CartPole problem, constructing a deep neural network through Keras's Sequential API.

💡CartPole

CartPole is a sample environment commonly used in reinforcement learning tutorials, where the goal is to move a cart so the pole on top does not fall over. The video uses the CartPole environment to explain reinforcement learning concepts and applies deep learning to move the cart optimally.

💡OpenAI Gym

OpenAI Gym is a toolkit used for reinforcement learning research and development, providing a variety of environments for testing learning algorithms. The video uses OpenAI Gym to build the CartPole environment and train the reinforcement learning model.

💡KerasRL

KerasRL is a library built on the Keras framework that makes it easy to build reinforcement learning models. The video uses KerasRL to turn the deep learning model into a reinforcement learning agent and train it.

💡Policy-Based Learning

Policy-based learning is a reinforcement learning approach that directly optimizes the agent's action-selection probabilities. In the video, the Boltzmann Q Policy is used to solve the CartPole problem.

💡DQN (Deep Q-Network)

DQN is an algorithm that combines deep learning with Q-learning, which lets Q-learning scale to problems with high-dimensional state spaces and a discrete set of actions. The video uses DQN to choose actions in the CartPole environment and learn an optimal strategy.

💡Environment

In reinforcement learning, the environment is the surrounding world in which the agent takes actions and receives rewards in return. The video uses OpenAI Gym's CartPole environment to show the agent acting and collecting rewards.

💡Reward

The reward is the signal used to evaluate the effect of the agent's actions on the environment. In the video, the goal is to maximize the reward earned by keeping the CartPole from falling.

💡Action

An action is the move the agent takes in the environment. In the video's CartPole problem there are two actions: move left or move right.

Highlights

Introduces how reinforcement learning differs from traditional supervised and unsupervised learning.

The core concepts of reinforcement learning can be remembered with "AREA 51", standing for Action, Reward, Environment, and Agent.

Uses OpenAI Gym to create an environment and begin building the reinforcement learning model.

Builds a deep learning model with TensorFlow and Keras and passes it to KerasRL for reinforcement learning training.

Shows how to work in Python and Jupyter Notebook to build and train the reinforcement learning model.

Demonstrates installing the dependencies: TensorFlow, Keras, KerasRL, and OpenAI Gym.

Uses the CartPole environment from OpenAI Gym to test the reinforcement learning model.

The goal of CartPole is to balance the pole by moving the cart left and right, earning 1 point per step, up to a maximum of 200.

Shows how to drive the CartPole environment with random actions in Python and observe how it performs.

Explains how to build the deep learning model using Keras's Sequential API and Dense layers.

Shows how to train the deep learning model with KerasRL, using a DQN (Deep Q-Network) agent.

Introduces policy-based reinforcement learning with the Boltzmann Q policy.

Shows how the SequentialMemory class maintains the DQN agent's memory.

Shows how to compile and train the DQN model and watch the training progress through visualization.

Shows the trained DQN model's performance in the CartPole environment, reaching scores close to 200.

Explains how to save and reload the DQN model's weights so the model can be deployed in a production environment.

Shows how to rebuild the DQN agent, reload the weights, and test its performance in the CartPole environment.

The video ends with links to the course materials and GitHub repository to help viewers get started with their own reinforcement learning models.

Transcripts

[00:00]
Before reinforcement learning... after reinforcement learning. What's happening guys, my name is Nicholas, and in this video we're going to be going through a bit of a crash course on reinforcement learning. Now if you've ever worked with deep learning or machine learning before, you know the two key forms are supervised and unsupervised learning. Reinforcement learning is a little bit different to that, because you tend to train in a live environment. Now there's a really easy way to remember the core concepts in reinforcement learning: all you need to remember is AREA 51. You're probably thinking, what the hell does Area 51 have to do with reinforcement learning? Well, the AREA in AREA 51 stands for Action, Reward, Environment and Agent. These are the four key things you need in any reinforcement learning model, and in this video we're going to be covering all of those key concepts. Let's take a deeper look at what we're going to be going through.

[00:51]
So in this video we're going to cover everything you need to get started with reinforcement learning. We're going to start out by creating an environment using OpenAI Gym. We're then going to build a deep learning model using TensorFlow and Keras, and that same model will then be passed to keras-rl in order to train our reinforcement learning model using policy-based learning. In terms of how we're going to be doing it, we're going to be largely working within Python, and specifically inside of a Jupyter Notebook. We'll start out by building our environment using OpenAI Gym, we'll then build our deep learning model using TensorFlow and Keras, and once we've built that model we're going to train it using keras-rl. We'll then be able to take that same model, save it down, and reload it for when we want to deploy it into production. Ready to get to it? Let's do it.

[01:37]
So there's a couple of key things that we need to do in order to build our deep reinforcement learning model. Specifically, we first need to install our dependencies. Then we're going to build an environment with OpenAI Gym with just a couple of lines of code, which is going to allow us to see the environment that we're actually using reinforcement learning in later on. Then we're going to build a deep learning model with Keras (specifically using the Sequential API), and then we're going to train that Keras model using Keras reinforcement learning. Last but not least, we're going to delete it all and reload that agent from memory, which is going to allow you to deploy it into production if you want to later on. So first up, let's install our dependencies: what we're going to need here is TensorFlow, Keras, keras-rl, as well as OpenAI Gym.
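(The install cell itself isn't captured in the transcript; based on the narration that follows, it was presumably a pip cell along the lines of the sketch below. The tensorflow 2.3.0 pin is stated in the video; the other packages are installed unpinned here as an assumption.)

```python
# Sketch of the install cell described in the narration (run inside Jupyter).
!pip install tensorflow==2.3.0 gym keras keras-rl2
```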

[02:33]
So what we've done is we've installed our four key dependencies. We've used pip install, and specifically we've installed tensorflow 2.3.0, we've installed OpenAI Gym (so that's just gym), we've installed keras, and we've also installed keras-rl2. So those are all our dependencies now done and installed.

[02:51]
Now what we can actually go and do is set up a random environment with OpenAI Gym. OpenAI Gym comes with a bunch of pre-built environments that you can use to test out reinforcement learning on. So if we head on over to gym.openai.com, you can see there's a bunch of different environments: here we've got some algorithms, we've got Atari games, so if you wanted to build Atari or video-game-style reinforcement learning engines, you could. We're going to be working with the classic control ones, and specifically we're going to be using CartPole. The whole idea behind CartPole is that you want to move the cart along the bottom in order to balance the pole up the top. For each step you take you get a point, with a maximum of 200 points. So ideally, when we start off with our random steps we're not going to get anywhere near 200, but once we use deep learning and reinforcement learning we should get much closer to actually hitting our final result. Now we've got two movements: we can either go left or right, so when we create our environment we're going to have two actions available, either left or right. If you work in different reinforcement learning environments you might have a different number of actions that you can take; for example, you might go up or down, left or right, if you're working with other things.

[04:11]
So now what we're going to do is set up this environment so you can work with it within Python. If we go back to our Jupyter Notebook, let's start setting that up. The first thing that we need to do is import our dependencies: we're going to import OpenAI Gym, and we're also going to import the random library so we can take a bunch of random steps. So those are our two key dependencies imported: we've imported gym, and we've also imported random. Now what we can go and do is actually set up that environment.
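(The environment-setup cell isn't shown verbatim in the transcript; given the narration that follows, which describes gym.make for CartPole and then reading the observation and action spaces, it presumably looked something like this sketch. The exact environment id, CartPole-v0, is an assumption consistent with the 200-point cap mentioned in the video.)

```python
import gym
import random

# Build the CartPole environment and pull out the state and action sizes,
# as described in the narration.
env = gym.make('CartPole-v0')
states = env.observation_space.shape[0]   # 4 observations: cart position/velocity, pole angle/velocity
actions = env.action_space.n              # 2 actions: push the cart left or right

print(states, actions)  # expected output: 4 2
```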

[04:52]
So that's our environment set up. What we went and did there is we used the OpenAI Gym library, and specifically we used the make method to build our CartPole environment, so remember that was the CartPole environment that we saw earlier. We then extracted the states that we've got, which are available through env (the environment we just set up) dot observation_space dot shape, so we're taking a look at all the different states available within our environment. We've also extracted the actions, and if you take a look we're getting those from our action space, so we can see that we're going to have a specific number of actions. If we take a look at our states, we've basically got four states available, and if we take a look at our actions, we've got two actions, which are basically moving our cart left or right. Now what we can actually go and do is visualize what it looks like when we're taking random steps within our CartPole environment. Ideally what we'll see is that our cart is just sort of moving randomly, because we're taking random steps in order to get a specific score. Remember, with each step that we take where our pole hasn't fully fallen over, we're going to get one point, with a maximum of 200 points. So let's build our random environment.
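(The random-episode loop written at this point isn't included in the transcript; a plausible sketch consistent with the breakdown in the next paragraph, rendering, picking a random 0/1 action, stepping the environment, accumulating the score and resetting per episode, is below. The episode count of 10 is an assumption.)

```python
# Sketch of the random baseline loop described in the next paragraph.
episodes = 10
for episode in range(1, episodes + 1):
    state = env.reset()          # reset the environment at the start of each episode
    done = False
    score = 0
    while not done:
        env.render()                                   # visualize the cart and pole
        action = random.choice([0, 1])                 # pick left (0) or right (1) at random
        state, reward, done, info = env.step(action)   # apply the action to the environment
        score += reward                                # one point per step the pole stays up
    print('Episode:{} Score:{}'.format(episode, score))
```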

[06:26]
All right, so we've written a bit of code there. Now what we're actually going to do is start by breaking this down. The first thing we're doing is rendering our environment, which is going to allow us to see our cart in action when it's moving left and right. Then we're taking a random step, so we're either going left or right; zero or one basically represents one of those steps, and we're just taking a random choice to see how that impacts our environment. Then we're actually applying that action to our environment, and we're getting a bunch of stuff as a result of that: we're getting our state, we're getting our reward, we're getting whether or not we've completed the game (so whether we've failed or whether we've passed), and we're also getting a bunch of information. Then, based on our step, we're going to get a reward. Remember, if we take a step in the correct direction and we haven't failed, we get one point, and this basically allows us to accumulate our entire reward. Now, if we fail or if we get to the end of the game, then done is going to be set to true, so what we're doing is continuously taking steps until we're complete. We reset the entire environment up the top, and then we're also printing out our final reward, so ideally what we'll get is the episode number as well as our score. So let's go ahead and run that and see our episodes live and in action. Actually, it looks like we've got a bug there: episode.

[07:48]
All right, so you can see our cart is moving, and it's moving randomly, and you can see that our pole is sort of flailing about. What we're actually logging out is the score each time, and it looks like we're surpassing a specific threshold and then failing, so we're only getting up to a maximum of about 38; that's our maximum score. Ideally what we want to be able to get is all the way up to 200, and this is where reinforcement learning comes in: basically, our deep learning model is going to learn the best action to take in that specific environment in order to maximize our score. Now this all starts with a deep learning model, so let's go ahead and start creating one. In order to do that we first need to import some dependencies, and these are largely going to be our TensorFlow Keras dependencies, so let's go ahead and import those.
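(The import cell isn't reproduced in the transcript; based on the dependencies listed in the next paragraph, numpy, the Sequential API, the Dense and Flatten layers, and the Adam optimizer, it was presumably something like:)

```python
# Sketch of the TensorFlow/Keras imports described in the narration.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
```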

[08:45]
So we've imported our dependencies. We've first up imported numpy, which is going to allow us to work with numpy arrays. Then we've imported the Sequential API, which is going to allow us to build a sequential model with Keras. We've also imported two different types of layers, specifically our Dense layer as well as our Flatten layer. And last but not least we've imported the Adam optimizer, which is going to be the optimizer we use to train our deep learning model. Now what we can go and do is actually build that model. We're going to build this wrapped inside of a function so we can reproduce the model whenever we need to.
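(The build_model function typed at this point isn't shown in the transcript; below is a sketch that matches the description that follows: a Flatten layer over the states, two Dense layers with relu, and a final Dense layer sized to the actions. The layer width of 24 comes from the model summary mentioned later; the linear output activation is an assumption.)

```python
def build_model(states, actions):
    # Flatten the (1, states) observation window, then map it through two
    # fully connected layers to one output per action.
    model = Sequential()
    model.add(Flatten(input_shape=(1, states)))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

model = build_model(states, actions)
model.summary()
```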

[09:39]
So that's our build model function defined. What we've basically done is created a new function called build_model, and to that we pass two arguments: our states, which were the states we extracted from our environment up above, and our actions, which are the two different actions we've got in our CartPole environment. In order to build our deep learning model, we first instantiate our Sequential model, then we pass through a Flatten layer which takes our different states (remember the four different states that we had). Then we add two Dense layers to start building out our deep learning model, with a relu activation function, and last but not least our final Dense layer has our actions. This basically means that we pass our states in at the top and we pass our actions out at the bottom, so ideally we should be able to train our model on the states coming through to determine the best actions to maximize our reward, or our score.

[10:40]
So let's go ahead and create an instance of that model just by using that build_model function, and we can also visualize what the model looks like using the model.summary function. You can see here that we're passing through our four different states, we've got 24 dense nodes and another 24 dense nodes (these are our fully connected layers within our neural network), and then last but not least we're passing out the two different actions that we want to take within our environment. Now what we can go and do is take this deep learning model and actually train it using keras-rl, so first up we need to import our keras-rl dependencies. Let's go ahead and do that.
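(The keras-rl import cell isn't shown; given the three imports the narration names, the DQN agent, the Boltzmann Q policy, and sequential memory, it presumably reads:)

```python
# Sketch of the keras-rl2 imports described in the narration.
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory
```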

[11:26]
So those are our dependencies imported. We've imported three key things here. We've imported our Deep Q-Network agent, and basically there's a bunch of different agents within the keras-rl environment: you can see we've got a DQN agent, a NAF agent, DDPG, SARSA, CEM, and all of these are different agents that you can use to train your reinforcement learning model. We're going to be using DQN for this particular video, but try testing out some of the others and see how you go. Now, what we also have is a specific policy. Within reinforcement learning you've got different styles: you've got value-based reinforcement learning and you've also got policy-based reinforcement learning. In this case we're going to be using policy-based reinforcement learning, and the specific policy that we're going to be using is the Boltzmann Q policy, which you can see here. The last thing that we've gone and imported is sequential memory: for our DQN agent we're going to need to maintain some memory, and the SequentialMemory class is what allows us to do that. So now what we can go and do is set up our agent, and again we're going to wrap this inside of a function so we can reproduce it when we want to reload it from memory. Let's go ahead and build that function.
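(The build_agent function isn't captured in the transcript; below is a sketch consistent with the walkthrough that follows, setting up the policy, the memory, and a DQNAgent that takes the model, memory, policy and a few other keyword arguments. The specific memory limit, warm-up steps and target-model update rate are assumptions.)

```python
def build_agent(model, actions):
    # Policy-based action selection plus a replay buffer for the DQN agent.
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, memory=memory, policy=policy,
                   nb_actions=actions, nb_steps_warmup=10,
                   target_model_update=1e-2)
    return dqn
```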

[12:59]
So that's our function defined. What we've basically done is named our function build_agent, and to that we pass our model (the deep learning model we specified up above), and we also pass through the different actions that we can take, which were the two actions, left or right, available within our environment. Then we set up our policy, we set up our memory, and we set up our DQN agent, and to that DQN agent we pass our deep learning model, our memory, our policy, as well as a number of other keyword arguments. Then what we do is return that DQN agent. So let's go ahead and actually use this DQN agent to train our reinforcement learning model. First up we want to start out by instantiating our DQN model, then we're going to compile it, and then we're going to go ahead and fit.
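(The compile-and-fit cell isn't included in the transcript; based on the narration, the Adam optimizer, mean absolute error as the tracked metric, and a fit call on the environment with a step count, no visualization and verbose=1, it was presumably along these lines. The 50,000-step count and the 1e-3 learning rate are assumptions.)

```python
dqn = build_agent(model, actions)
# Compile with the Adam optimizer and track mean absolute error.
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
# Train against the live environment; verbose=1 gives interval-level logging.
dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)
```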

[14:04]
All right, and there you go, so you can see that our DQN model is now starting to train. What we actually did is we used our build_agent function to set up a new DQN model, and we passed through our model as well as our actions. We then compiled it, and we passed through our optimizer (this was the Adam optimizer that we imported right at the start), and we also passed through the metrics that we want to track, in this case mean absolute error. Then we used the fit function to kick off the training, and to that we passed our entire environment, the number of steps we want to take, whether or not we want to visualize it (we'll take a look at that in a second), and we also specified verbose as one, so we don't get full logging, just a little bit of logging. Now what we can do is just let that go ahead and train; it'll take a couple of minutes, and then we should have a fully built reinforcement learning model.

[14:55]
Five minutes later... sweet, so that's our reinforcement learning model now done, dusted and trained. All up it took about 256 seconds to train, and you can see in our fourth interval that we're accumulating a reward of about 200. Now what we can go and do is actually print out and see what our total scores were. Remember, when we started out up the top just taking random steps, we were getting a maximum score of about 51, but that's not all that great considering that the total maximum score for the game is 200. So let's go and test this out and see how it's actually performing; we can do that using the dqn.test method, so let's try that out.
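(The test cell isn't shown in the transcript; given the description that follows, 100 episodes, no visualization, then printing the mean episode reward, it presumably looked something like:)

```python
# Sketch of the evaluation described in the next paragraph.
scores = dqn.test(env, nb_episodes=100, visualize=False)
print(np.mean(scores.history['episode_reward']))  # mean score across the 100 episodes
```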

[15:48]
All right, so that's looking better already. You can see that in virtually every single episode we're getting a score of about 200, and our mean is 200. What we did there in order to test that out is we accessed our DQN model and we used the test method. To that we passed our actual environment, the number of games that we want to run (in this case they're called episodes, so we ran 100 games), and whether or not we want to visualize it. Then we output our mean result. Now, if we wanted to actually visualize what the difference is, we can do that as well.
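(The visualization run isn't shown either; it presumably just means re-running test with rendering turned on, something like:)

```python
# Render a few episodes to watch the trained agent balance the pole.
_ = dqn.test(env, nb_episodes=15, visualize=True)
```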

[16:26]
And you can see our model is performing way better: it's actually able to balance the pole a whole lot better than before, when it was just randomly sort of flailing about. We can test that out again, so this time rather than doing five episodes, say we want 15, for example. You can see that our model again is performing way better than it was initially; it's actually able to right itself, re-balance, and make sure that that pole stays straight. Brings a tear to my eye, so good.

[17:07]
Sweet, so that's all done. Now, what happens if we actually wanted to save this model away and use it later on, say for example we wanted to deploy it into production? Well, what we can do is save the weights from our DQN model and then reload them later on to test them out. We can do that using the save_weights method from our DQN model, so let's go ahead and save our weights. Then what we'll do is we'll blast away all of the stuff that we just created, and we'll rebuild it by reloading our weights.
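(The save cell isn't reproduced in the transcript; it presumably calls save_weights with an .h5f path, along the lines of the following, where the filename is an assumption:)

```python
# Save the trained agent's weights so they can be reloaded later.
dqn.save_weights('dqn_weights.h5f', overwrite=True)
```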

[17:42]
So we've now gone and saved our weights. If we take a look in our folder, you can see that we've generated two different h5f files, and these basically allow us to save our reinforcement learning model weights. Now, if we wanted to rebuild our agent, first up let's start by deleting our model, deleting our environment, and deleting our DQN agent, and then what we can do is rebuild it using all the functions that we had and reload those weights to test it out. If we go and do that, you can see that if we try to use our dqn.test method, there's nothing there, because we've gone and deleted it. But what we can do is go and rebuild that environment and test it out, so let's go and do that.
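(The teardown-and-rebuild cells aren't captured in the transcript; a sketch matching the recap that follows, deleting everything, rebuilding the environment, re-extracting states and actions, rebuilding the model and agent with the same helper functions, and recompiling, is:)

```python
# Blast away the objects we created, then rebuild them from scratch.
del model, dqn, env

env = gym.make('CartPole-v0')
states = env.observation_space.shape[0]
actions = env.action_space.n

model = build_model(states, actions)
dqn = build_agent(model, actions)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])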

[18:49]
Perfect, so we've now gone and reinstantiated all of our models. First up we built our environment, and we extracted our actions and our states just like we did before. Then we used our build_model and build_agent functions to rebuild our deep learning model and reinstantiate our DQN agent, and last but not least we compiled it. Now what we can do is reload our weights into our model and then test it out again. In order to do that we can use the dqn.load_weights method: before, up the top, we used save_weights, and now we can load our weights in order to re-test this out. The file that we're going to pass to our load_weights method is the one that we exported earlier, so we can copy that in and paste it here. Now that we've reloaded our weights, we can actually go and test out our environment again, and ideally we should get similar results. And again you can see it's performing well; it's performing just as well as it did before we deleted our weights and then went and reloaded them.
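(The reload-and-retest cells aren't shown; presumably they mirror the earlier save and test calls, something like the following, with the filename assumed to match the one saved above:)

```python
# Reload the saved weights into the rebuilt agent and check it still scores around 200.
dqn.load_weights('dqn_weights.h5f')
_ = dqn.test(env, nb_episodes=5, visualize=True)
```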

[19:50]
And that about wraps up this video. We covered a bunch of stuff: specifically, we installed our dependencies, we then created a random environment using OpenAI Gym (and got a maximum score of about 51), we then built a deep learning model using Keras and used keras-rl to train it using policy-based reinforcement learning, and last but not least we reloaded that agent from memory, which allows you to work with this inside of a production environment if you want to go and deploy it. And that about wraps it up. Thanks so much for tuning in guys, hopefully you found this video useful. If you did, be sure to give it a thumbs up, hit subscribe, and tick that bell so you get notified when I release future videos. If you have any questions or need any help, be sure to drop a mention in the comments below and I'll get right back to you. All the course materials, including the GitHub repository as well as links to documentation, are available in the description below, so you can get a kickstart and get up and running with your reinforcement learning model. Thanks again for tuning in, peace.


Related Tags
Reinforcement Learning, Deep Learning, CartPole, OpenAI, Keras, TensorFlow, Model Training, Policy-Based, Environment Setup, Hands-On Guide, AI Models