Training an AI to Conquer Double Dragon: Reinforcement Learning Demo
Summary
TLDR This video walks through training an AI with reinforcement learning (RL) to beat the 1988 arcade game Double Dragon. The model is trained inside Red Hat OpenShift, running more than 100 game instances for optimization. The AI is trained with Python libraries such as Stable Baselines3, and the Game Boy emulator PyBoy is run in headless mode. During training, the reward system is tuned, applying rewards and penalties to find optimal behavior. Training results are visualized in TensorBoard to evaluate the model's performance. The same technique also has practical applications such as self-driving cars and robotics.
Takeaways
- 🤖 This video shows an AI clearing the 1988 arcade game Double Dragon using a model trained inside Red Hat OpenShift.
- 🎮 The AI learns the game with reinforcement learning (RL), a machine learning technique, trying a variety of random actions and optimizing on them.
- 📚 Training uses the Stable Baselines3 machine learning library and the PPO training algorithm.
- 🕹️ Multiple instances of the Game Boy emulator PyBoy run concurrently and are trained to beat the game.
- 🔍 In-game actions are tuned by adjusting parameters such as the action frequency and the number of frames per action.
- 🔄 Checkpoints are used to track training progress, and the model can be rolled back to one if needed.
- 🛠️ Model parameters can be adjusted during training to get better results.
- 📈 Results are visualized with TensorBoard to evaluate how effective training is.
- 🎨 The AI discovers optimal attack patterns against the game's NPCs and learns how to progress through the level.
- 🔧 The model learns through a reward system, earning rewards for in-game score, position, and progress to new frames.
- 🚀 Through training, the AI discovered the optimal attack and succeeded in clearing the game's first level.
Q & A
What kind of AI is shown in the video?
-The video shows an AI that plays the 1988 arcade game Double Dragon and beats it using a model trained inside Red Hat OpenShift.
How does the AI shown in the video beat the game?
-The AI uses reinforcement learning (RL), a machine learning technique in which patterns and optimizations are found through repetitive tasks, to learn how to beat the game.
Which reinforcement learning algorithm is used in the video?
-The reinforcement learning algorithm used in the video is PPO (Proximal Policy Optimization).
Which libraries are used to train the AI in the video?
-The libraries used are Stable Baselines3 and gymnasium. Stable Baselines3 is the machine learning library used to train the AI.
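As a rough illustration of how these two libraries fit together (not the video's exact code), a PPO model from Stable Baselines3 is typically trained against a gymnasium environment; the environment below is a stand-in, since the video wraps the emulator in its own custom environment class.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Stand-in environment; the video wraps PyBoy in a custom gymnasium.Env instead.
env = gym.make("CartPole-v1")

# PPO is the training algorithm; Stable Baselines3 supplies the implementation.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
model.save("ppo_demo_checkpoint")  # hypothetical file name
```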
Which emulator does the AI use in the video?
-The AI uses the Game Boy emulator PyBoy, which provides the environment for playing the game.
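A minimal sketch of starting PyBoy without a visible window, assuming the pre-2.0 PyBoy API (`window_type="headless"`); newer releases rename some of these options, and the ROM path here is hypothetical.

```python
from pyboy import PyBoy

# Hypothetical ROM path; headless mode skips audio/video output to save memory.
pyboy = PyBoy("double_dragon.gb", window_type="headless")
pyboy.set_emulation_speed(0)  # 0 = run as fast as the host CPU allows

# Advance the emulator one frame at a time.
for _ in range(60):
    pyboy.tick()

pyboy.stop()
```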
How many CPUs are used for training in the video, and why?
-Ten CPUs are used for training. An instance of the Game Boy emulator runs on each CPU, which is how the setup scales.
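One common way to get one environment per CPU is Stable Baselines3's `SubprocVecEnv`; this sketch uses a stand-in environment since the actual emulator wrapper isn't shown in full here.

```python
import gymnasium as gym
from stable_baselines3.common.vec_env import SubprocVecEnv

NUM_CPUS = 10  # one emulator/environment instance per CPU

def make_env(rank: int):
    def _init():
        # Stand-in; the video constructs a PyBoy-backed environment here.
        return gym.make("CartPole-v1")
    return _init

if __name__ == "__main__":
    # Each environment runs in its own subprocess, so steps happen in parallel.
    vec_env = SubprocVecEnv([make_env(i) for i in range(NUM_CPUS)])
    obs = vec_env.reset()
    print(obs.shape)  # (10, observation_size)
```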
What is one important parameter described for training the AI in the video?
-One important parameter described is gamma. It controls how much variety the model tries during training and helps keep it from fixating on one particular behavior.
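In Stable Baselines3, `gamma` is the discount factor passed to the PPO constructor; the values below are purely illustrative, and `ent_coef` (the entropy bonus) is another knob that affects how much variety the policy keeps.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # stand-in environment

model = PPO(
    "MlpPolicy",
    env,
    gamma=0.98,      # discount factor: how strongly future rewards are valued
    ent_coef=0.01,   # entropy bonus: encourages trying varied actions
    device="cpu",    # CPU is enough here; a GPU gives only marginal gains
    verbose=1,
)
model.learn(total_timesteps=50_000)
```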
What was one of the key rewards that helped the AI beat the game?
-One of the key rewards was position (POS). Granting a reward for each new frame encourages the model to explore the game.
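The idea of rewarding every newly reached position can be sketched as a small tracker; the coordinates and bonus value are hypothetical, and real values would come from the game's memory.

```python
class ExplorationTracker:
    """Grant a bonus the first time the player reaches each screen position."""

    def __init__(self, bonus: float = 1.0):
        self.visited: set[tuple[int, int]] = set()
        self.bonus = bonus

    def reward_for(self, x: int, y: int) -> float:
        if (x, y) not in self.visited:
            self.visited.add((x, y))
            return self.bonus  # new frame/position seen -> reward exploration
        return 0.0


tracker = ExplorationTracker()
print(tracker.reward_for(12, 3))  # 1.0 -> first visit
print(tracker.reward_for(12, 3))  # 0.0 -> already explored
```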
Which action was removed from the game for the AI, and why?
-The kick (button B) was removed. In this version of Double Dragon the kick is not an optimal solution, and removing it improved the playthrough by over 30%.
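Conceptually, removing the kick just means leaving the B button out of the discrete action list the model can sample from; the button names below are placeholders rather than the video's exact encoding.

```python
import gymnasium as gym

# Placeholder action names; B (kick) is deliberately left out of the list.
ACTIONS = [
    "noop",
    "left",
    "right",
    "up",
    "down",
    "button_a",   # punch / jump
    # "button_b", # kick -- removed: not an optimal move in this game version
]

action_space = gym.spaces.Discrete(len(ACTIONS))
print(action_space)  # Discrete(6)
```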
What is the TensorBoard used in the video for training?
-TensorBoard is a tool for visualizing machine learning model metrics. It displays the various metrics collected during training as graphs that can be analyzed.
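With Stable Baselines3, TensorBoard logging is usually enabled by pointing the model at a log directory and then launching TensorBoard against it; the path below is illustrative.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # stand-in environment
model = PPO("MlpPolicy", env, tensorboard_log="./tb_logs/", verbose=1)
model.learn(total_timesteps=10_000)

# Then, from a terminal or notebook cell:
#   tensorboard --logdir ./tb_logs/
```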
What is the Dockerfile used in the video for training?
-The Dockerfile is a script that extends the OpenShift AI Jupyter data science image and adds the packages needed to run the game.
What was the optimal tactic the AI used to beat the game?
-The AI noticed that punching backwards (an elbow strike) was the best way to progress through the level. This is the same tactic speedrunners use to clear the level quickly.
What applications of reinforcement learning are mentioned in the video?
-Applications mentioned include driverless cars and robotics. This field of machine learning also relates to generative AI and large language models.
Who from the Pokemon RL community is thanked in the video, and what was their contribution?
-Peter from the Pokemon RL community is thanked. On his GitHub page he shares an example of training a model with the PPO algorithm to beat Pokemon Red.
Outlines
🎮 AI beats the arcade game Double Dragon
This video shows an AI trained inside Red Hat OpenShift beating the popular 1988 arcade game Double Dragon. The AI plays more than 100 Double Dragon instances inside OpenShift, trying a variety of random actions and learning the optimal strategy for beating the game. The model is trained with reinforcement learning (RL), a machine learning technique that finds patterns and optimizations through repetitive tasks. The video explains the parameters and configurations used for training and the Python packages involved (Stable Baselines3, the PPO algorithm, the gymnasium library, and so on), and shows how the Game Boy emulator PyBoy is run in headless mode to keep memory usage down while training the model.
🔁 Details of training the AI with reinforcement learning
The video walks through the process of training an AI model with reinforcement learning in detail. The model learns from in-game rewards and is optimized by rewarding the behavior that is expected of it. When starting a training session, you can either resume from a checkpoint or start a fresh session. The parameters used during training (such as gamma) and running on CPUs are also covered, along with how metrics such as episode length and in-game deaths are logged to TensorBoard so training progress can be monitored.
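A hedged sketch of the checkpoint-and-resume workflow described above, using Stable Baselines3's `CheckpointCallback`; the save frequency, paths, and names are illustrative, not the video's actual values.

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback

env = gym.make("CartPole-v1")  # stand-in environment

# Save the model periodically so a bad change can be rolled back later.
checkpoint_callback = CheckpointCallback(
    save_freq=10_000,
    save_path="./checkpoints/",
    name_prefix="double_dragon_ppo",
)

model = PPO("MlpPolicy", env, device="cpu", verbose=1)
model.learn(total_timesteps=100_000, callback=checkpoint_callback)

# Resuming a later session from a saved checkpoint:
# model = PPO.load("./checkpoints/double_dragon_ppo_100000_steps", env=env)
```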
🤖 The AI's gameplay and tuning the reward system
This section shows the AI playing the game and how adjusting the reward system speeds up its learning. The AI can be seen discovering optimal patterns for attacking the game's NPCs. When the AI struggles against a boss, the reward system has to be changed so that more playthroughs reach that point and specific attack moves are optimized. An image of how the AI perceives each frame of the game is also shown, along with the reason the model uses three frames: so it can take its previous moves into account.
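The three-frame memory can be approximated with Stable Baselines3's `VecFrameStack` wrapper, shown here on a stand-in environment; the video builds its own three-frame observation, so this only illustrates the general mechanism.

```python
import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])  # stand-in environment
stacked = VecFrameStack(venv, n_stack=3)  # each observation now holds 3 frames

obs = stacked.reset()
print(obs.shape)  # last dimension is 3x the single-frame observation size
```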
🏆 Revising the strategy for the AI to beat the game
The video explains in detail the process of adjusting and optimizing the AI's strategy for beating the game. The AI learns from five key reward areas: score, position, reaching a new level, a reward after the boss fight, and a penalty for losing lives. It also shows how the AI tries in-game actions randomly, forms patterns, and finds the optimal strategy, and how the training results are evaluated to find more efficient strategies, for example by tuning the reward system and removing unneeded actions.
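A sketch of how the five reward signals might be combined into a single per-step reward; the weights, field names, and `GameState` structure are hypothetical illustrations, not values taken from the video.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    score: int
    new_frames_seen: int   # running count of never-before-seen positions
    level: int
    boss_defeated: bool
    lives_lost: int


def compute_reward(prev: GameState, cur: GameState) -> float:
    reward = 0.0
    reward += 0.01 * (cur.score - prev.score)                     # score gains
    reward += 1.0 * (cur.new_frames_seen - prev.new_frames_seen)  # exploration
    if cur.level > prev.level:
        reward += 50.0                                            # reached a new level
    if cur.boss_defeated and not prev.boss_defeated:
        reward += 100.0                                           # beat the boss
    reward -= 25.0 * (cur.lives_lost - prev.lives_lost)           # losing a life
    return reward
```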
🚀 Applications of reinforcement learning and thanks to the community
The video closes by covering how reinforcement learning is used in practical applications such as self-driving cars and robotics. It also thanks Peter from the Pokemon RL community, whose GitHub page publishes an example of training a PPO model to beat Pokemon Red. A Discord community where people discuss not only Pokemon but a variety of other games such as Super Mario Bros. is also mentioned.
Keywords
💡Reinforcement Learning (RL)
💡OpenShift
💡Model
💡Reward
💡Game Boy Emulator
💡Action Frequency
💡Max Steps
💡Checkpoint
💡Hyperparameter
💡Neural Network
Highlights
AI beats the popular 1988 arcade game Double Dragon, with the model trained in Red Hat OpenShift.
Reinforcement learning (RL) is used to train the AI, learning different patterns and optimizations through repeated tasks.
Shows the AI running more than 100 Double Dragon game instances at the same time in OpenShift.
The Stable Baselines3 library and the PPO training algorithm are used to train the AI model.
Covers the configuration and optimization settings of the Game Boy emulator PyBoy.
Sets the number of CPUs to run multiple emulator instances in parallel.
Uses a callback to set checkpoints so training results can be verified.
Discusses how the gamma parameter affects the variety in model training.
CPUs are used for model training; a GPU offers only a marginal improvement for this model.
Uses TensorBoard to record and visualize training metrics.
Optimizations during training, such as improving efficiency by removing suboptimal actions.
The reward system the AI learns from, including score, position, new levels, and a penalty for losing lives.
How the recognition of game frames is optimized by reducing resolution and changing the pixel structure.
Through training, the AI learned how to use ladders and jump effectively.
The AI eventually learned how to beat the first boss and move on to the next level.
Practical applications of reinforcement learning in areas such as self-driving cars and robotics.
Thanks to Peter from the Pokemon RL community for his GitHub page and related resources.
Transcripts
welcome in this video we'll be showing
an AI beating the popular
1988 arcade game double dragon with a
model trained within Red Hat OpenShift AI
[Music]
here we have over 100 instances of
Double Dragon being played within
OpenShift AI
it's running through a variety of
different random
actions and then optimizing on those
actions to learn to beat the game let's
dive into how we made this possible
we're going to use a form of machine
learning called reinforcement learning
or
RL this is a form of machine learning
where we take repetitive tasks over and
over again and find different patterns
and
optimizations through training the model
this could be through different random
events or in our case we'll have
different rewards where we tell the
model when it's performing the way that
we are expecting and ways that we don't
want it to function let me show you some
examples of
this let's start within our notebook and
show a few ways that this AI model was
trained and some of the different
parameters and configurations that went
into the training
here we have a variety of different
packages that are being pulled into
python to train this model most
importantly is Stable
Baselines3 this is the machine learning
library that we're using to train the AI
to ultimately beat the Double Dragon
game through this we're bringing in a
popular training algorithm called PPO
we also use the popular gymnasium
Library which I'll show you more here in
another
script now let's show the initiation of
the Game Boy environments that will play
concurrently and attempt to beat the
game we have a few helper functions here
but the primary function we're most
interested in is this make
environment this will create a variety
of different instances of the Game Boy
emulator PyBoy
we'll also configure it in this case in
what we call a headless instance this
way I don't need to hear the sound or
see the visuals this keeps the memory down
and allows us to have more instances for
training the
model we have a variety of different
configurations here that we pass in to
the Game Boy
emulator the action frequency here lets
the Game Boy emulator know how many
frames should it have before each action
is taken this is critical as sometimes
you want actions to be a little bit
longer and if certain actions are done
too quickly it may not even be
registered by the game in this case I
found that eight frames is the sweet
spot between the length of the actions
and how frequently we want actions to be
taken to speed up the process I have
loaded in a save file that ignores the
first cut scene this allows the emulator
to jump right in to where the action
starts and helps speed up the
process we also have here Max steps this
correlates to the amount of actions that
are
taken at the point of the maximum steps
the emulator will stop running to
prevent an infinite loop from
happening I have a variety of other
configurations but those are probably
the most important to go
over we now set the amount of CPUs that
will be used for this particular run in
this case I have it set to
10 for every CPU that's set we'll have
an instance of the Game Boy emulator
running on that particular
CPU in this example I just have one
workbench that'll be running this in a
vertical fashion for scaling but there
are ways that we could configure this
with tools like KubeRay or CodeFlare
where we could distribute this across
many different nodes within OpenShift
AI here we'll go ahead and loop through
and create the necessary environments we
need and then we'll set up a callback
this is a way that we can set
checkpoints where we can verify if the
results are what we're expecting this is
a good way to keep certain records of
instances through their training there
may be a point that you make a change
that you don't like you then need to be
able to roll back to a particular
checkpoint this is a good way of showing
how that's possible within OpenShift
AI let's go ahead and start these
instances we'll come up here and play
this cell
all right our 10 instances are now ready
and let's go ahead and start the model
training I have two options here I can
start from a checkpoint if I provide a
file name or we can start a fresh
training session here for this video
I'll be showing a fresh training
run you'll see here that I don't have
very many parameters when starting a
fresh training session I typically like
to take my checkpoints after a few
iterations of training and then try to
apply different hyperparameters to my
model let's talk about a couple of
these here I have the gamma gamma is a
way to let the model training know how
much variety I want to take let's say
that the model has found a way to go
through uh a particular door within the
game but let's say maybe that was the
incorrect door maybe ultimately we
needed to use the right door instead of
the left
door if we're not training the model
appropriately and setting the right
gamma it may get fixated on that one
point in this case this would be the
overtraining of that particular run by
allowing the gamma to be a little bit
lower we can have some variety where the
model will try a few different
variations throughout the training
course ultimately this can lead to
higher run times and more iterations but
the quality of the output will likely be
better eventually you may like what
you're seeing in the output of the model
in which case you may bring up this
gamma a little bit more to make the
variations less to try to refine the
process that you currently have you'll
also see here that we're using the CPU
in this case for this type of model you
only get a marginal Improvement by using
a GPU so to keep cost down in my case
I'm just going to leverage the CPU that
I have but there may be some longer
models or different types of machine
learning training models where a GPU
will be more beneficial so if I wanted I
could come here and specify that I want
to use a
GPU I'm also logging out all of our
metrics into a TensorBoard log file
let's go ahead and start this training
run you'll see here that the training
has started and we specify that we're
using the CPU device in this
case it's going to take a few moments to
start getting some results
back at most we'll have
the episode length here for the maximum
amount of steps that are going to be
taken within the
game you may see some earlier instances
though of results and and this will be
the case where the playing character has
died three times and has lost the
game let's go ahead and take a closer
look at a training
run this is a video of me going from the
10 instances up to over 100 we have a
few things that we'll take note on first
off you'll see that most of the frames
have certain trends that are happening
majority of the time we'll see maybe 80
or 90% of the instances following a very
similar path but let's go ahead and look
at some
outliers first we see here that the
playing character is eventually getting
stuck here this is preventing it from
moving further on in the
game we need to find ways to penalize
the model in this case and say getting
stuck is a bad scenario we can do this
in a few ways and I'll show some of
those here
shortly now let's move to another
section where the playing character is
actually progressing very quickly
through we'll see some Trends forming
often times the character will punch
backwards it has found the optimal
solution of beating the different NPCs
in the
game I shared this video with a
speedrunner recently and he thought this
was very interesting as often times the
speedrunners will do the exact same
thing to get through the level quickly
the AI in this case has found the
optimal path through the first level of
the
game we'll see in this particular frame
that the character has now gotten to the
first
boss in this case it's going to lose
because it hasn't found the optimal pattern
to beating the boss in this case we'll
need to change the reward system
slightly first off ideally we want more
of these instances getting to the final
boss so that it has more playthroughs at
that point in the level we can also
change the rewards on how it optimizes
certain moves like the back punch to
beat this particular
boss let's dive into that more before
going back to our Jupyter notebook I
want to show this particular
image it doesn't look like much is going
on here it's kind of a blurry image and
there's not much that we can make out
with our human eyes at least but this is
actually what the model sees on each
frame of the
game we have reduced the
resolution and changed some of the pixel
structure to optimize the training
within the
model A lot of times high resolution
images are actually a detriment they
take too much information it's too
complex for the model to figure
out this is a process where we've
normalized the data into a way that's
optimal for this particular algorithm
and the training of this
model you will also see that there are
three images here this helps the model
have some context of its previous moves
the top frame is what's happening now
and then the previous two on the bottom
are going to be frames from what had
happened just before this allows the
model some flexibility especially when
jumping over an object in the game or
having to use a ladder or interact with
an NPC
this is a way that we can add memory to
the
model now let's go ahead and look more
at the reward system I provide rewards
for five critical areas in the game we
have score so this is what happens when
our playing character either punches or
kicks an NPC in the game there's a
certain amount of score that's assigned
for kicks and punches and then as the
game progresses those will increase in
magnitude we then have position or in
this case I have it labeled as
POS this turns out to actually be the
most critical of all of the reward
metrics that we're tracking in the
game by having reward for each new frame
that we have in the game it encourages
our model to explore it turns out in
these types of reinforcement learning
models exploration is the most critical
aspect sometimes you can even ignore
things like score all together every
time that the playing character goes
into a new level such as the boss or after
it's beaten the boss I also reward it on
top of the positioning this makes it so
that the AI model knows how I want it to
progress through the level and not get
stuck in bizarre
scenarios lastly I penalize the AI for
losing lives this makes sure that it
tries to conserve its Health throughout
the level and finds optimal ways of
beating the
NPCs another way I've optimized the game
is by removing some unneeded
actions you'll see here a list of
the actions that are available to the
model these are the actions that it
starts off doing randomly and then
eventually forms patterns on as it
progresses what's most important here is
that you'll see that I have removed this
button B in this case that's the
kick in this version of Double Dragon
the kick is not an optimal solution for
getting through the level by removing
this particular line I optimized the
playthrough by over
30% now let's go back and look at our
model training you'll see here we're
starting to get some results back let me
unpack this a little
bit here we have a variety of different
instances and the
score score is probably not the most
important metric here but it gives me an
idea of how the playing character is
interacting in the game what's most
important to me though is the final
level that it got to during this
playthrough you'll see here that we're
starting at
one but then here we actually have a
level
three this is where it takes on the
first boss we showed an example of this
in the previous View where we had many
different iterations of the game
playing this is where I've had to tweak
the reward systems as I mentioned to get
more of these getting to level
three now that we've done a few
run-throughs of the game let's go ahead
and look at some metrics we can use the
popular TensorBoard to visually see
these metrics within our Jupyter
notebook we'll go ahead and start a
session if you come in here into our
scaler section we can see some more
graphs we'll see things like
FPS the
KL divergence different clip
fractions the entropy
loss some variance
functions and the different loss
functions here from the metrics that are
collected this is a good way for people
who are doing model training to see how
effective their model is in their
different training
Cycles I want to quickly show the Dockerfile
that we use to make this workbench
possible here I take our OpenShift AI
Jupyter data science image and then I
go ahead and add some things to it let
me show you the requirements
file most importantly here I'm adding in
the
PyBoy package this is the emulator we use
to run through the different
iterations of the game but this is an
example of how we can take a base image
within OpenShift AI and add the
necessary packages we need to run our
experiments and our model
training after making some adjustments
specifically removing the kick and also
increasing our positional rewards and
decreasing our scoring rewards I was
eventually able to beat the first level
of this game as mentioned before
eventually the AI learned that punching
from behind so using its elbow was the
most effective way to go through the
level it goes through very quickly by
finding a corner and then letting NPCs
come to it with his elbow move
eventually he'll make his way into the
boss battle where he'll do this very
same move but make sure to position
himself in the corner
and you can see here he finally beats
the boss and moves on to the next
mission now here I've had to train it in
a variety of ways to use these ladders
this is where vertical positioning was
critical and making sure that it knew
how to jump effectively between
platforms that's for another video
that's it for this demo we've shown how
reinforcement learning can be run on
OpenShift
AI now this is a cool and fun
example but what are some practical
applications of this well this same type
of model is used to train things like
driverless cars and
Robotics there are a lot of applications
for this particular field of machine
learning and also how it relates back to
things like generative AI and large
language
models I'd like to give a special thanks
to Peter from the Pokemon RL
Community here is a GitHub page of a
Pokemon version of what I did here
today Peter has examples of a PPO
algorithm training a model to beat
Pokemon
Red a few things I want to call out
here is that he has an amazing video on
YouTube that outlines the process and
gives you great visuals of how the model
played through the
game lastly there's also a Discord
Community where we discuss things like
Pokemon but also a variety of other
games like Super Mario
Brothers do check them out thank you
[Music]