Fixed-point Error Bounds for Mean-payoff Markov Decision Processes

Google TechTalks
25 Mar 2024 · 57:54

Summary

TL;DR: Professor Roberto Cominetti, whose research spans optimization, transportation problems, and game theory, presents recent results on the control of Markov decision processes. He examines iterative schemes for solving this problem, with the goal of providing explicit error bounds for the convergence and convergence rate of a Q-learning algorithm, and he also discusses related results on stochastic gradient descent and on general stochastic iterative processes.

Takeaways

  • 🎓 Professor Roberto Cominetti is an expert who works on problems in optimization, transportation, and game theory.
  • 📚 The talk centers on the subject of a recently published paper.
  • 🔍 Its goal is the optimal control of Markov decision processes.
  • 🔄 The focus is on an iterative scheme and its relation to Krasnoselskii-Mann iterations.
  • 🤖 A Q-learning algorithm is discussed, with the aim of establishing its convergence rate and providing finite-time error bounds.
  • 📈 A fixed-point approach to the optimization problem is presented, together with its theoretical background and applications.
  • 🌐 Even without prior knowledge of the transition probabilities and rewards of the Markov process, Q-learning and online learning can be applied.
  • 📊 The talk combines theoretical analysis with numerical illustrations, using plots to aid understanding.
  • 🚀 New approaches to the optimization problem are outlined, along with directions for future research.
  • 📝 The material is a useful reference for researchers in optimization, reinforcement learning, and online learning.
  • 🌟 The talk reflects the speaker's expertise and experience and offers the audience a learning opportunity.
  • 🔗 Time is set aside for questions, which participants can ask to deepen their understanding.

Q & A

  • At which university is Professor Roberto Cominetti a professor?

    -Professor Roberto Cominetti is a professor at the University of Chile.

  • Which area of optimization does Professor Cominetti work on?

    -He works on transportation optimization problems, among other topics such as games and Markov decision processes.

  • In which journal was the paper presented in this talk published?

    -The paper, joint work with a colleague in Santiago (Mario Bravo), appeared recently in the SIAM Journal on Control and Optimization.

  • What is the Q-learning algorithm?

    -Q-learning is a reinforcement-learning algorithm for Markov decision processes (MDPs); here it is used to optimize the long-run average (mean-payoff) reward.

  • What is the purpose of the Q-learning procedure Cominetti proposes?

    -To show that the procedure converges to the solution of the mean-payoff MDP, and to provide its speed of convergence together with finite-time error bounds.

  • What special feature of the problem does the talk emphasize?

    -The transition matrix P and the rewards are not known in advance; one only observes sampled transitions and rewards, so an optimal policy has to be learned adaptively along the way.

  • In the example with two states and two actions, what is the optimal long-run average reward?

    -With two states and two actions, the optimal long-run average reward is 71/8.
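
For reference, here is a minimal Python sketch that reproduces this number by enumerating the four stationary policies. The rewards (1 and 10) and the probabilities 0.3/0.7 and 0.9/0.1 are taken or reconstructed from the talk; treat the exact matrices as illustrative assumptions rather than the slide data.

```python
# Minimal sketch: long-run average reward of the 2-state, 2-action example.
# Transition probabilities are assumptions reconstructed from the talk.
import numpy as np

P = {  # P[a][i, j]: probability of moving from state i to state j under action a
    "a1": np.array([[0.3, 0.7],
                    [0.7, 0.3]]),
    "a2": np.array([[0.9, 0.1],
                    [0.1, 0.9]]),
}
r = np.array([1.0, 10.0])  # reward depends only on the state you arrive at

def average_reward(policy):
    """Long-run average reward of a stationary policy (one action per state)."""
    Pmu = np.vstack([P[policy[i]][i] for i in range(2)])
    # stationary distribution: pi = pi @ Pmu with the entries of pi summing to 1
    A = np.vstack([Pmu.T - np.eye(2), np.ones(2)])
    pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
    return float(pi @ r)  # = pi @ Pmu @ r since pi is stationary

policies = [("a1", "a1"), ("a1", "a2"), ("a2", "a1"), ("a2", "a2")]
best = max(policies, key=average_reward)
print(best, average_reward(best))  # ('a1', 'a2') 8.875  (= 71/8)
```

Under these assumed matrices the best stationary policy plays a1 in state 1 and a2 in state 2, with long-run average reward 8.875 = 71/8, matching the value quoted above.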

  • What does the term "nonexpansive" mean?

    -An operator is nonexpansive with respect to a given norm if it does not increase distances: the distance between the images of two vectors is at most the distance between the vectors themselves.

  • What role does α_n play in the proposed Q-learning procedure?

    -α_n is the step size (learning rate): it controls the weight given to the previous Q-values versus the update built from the new sample, Q_n = (1 − α_n) Q_{n−1} + α_n (sampled update).
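
As a rough illustration of this update (a sketch under the stated assumptions, not the speaker's implementation; the sampler and the normalization f below are placeholders):

```python
# One synchronous sweep of tabular RVI Q-learning:
#   Q_n(i,a) = (1 - alpha_n) Q_{n-1}(i,a) + alpha_n [ r - f(Q_{n-1}) + max_b Q_{n-1}(j,b) ]
import numpy as np

def rvi_q_sweep(Q, sample_next_state, sample_reward, alpha, f=np.max):
    """Q: (n_states, n_actions) array of Q-factors from the previous iteration.
    sample_next_state(i, a) draws the next state j from the unknown dynamics;
    sample_reward(i, a, j) returns the observed reward; f normalizes (f(Q*) = r_bar)."""
    n_states, n_actions = Q.shape
    Q_new = np.empty_like(Q)
    for i in range(n_states):
        for a in range(n_actions):
            j = sample_next_state(i, a)
            target = sample_reward(i, a, j) - f(Q) + Q[j].max()
            Q_new[i, a] = (1.0 - alpha) * Q[i, a] + alpha * target
    return Q_new
```

Once the Q-factors have converged, an optimal policy is read off by choosing, in each state i, an action that maximizes Q(i, a).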

  • What is the "optimal value" that the talk seeks?

    -The optimal value r̄ is the best achievable long-run average reward; once the associated relative values (or Q-factors) are known, an optimal policy follows by choosing, in each state, an action attaining the maximum.

  • What kind of convergence rate does the proposed Q-learning procedure have?

    -The expected fixed-point error is bounded by a constant over √τ_n, where τ_n = Σ_{k≤n} α_k(1 − α_k); for example, with α_n = 1/(n+1) this decays like 1/√(log n). Convergence is therefore slow, but it comes with explicit finite-time bounds.

Outlines

00:00

📚 Introduction and overview of the optimization problem

The opening introduces Roberto Cominetti, who works on optimization and transportation problems and is visiting for two weeks. His talk is based on a recently published paper on control and optimization, focused on the control of Markov decision processes, and this part lays out the plan of the talk, its goals, and the results to be presented.

05:00

🎯 Markov decision processes and the reward function

The second part explains Markov decision processes and rewards. Actions are chosen on the basis of rewards, which depend on the current state, the action taken, and the state reached; the goal is to find a stationary policy that optimizes the long-run average payoff.

10:00

🔄 Iterative schemes and introduction to Q-learning

The third part introduces the iterative scheme and the Q-learning algorithm: the mean-payoff problem is cast as a fixed-point problem, and Q-learning is presented as an algorithm for approximating its solution, with an explanation of how these techniques work and how they are applied.

15:03

📈 Q-factors and the search for the optimal solution

The fourth part details the Q-factors. A Q-factor expresses the relative value of each action in each state and is used to characterize the optimal solution; this part explains how the Q-factors are computed and how they relate to the optimal policy.

20:04

🤔 Exploration and challenges in optimization

The fifth part discusses the importance of exploration and the difficulties it raises: how rare events and rarely visited high-reward states are handled, and how the procedure still recovers an optimal policy in such situations.

25:06

🌟 Optimization techniques and their applications

The final part describes the general approach and its consequences: how the resulting error bounds are used, and how the techniques extend, in particular toward the online-learning setting.


Keywords

💡Optimization

Optimization means maximizing or minimizing an objective subject to constraints. In this video it appears in transportation problems and game theory and is used to improve the control of decision processes.

💡Transportation problems

Transportation problems arise in logistics and supply-chain management and ask how to move goods between locations as effectively as possible. The video notes that optimization methods help solve them.

💡Game theory

Game theory is the mathematical study of decision problems in which several agents interact and influence one another. The video mentions it alongside optimization as one of the areas connected to the decision processes discussed.

💡Markov decision processes

A Markov decision process is a mathematical model for sequential decision-making based on state transitions over time. In the video, MDPs are combined with optimization and applied to a specific control problem: optimizing the long-run average (mean-payoff) reward.
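
In symbols (the notation below is assumed for illustration, not copied from the slides), the mean-payoff criterion discussed in the talk is the long-run average

```latex
\[
\rho^{\mu}(i_0) \;=\; \lim_{T\to\infty}\ \frac{1}{T}\,
\mathbb{E}\Bigl[\sum_{t=1}^{T} r(i_{t-1},a_t,i_t)\Bigr],
\qquad a_t=\mu(i_{t-1}),\quad i_t \sim p(\,\cdot \mid i_{t-1},a_t),
\]
```

optimized over stationary policies μ; under the unichain assumption made in the talk, this limit does not depend on the initial state i_0.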

💡Control

Control refers to choosing the actions that steer the state of a system. In the video it is the decision-making component of the MDP and optimization problems: selecting an action in each state so as to achieve the long-run objective.

💡Iterative schemes

An iterative scheme applies an update rule repeatedly so that the iterates progress toward the final solution. In the video, iterative schemes are used to solve the optimization and control problems arising from Markov decision processes.

💡Q-learning

Q-learning is a form of reinforcement learning in which an agent interacting with its environment learns which action to choose in each state. In the video it is combined with the fixed-point viewpoint and applied to the control of mean-payoff Markov decision processes.

💡Nonexpansive maps

A map is nonexpansive with respect to a given norm if the distance between the images of two points never exceeds the distance between the points themselves. In the video, the Bellman-type operator of the mean-payoff MDP is nonexpansive in the sup-norm, which is what makes the analysis delicate.

💡Fixed point iterations

A fixed-point iteration repeatedly applies an operator so that the iterates converge to a point the operator leaves unchanged. In the video, Krasnoselskii-Mann (averaged) fixed-point iterations are used to solve the control problem for Markov decision processes.
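
A minimal sketch of such an averaged iteration (the map T and the step sizes below are illustrative assumptions, not taken from the talk):

```python
# Krasnoselskii-Mann averaging: x_{n+1} = (1 - alpha_n) x_n + alpha_n T(x_n)
import numpy as np

def km_iterate(T, x0, n_steps, alpha=lambda n: 1.0 / (n + 2)):
    x = np.asarray(x0, dtype=float)
    for n in range(n_steps):
        a = alpha(n)
        x = (1.0 - a) * x + a * T(x)
    return x

# Toy nonexpansive map (a rotation of the plane: an isometry, not a contraction)
T = lambda x: np.array([x[1], -x[0]])
x = km_iterate(T, x0=[1.0, 1.0], n_steps=10_000)
print(np.linalg.norm(x - T(x)))  # residual; in general bounded by kappa / sqrt(pi * tau_n)
```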

💡Stochastic gradient descent

Stochastic gradient descent is an optimization algorithm, widely used in machine learning, that updates parameters using gradients computed from random samples. The video shows it fits the same framework: for a convex objective with an L-Lipschitz gradient, SGD is a stochastic Krasnoselskii-Mann iteration for the nonexpansive operator T = I - (2/L)∇f.
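
A sketch of that identification, with f convex, ∇f L-Lipschitz, and γ_n the SGD step size (the rescaling below is spelled out here for illustration):

```latex
\[
x_{n+1} \;=\; x_n-\gamma_n\bigl(\nabla f(x_n)+u_n\bigr)
        \;=\; (1-\alpha_n)\,x_n+\alpha_n\bigl(T(x_n)+\tilde u_n\bigr),
\qquad
T \;=\; I-\tfrac{2}{L}\nabla f,\quad
\alpha_n \;=\; \tfrac{L\gamma_n}{2},\quad
\tilde u_n \;=\; -\tfrac{2}{L}\,u_n,
\]
```

where T is nonexpansive in the Euclidean norm and the noise is simply rescaled.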

💡Error bounds

Error bounds quantify how far an algorithm's iterate can still be from the solution, here measured by the fixed-point residual. In the video they are the main deliverable: explicit, finite-time bounds for the iterations used to control mean-payoff Markov decision processes.
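
For reference, the deterministic bound quoted later in the talk for the Krasnoselskii-Mann iteration x_{n+1} = (1 - α_n) x_n + α_n T(x_n) has the explicit form

```latex
\[
\bigl\|x_n - T(x_n)\bigr\| \;\le\; \frac{\kappa}{\sqrt{\pi\,\tau_n}},
\qquad
\kappa \;=\; 2\,\operatorname{dist}\bigl(x_0,\operatorname{Fix}T\bigr),
\quad
\tau_n \;=\; \sum_{k=1}^{n}\alpha_k\,(1-\alpha_k),
\]
```

and, as the speaker notes, the constant 1/√π cannot be improved.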

Highlights

Roberto Cominetti, a professor at the University of Chile, discusses optimization, transportation problems, games, and the control of Markov decision processes.

Cominetti's recent paper, published in the SIAM Journal on Control and Optimization, deals with the control of Markov decision processes.

The talk will cover three main topics, including an iterative scheme for mean-payoff Markov decision processes (MDPs) and its relation to Krasnoselskii-Mann iterations.

Q-learning, an algorithm for mean payoff MDPs, will be analyzed in terms of its convergence and speed of convergence.

Finite time error bounds for Q-learning will be provided, offering an estimate of the expected error after a certain number of iterations.
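
In rough form (the explicit constant κ_α is given in the paper and is not reproduced here, and the precise error measure is the one the paper specifies), the finite-time bound has the shape

```latex
\[
\mathbb{E}\,\bigl\|\,Q_n - H^{*}(Q_n)\,\bigr\|_{\infty}
\;\le\; \frac{\kappa_\alpha}{\sqrt{\tau_n}},
\qquad
\tau_n \;=\; \sum_{k=1}^{n}\alpha_k\,(1-\alpha_k),
\]
```

where H* is the normalized Bellman operator whose unique fixed point Q* satisfies f(Q*) = r̄.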

The talk will also touch on stochastic gradient descent and stochastic iterations with bounded variances.

A key challenge is addressing the problem of not knowing the transition probabilities and rewards in advance in Markov decision processes.

The goal is to find a policy that minimizes the long-run expected average payoff in a stochastic system described by a Markov chain.

The speaker will present a small example with two states and two actions to illustrate the optimization of long-run average payoff.

The concept of relative value for each state and its role in solving the fixed point equation of the problem will be discussed.

The talk will explore the use of Q-factors and the role of a normalization function in fixing the constant in the fixed point equation.
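
Written out from the talk's description (notation assumed: p_{ij}(a) the controlled transition probabilities, r̄ the optimal mean payoff), the equations referred to here are

```latex
\[
V(i) \;=\; \max_{a}\ \sum_{j} p_{ij}(a)\,\bigl[r(i,a,j)+V(j)\bigr] \;-\; \bar r,
\qquad
Q(i,a) \;=\; \sum_{j} p_{ij}(a)\,\Bigl[r(i,a,j)-\bar r+\max_{b}Q(j,b)\Bigr],
\]
```

and RVI Q-learning replaces the unknown r̄ by a normalization f(Q) (for example the mean or the maximum of the entries of Q), which also pins down the additive constant in the otherwise non-unique solution.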

The speaker will delve into the technical assumptions and the properties of the operator involved in the fixed point problem.

The issue of non-uniqueness in the solution of the fixed point equation and its implications will be addressed.

The talk will present a method for adaptively learning an optimal policy in Markov decision processes without prior knowledge of probabilities and rewards.

The speaker will discuss the convergence of the Q-learning procedure and the conditions under which it converges to the solution of an MDP.

The talk will also cover the extension of these concepts to online learning scenarios where resampling is not possible.

The speaker will provide insights into the application of these methods in various fields such as maintenance in telecommunications and control of queuing networks.

Transcripts

play00:00

[Music]

play00:14

well hi everyone thank you for coming uh

play00:16

today we have uh Roberto Cominetti he's a

play00:20

professor at the University of Chile and he has

play00:22

worked on optimization transportation

play00:25

problems uh

play00:27

games uh and well we're happy to have

play00:30

him he's staying for two weeks until

play00:33

this Saturday so he'll be leaving soon

play00:35

but of course if you want to contact him

play00:37

he's more than happy to answer questions

play00:41

now or later on so thank you very much

play00:44

to well first to to Daniel and all the

play00:48

team for the very

play00:51

nice reception here and last night was

play00:56

great uh my dinner and um

play01:02

so I I am planning to talk about this

play01:06

topic which is uh the subject of a

play01:08

recent paper it just came out uh a

play01:11

couple of months ago in the SIAM Journal on

play01:14

Control and Optimization it has to do with

play01:16

uh uh control of Markov decision

play01:18

processes so my plan uh oh I should said

play01:22

that this is joint work with a colleague in

play01:24

Santiago Mario

play01:27

Bravo uh so the paper is available if

play01:30

anyone um is interested I can give you

play01:34

the the print the

play01:36

reprint so that will be my program

play01:40

basically I will focus on the three

play01:42

first topics and I'll go quickly over

play01:45

the fourth so I will start by describing

play01:48

roughly what I mean what is the problem

play01:50

I'm facing this very

play01:52

basic uh I'm not pretending to do a

play01:55

a survey of Markov decision processes that

play01:58

would be uh huge topic so I will just

play02:02

give you the the more narrow problem

play02:06

that we're interested in and I will

play02:08

explain why this case is particularly

play02:11

has some peculiarities that make it uh a

play02:14

bit

play02:16

challenging uh and then to to address

play02:20

this uh problem I will survey I will

play02:23

look at one particular uh iterative

play02:26

scheme for mean payoff

play02:28

mdps and establish a connection with

play02:31

these Krasnoselskii-Mann iterations which are

play02:34

basically fixed point

play02:36

iterations for uh maps that are only

play02:39

nonexpansive though so there are no

play02:41

contractions only nonexpansive and it

play02:44

turns out that the solution of the mark

play02:48

of decision process in mean payoff requires

play02:51

exactly this type of uh

play02:55

problems once I survey what we know

play02:58

about these Krasnoselskii-Mann iterations and

play03:01

the techniques we have to analyze that I

play03:03

will go back to uh Q learning which is

play03:06

the algorithm for mean payoff

play03:09

mdps and explain what uh what we obtain

play03:13

so in terms of results that we expect to

play03:17

have there are two of them on the one

play03:20

hand you you we will propose or we will

play03:23

Analyze This Q learning procedure we

play03:26

would like to prove that this procedure

play03:28

converges to the solution of one of

play03:33

mdps but beyond that we would like to

play03:36

know what is the speed of convergence of

play03:39

the iterates toward the limit and even

play03:42

more importantly for us will be to

play03:45

provide finite time error bounds so

play03:48

after K iterations of the uh of the algorithm

play03:53

I would like to have an

play03:55

estimate of the expected

play03:58

error that I'm and hopefully very the

play04:02

most explicitly

play04:05

possible uh and then there's a byproduct

play04:08

of the KM iterations I will describe what

play04:12

it gives you when you study stochastic

play04:15

gradient descent and more generally the

play04:18

case of uh stochastic iterations with

play04:21

bounded variances but this depends on

play04:24

how much time we are left uh by the way

play04:28

any questions please feel free to

play04:30

interrupt any

play04:32

time so let me start by setting the

play04:35

stage and this is a sort of introduction

play04:38

for what comes later about mean payoff so uh

play04:44

imagine that you have a stochastic

play04:47

system A system that moves

play04:49

stochastically and is described by a

play04:52

Markov chain so you have some

play04:54

transition probability you have finite

play04:57

set of states which will denote

play05:00

s and uh when moving from I to J so I

play05:06

will be the initial State J will

play05:07

represent the final State you have

play05:10

certain probability but on top of that

play05:13

you can control these probabilities by

play05:15

this action a everything will be assumed

play05:18

finite in this talk so you have a way to

play05:23

decide if I'm at State I what is the

play05:25

best action that will lead me to a next

play05:28

state J which is in some sense um

play05:33

desirable and desirability is controlled

play05:36

by this

play05:37

reward in general is a function of the

play05:39

initial state of the action you're

play05:41

taking and the final state that you

play05:43

reach it could depend on everything or

play05:46

depend only on one of those the example

play05:49

will only depend on the final State the

play05:51

state that you attain

play05:54

J um so once you have that if you're

play05:57

given an initial State let's say it's

play05:59

the deterministic you know exactly where

play06:01

your system start from this is I

play06:04

zero and if you prescribe a set a

play06:07

sequence of actions at well you have a

play06:10

Markov chain that will evolve according

play06:12

to the this transition probability so

play06:14

the the next state I t+ one t will be

play06:18

the epoch or time so at stage t+ one

play06:23

your new state will be selected

play06:25

according to this conditional

play06:27

probabilities over the state of uh

play06:31

states which depends on your current

play06:33

state it and your

play06:37

action what is the goal well I would

play06:40

like to prescribe or to find out a

play06:43

stationary policy what is a policy is a

play06:45

map that given a state tells you what to

play06:49

do in that state so it assigns to each

play06:51

state an action and I would like to

play06:55

choose this uh policy mu so that this

play06:58

control by choosing systematically at

play07:02

just as a function of the current state

play07:04

it to minimize the long run expected

play07:08

average

play07:09

payoff so you see once I have controls I

play07:11

have a stream of payoffs one for each

play07:15

stage and how do there are different

play07:19

ways to evaluate a stream of payoff one

play07:22

possibility is to use discounting so I

play07:25

discount with a certain rate future uh

play07:28

rewards are discounted to bring

play07:31

it uh which makes a lot of sense when

play07:34

you're talking about money but in some

play07:37

applications uh specially system that

play07:39

run

play07:41

continuously um suppose that you know R

play07:44

oft is uh some measurement of the um of

play07:49

the of a

play07:51

system which is uh which are

play07:54

um um

play07:58

doing Main

play08:01

well maybe maintenance you only want the

play08:02

things to be running smoothly in the

play08:05

long run well this type of object is not

play08:08

discounting the the objective function

play08:12

but what you do is you consider the

play08:14

long-term average one / T the sum of the

play08:17

you look at the

play08:19

longterm mean payoff over per period and

play08:24

of course this is stochastic so you take

play08:26

the

play08:28

expectation and then you look at at the

play08:31

limmit when T is

play08:33

large again you you add up you compute

play08:36

your average payoff per uh period you

play08:41

take the expected value in the long run

play08:44

that's

play08:46

uh what we want to optimize and

play08:50

U okay I'm basically following here

play08:54

classical reference which is there

play08:55

Puterman where uh it discusses a lot

play08:59

lot of uh applications in maintenance in

play09:03

uh telecommunications or

play09:07

or control of queueing networks there are

play09:11

many many applications I will not give

play09:13

you any general examples I wouldn't be

play09:17

able I would rather present a very small

play09:21

motivation

play09:22

example the minimal example so suppose

play09:25

that I have only two states and two

play09:28

actions

play09:29

and suppose moreover that the rewards

play09:31

depend only on the final state so I have

play09:33

two states on state

play09:35

one my reward is one so when I'm I'm

play09:40

staying oops I'm staying at this state

play09:43

one here I get a reward of one if I'm at

play09:48

state two I get a reward of 10 of course

play09:50

if I'm maximizing the average payoff I

play09:52

would like to be at state two as often

play09:55

as

play09:56

possible now the the the transition

play10:00

probability depends on your action if

play10:01

your action is

play10:03

A1 and you're standing at one well you

play10:06

stay there with probability 0.3 you

play10:08

move to the good state with probability

play10:11

0.7 if you play action two well you stay

play10:16

with high probability at State

play10:18

one and only 0.1 to move to the good

play10:21

state so that's clear that if I'm at

play10:24

State one I should favor action

play10:27

A1 and vice versa at at state two I want

play10:30

to stay at that State with action one I

play10:34

stay only with probability 0.3 uh with

play10:38

action two is much

play10:40

higher so the optimal policy would be

play10:43

play A1 at one play A2 at two and if you

play10:47

look at the longterm average reward and

play10:50

so on and so forth you get that the

play10:52

value the optimal long run average and this

play10:55

is the optimal it can be proved that

play10:57

this is the optimal is 71 over 8 in this

play11:01

example okay so what we're looking is uh

play11:05

at at those uh type of

play11:08

problem so this was very easy but now

play11:11

imagine that in the sem example I don't

play11:14

show you the

play11:16

probabilities and I don't show you the

play11:19

payoffs I don't show you anything the

play11:22

only thing I give you is that when you

play11:24

try one action I sample from the true

play11:28

distribution the next

play11:30

State and I show you the next state but

play11:33

I don't tell you with which probability

play11:35

I chose so you don't have you cannot

play11:38

compute this uh this objective function

play11:40

you can only observe what is the next

play11:43

state and what is the reward that you

play11:45

got in in that stage So You observe your

play11:48

current state You observe your action

play11:50

because you choose one and then you

play11:52

observe the next uh

play11:56

State and the reward

play12:00

but I'm hiding uh everything else the

play12:03

question is uh so I don't know P or R in

play12:07

advance but I can possibly learn along

play12:10

the way and that's a question can we

play12:13

come up with

play12:14

a method that

play12:17

adaptively

play12:18

adaptively uh learns an optimal policy

play12:22

that's the main

play12:24

issue so um I will start going a bit uh

play12:29

quicker there's some technical

play12:31

assumption here which is believe me

play12:33

quite mild I'm assuming that every time

play12:36

I get a a station I provide I prescribe

play12:39

a stationary

play12:41

policy uh while the the uh induced Mark

play12:45

of change is uni chain that means that

play12:48

there is a basically there is a unique

play12:50

stationary

play12:52

distribution uh well in that case the

play12:55

large R mu i z remember is just the

play12:59

long-term expected average

play13:02

payoff this uh value doesn't depend on

play13:06

initial State and that's quite expected

play13:09

right I'm just looking at the long run

play13:11

it doesn't matter there is a unique um

play13:15

limit uh

play13:18

stationary um law

play13:22

distribution so the initial State

play13:25

shouldn't play any role so from that

play13:27

point of view i z is IR Rel relevant but

play13:30

of course it depends on what policy mu

play13:32

I'm taking

play13:34

because because the stationary

play13:37

distribution will depend

play13:38

on well it turns out that the optimal

play13:42

value R Bar which is the minimum of our

play13:44

mu overall mu and the optimal policy can

play13:48

be obtained by

play13:49

solving this type of Bellman equation and

play13:52

here's the fix Point formulation of the

play13:56

problem so I'm looking in this problem

play13:59

I'm looking for a vector v index by

play14:03

States it's it's called the relative

play14:05

value for every

play14:08

state and which satisfy this set of

play14:12

equations where you have V on on the

play14:14

right hand side and V comes over

play14:18

here so V of I is just the maximum

play14:21

overall actions of the expected value I

play14:24

I'm taking the expected value with the

play14:26

transition probability when I fix a

play14:29

of the reward plus VJ minus r

play14:33

bar so I should solve this equations

play14:37

this as an equation in

play14:38

V except that already to State this

play14:41

problem I should know R Bar the optimal

play14:44

value which is not known

play14:46

unfortunately and we will come later to

play14:49

that but if by any chance we know our

play14:52

bar then we only have to solve this

play14:56

problem and this is a fix Point equation

play14:58

V equal some TR of V or I think I will

play15:02

call it h what you see in the right hand

play15:05

side is an

play15:07

operator which acts on vector

play15:11

v and this operator you can see that it

play15:15

is monotone if I increase V it will give

play15:17

me something bigger and if I add a

play15:20

constant the constant comes out that's

play15:22

called isotone and this implies that

play15:25

this operator is nonexpansive in the L

play15:28

Infinity Norm the norm

play15:30

Infinity so you're Computing you need to

play15:33

compute a fixed point of a nonexpansive

play15:37

map with respect to a very specific Norm

play15:40

which is the norm

play15:42

Infinity the the solution is never

play15:44

unique because it's you had a constant

play15:47

on both sides you have a constant to V

play15:49

and you have another solution but

play15:51

basically that's the only uh source of

play15:55

non-uniqueness beside that this equation

play15:59

has a unique solution up to a

play16:03

constant uh the problem is that our bar

play16:07

I don't know a prior I could try

play16:10

to to adjust it right to to well that

play16:15

doesn't work I could try to guess what R

play16:18

Bar

play16:19

is and solve for V and then try to move

play16:23

adjust well that doesn't work because it

play16:25

turns out that this equation if if it

play16:29

has a Solutions then our bar should be

play16:32

the optimum value if if you miss by

play16:35

Epsilon this equation has no solution

play16:38

it's if and all leave so you really have

play16:40

to put plug there the exact optimal

play16:43

value R Bar which is not know it's not

play16:45

exact that's a big issue so you

play16:49

have

play16:50

um you have some fix Point theor fix

play16:55

Point problem of an operator that you

play16:56

cannot compute because you don't know

play16:59

our bar and you don't know

play17:02

p and you don't know R so you know

play17:05

nothing and you want to

play17:07

computer uh fix point and you really

play17:09

have to plug there the R Bar which is

play17:12

exact otherwise there is no

play17:16

solution um and the operator has some

play17:19

nice property is not expensive but still

play17:21

it's not expensive the nasty Norm it's

play17:24

notan

play17:25

Norm and most of fix Point Theory

play17:29

works for in the for good norms for ukan

play17:34

Norms here is not the case you have Norm

play17:38

Infinity well let me rewrite this

play17:42

equation by calling this Q factor so all

play17:45

the expression that it is inside the

play17:48

blue expression let's call it Q factor

play17:51

then I can express VI in terms of the Q

play17:54

factors you can rewrite this bman

play17:56

equation in terms of q c Factor

play18:00

so the only change is that the max

play18:02

operator moved inside so I'm replacing

play18:05

VJ by its expression in terms of the Q

play18:08

factor so that's another writing of the

play18:10

same equation and again this has a

play18:12

unique fix point up to a constant and

play18:16

provided that R Bar is exactly the

play18:17

optimal solution otherwise there is no

play18:20

fix

play18:21

point so this is the form uh we will

play18:25

address uh so here's what I just said

play18:28

the a fixed point for a map Age which is

play18:31

just an expectation you see I'm taking

play18:34

the sum over all possible future State

play18:38

according to probability P so I'm taking

play18:40

the expectation with respect to the next

play18:43

stage of

play18:45

the uh reward the immediate reward plus

play18:50

some sort of future

play18:52

reward except that I have to remove this

play18:55

arba

play18:57

here this map is nonexpansive as said in

play19:00

Infinity it has a unique fix point up to

play19:02

a constant but it might be difficult or

play19:06

even impossible to evalate because of

play19:08

one the R Bar is typically

play19:11

unknown and for a fix uh what I will do

play19:16

is that in the iterations instead of

play19:19

computing a fix point for

play19:22

a for the wrong RAR I will

play19:26

iterate with I will replace the R Bar by

play19:29

a function of the

play19:33

state and this is not our idea this is

play19:35

aadi that proposed in the 80s to to do

play19:39

this trick so we replace our bar with a

play19:42

function of F of Q I will tell you what

play19:44

function you should choose there and

play19:47

consider this new operator H star now I

play19:50

can

play19:51

evaluate except that I don't know uh how

play19:55

to compute the expectation know the

play19:57

reward

play19:59

so I can replace the expectation by some

play20:02

sequence of

play20:03

samples and uh try to do something with

play20:07

that that's that's the only information

play20:10

we get stage after stage well if you do

play20:14

that then here's what uh Q learning

play20:18

proposed by

play20:20

borker you say let's suppose that I have

play20:24

qn minus one an estimate of the fixed

play20:27

point I'm looking for

play20:30

uh I observe the next state and the

play20:36

reward and I replace the expectation by

play20:39

this sampling of the

play20:42

operator and with this by a simple

play20:44

averaging 1 minus alpha n the previous

play20:47

iterate plus alpha n this part which is

play20:51

just a sample of the operator there is

play20:53

no expectation there just a sample I do

play20:57

the averaging and I take this qn as my

play21:01

new uh estimate of the fix

play21:04

point and this interation you know x q n

play21:10

is a

play21:10

matrix which is indexed by state I and

play21:14

action

play21:15

a so this is what is called

play21:18

tabular rvi Q learning it says that for

play21:23

every I and a I

play21:26

update um this value so I take

play21:30

independent samples and what I'm

play21:33

requiring from f for technical reason uh

play21:36

is very simple it's simply that it is

play21:39

homogeneous with respect to adding of a

play21:45

constant if I do that then it tells that

play21:48

this H star now it has a unique fix

play21:52

Point unique not unique up to a constant

play21:56

so somehow the F of Q will fix the

play21:59

constant so it has a unique fix point

play22:02

and in fact it will be the only element

play22:04

of fixed age which was unique up to a

play22:06

constant which

play22:08

interpolates such that F of Q star gives

play22:10

me exactly the optimal

play22:13

value what I lose is that is this F of Q

play22:16

is no longer

play22:17

nonexpansive just because I added this

play22:20

function f of q but this will not be a

play22:22

problem examples of such functions well

play22:25

take the average of The Matrix

play22:29

or take the maximum of you know any

play22:33

normalization you can think of f as a

play22:35

normalization

play22:38

function so in this case going back to

play22:41

the example if you take the maximum the

play22:44

fix point is over

play22:47

here and if you look where the maximum

play22:50

is is 71 / 8 that was the optimal value

play22:53

71 over

play22:55

8 and this is the qar you fix the

play22:59

constant with

play23:03

that so here is an example on this

play23:06

example I run uh rvi Q learning so this

play23:09

iteration rvi which is there you can

play23:12

simulate and run so here are some

play23:15

numerical

play23:16

illustration well you have to choose

play23:18

what is your update parameter alpha n so

play23:22

here's for instance 1/ n +

play23:24

one and here 10 sample

play23:27

paths and I'm plotting here F of qn and

play23:30

as expected it you see it concentrates

play23:33

around R Bar which is 71

play23:36

over8 you see that all trajectories

play23:40

converge and uh in terms of expected

play23:43

rate if I look

play23:45

at what is the fix Point residual what

play23:47

is the error for the true operator AG

play23:50

which I cannot

play23:52

compute but still uh this expected

play23:56

value it's SC sces in this case like

play24:00

square root of Logan that gives you the

play24:03

rate of convergence maybe a bit faster

play24:05

because you see this product is still

play24:08

decreasing but I can tell you that your

play24:10

error is roughly decaying as one over

play24:13

square root of Logan is not very fast

play24:17

but it is what it

play24:19

is it seems to be pretty much uh tight

play24:23

in this example at least how about

play24:25

question yeah maybe I just misses pay

play24:28

attention but um how do you handle do

play24:31

you need to do exploration are you doing

play24:32

exploration or is your markof are you

play24:35

assuming that like every policy if I

play24:37

need if I need to do what any

play24:40

exploration like the any exploration so

play24:43

I me this convergence is like uh like

play24:47

convergence for a fixed policy or are

play24:49

you optimizing the policy as you go um

play24:52

I'm just

play24:54

doing this

play24:57

iteration over there okay so you are you

play25:01

are picking the best action each I'm

play25:03

taking the best action here yes when I

play25:05

compute the maximum yes uh sample

play25:09

maximum but it's a sample maximum yes so

play25:12

it's not a policy I I take qn minus one

play25:15

was my previous estimate of the optimal

play25:17

policy of the Q

play25:21

factors I

play25:23

observe the new

play25:25

state and I do a posterior what what

play25:28

would be the the best

play25:31

action starting from The Next Period on

play25:34

so here is the current r i which is the

play25:39

reward I get immediately and then I

play25:41

adjust my policy for the next

play25:45

State uh I take the best policy you are

play25:48

are are you assuming that all states are

play25:51

visited with positive probability for

play25:54

all policies not necessarily I only

play25:56

assume that if you give me any policy

play26:00

any fixed uh stationary policy the mark

play26:03

of change that results is uni chain it

play26:06

has a unique so there there could be

play26:08

states that are transient for that

play26:09

policy yes and different policies could

play26:12

have different Transit States yeah I

play26:15

mean I

play26:17

guess if you start out with policy that

play26:19

like never wants to visit some

play26:22

State uh and then like you observing

play26:25

good rewards like are you going to like

play26:27

under this strategy are you going to um

play26:31

like explore the whole

play26:33

Space there is no policy

play26:35

here um I'm not fixing the policy yeah I

play26:39

look at the state s i a n so the next

play26:43

state you arrive and you take the best

play26:45

action that you would take assuming that

play26:48

your Q factors are given by Q minus

play26:52

one eventually if if the Q converts to a

play26:56

q star the AR Max there will give you

play27:00

the optimal policy the optimal policy

play27:02

will be once you have converged when

play27:04

your where your qn convert so if you if

play27:07

I give you the Q star this fixed

play27:10

point then by taking any U that

play27:14

maximizes for for this Qi U or Qi a

play27:19

maximize

play27:20

over uh a this is the optimal policy but

play27:25

in this iteration I don't need to

play27:26

specify any policy so it's not polic I I

play27:29

guess my here's what I'm sorry maybe I

play27:32

don't mean to dwell on this but I just

play27:33

want to make sure I understand what's

play27:34

happening so if you had a a state action

play27:38

pair where there was a very rare High

play27:40

payoff and usually the payoff was like

play27:43

bad zero negative whatever and you

play27:46

sampled it a few times and then uh you

play27:49

saw that it was always bad to be here

play27:51

and then you ended up in a policy that

play27:56

like if there was a feasible policy that

play27:59

where that state was

play28:00

transient uh then like you may never

play28:02

even observe the very high payoff

play28:05

because you didn't visit it enough times

play28:07

how do you and but if the optimal policy

play28:10

needs to if the optimal policy does need

play28:12

that big rare payoff are you guaranteed

play28:14

to I mean that that it seems like you're

play28:17

saying that you should you should get

play28:18

there you should get there yeah the the

play28:21

theorem tells you that if you choose

play28:23

Alpha adequately it will converge to the

play28:26

optimal solution

play28:32

now if you have a state which usually

play28:34

gives you bad

play28:35

payoffs then this qn minus one when you

play28:39

update it will usually update well with

play28:42

the maximum of the previous one but if

play28:44

they're bad you're averaging with bad

play28:46

payoffs they

play28:48

are which is there is usually very bad

play28:52

you say with with small probability you

play28:54

get a super high payoff well when that

play28:57

happen happens when when the system

play29:00

jumps over

play29:01

there then uh you will get uh this

play29:05

payoff and will balance up and it will

play29:08

raise the qn and will raise

play29:10

the I I what ensures that we're going to

play29:14

continue returning to the state that had

play29:17

the bad

play29:20

payoff so Q is a matrix you're

play29:23

evaluating every state and every yes

play29:25

every state and every action every time

play29:30

yeah uh so you're forcing the iteration

play29:33

is saying I'm going to pick that b State

play29:36

because you need to evaluate Qi of a I

play29:38

being the best state with all the

play29:40

actions so you're sampling every time

play29:43

you're sampling every possible

play29:45

transition yes oh you're not it's not

play29:48

one sample path oh okay no sorry sorry I

play29:51

thought it was one sample path no

play29:53

I that's that's what we're currently

play29:55

working on okay it's online learning

play29:58

that no here is tabular so for every eye

play30:01

you simulate your next St for every ey

play30:03

in action you simulate what would happen

play30:06

and then you adapt uh it's a

play30:10

t i I had this um these idea of like I'd

play30:15

seen this before and like another and I

play30:17

thought that we were doing the other

play30:19

thing what's the intuition of what Q is

play30:23

uh that's that's a because it's not the

play30:26

policy is a sort of value but it's a

play30:29

relative value because you know that the

play30:32

the optimal value doesn't depend on the

play30:33

initial state where whereas this things

play30:36

factors are sort of what small

play30:40

advantages give you at at State ey to

play30:44

start with uh so it's a sort of first

play30:47

order approximation of the optimal

play30:49

policy first order sensitivity with

play30:51

respect to the action H sorry with

play30:54

respect to the initial

play30:56

state although in the long run the isal

play30:59

state doesn't have

play31:03

any any relevance uh well this gives you

play31:06

a sort of first order expansion okay and

play31:10

this why you get this RAR over there so

play31:14

I have a final slide where I try to to

play31:17

provide some intuition but it's purely

play31:19

technical I don't have a good uh no

play31:23

weird because Matrix question like okay

play31:27

if I have the ability need to sample Q

play31:30

like I can call sampling Q as much as I

play31:32

want like I mean is this procedure any

play31:36

better than like me just sampling Q then

play31:38

forming the lp that solves the long run

play31:40

average cost problem and then solving

play31:41

the lp like um mean is there and is

play31:45

there what is like if I were to do that

play31:47

what strategy should I use to sample

play31:48

yeah so the question is why shouldn't I

play31:51

sample I mean there are several things

play31:52

that you can do if you're out to sample

play31:55

repeatedly you can estimate your your

play31:58

probabilities by just doing a lot of

play32:00

sampling and then estimate your p a

play32:02

priority and then just solve for the

play32:04

equilibrium uh there is not a big

play32:06

Advantage as far as I know there is not

play32:10

much uh difference between the two y

play32:13

it's

play32:14

basically basically the same I mean here

play32:17

it's just you don't need to solve any

play32:20

equations uh but you just do a sort of

play32:25

uh update it's like you were doing uh

play32:29

you were solving the linear equations by

play32:32

a sort of iterative scheme okay and then

play32:36

you know as the process evolves your

play32:39

probabilities are adjusting themselves

play32:41

so it's not really

play32:43

different the the small advantage that I

play32:47

see is that this can be more easily

play32:50

adapted or extended rather and it's not

play32:53

trivial to the case that you're

play32:54

interested in this is the online case

play32:57

where I don't have the right so I'm at

play32:58

State I I'm only allowed to take one

play33:01

action and observe what happens with

play33:04

that action I cannot observe what if I

play33:06

had play a different action here I'm

play33:08

allowed to sample for up State Eye every

play33:11

possible action and see for every you

play33:14

control your system online you can

play33:16

you're not allowed to do that you cannot

play33:17

go about track and say okay what if I

play33:19

yesterday I did something else no okay

play33:23

but this uh this more challenging we're

play33:26

working on that right now but the

play33:28

techniques will more or less have have

play33:31

half of the results

play33:33

already okay so

play33:36

um I guess I'm I'm I'm running already

play33:40

out of time no I think that we have at

play33:45

least 15 more minutes okay so yeah I

play33:49

will try to speed up so here's

play33:51

another uh rule I take Al one over n+

play33:55

one so something which

play33:58

the case um uh the power law the sample

play34:03

paths are a bit more noisy the

play34:06

rate seems to go like uh 1/ sare otk 10

play34:11

of M really slow what is best log n

play34:16

square root of log n or 1 / n square

play34:20

root T of you can tell me both are are

play34:25

terrible and I would agree but if in

play34:27

fact log is better than this this

play34:30

polinomial I mean power law but in order

play34:34

to observe that this bits the log

play34:37

eventually it will beat but you need n

play34:39

10 to the 9 subtract it 10 to the 9

play34:42

iterations to to have this square root

play34:45

of 10 to beat the square root log n so

play34:49

it's

play34:51

uh so having a polinomial is nice but uh

play34:55

it only works for n VAR very large no

play34:58

one will do that so it's pretty useless

play35:02

well here's the spoiler all of this

play35:04

including the the 10 over there and the

play35:08

square root log n and all of this can be

play35:11

proved formally and with explicit error

play35:15

bounds

play35:17

uh so um I should probably cut

play35:24

through koses man iterations but let me

play35:28

quickly give you some rough idea of how

play35:31

we address this this problem so uh I

play35:35

take a more General problem which is to

play35:37

compute an or to approximate a fix point

play35:40

of an operator T operating of finite

play35:44

dimensional space and the only

play35:46

assumption is that t is not expansive

play35:48

with some norm and I'm assuming that t

play35:51

is only a computable add some random

play35:54

noise so I have and I have this stoas

play35:58

ceration cation is when you take un

play36:02

equals

play36:03

zero uh if you're interested in

play36:07

stochastic gradient descend well the

play36:08

operator T will be the identity minus 2

play36:13

over L the gradient of your objective

play36:15

function I'm assuming that you have a

play36:18

convex function which gradient is elip

play36:21

which more or less the standard setting

play36:23

for

play36:24

stoas then you can stoas is exactly this

play36:28

SK skm iteration with an operator very

play36:32

specific which is identity minus 2 overl

play36:35

the

play36:37

gradient in that case uh the norm will

play36:40

be ukan Norm you get no expansivity with

play36:44

respect tole Norm in our case the

play36:46

operator T will be this H for which you

play36:50

want compute the and here I'm only

play36:52

assuming that it's not expansive with

play36:54

respect to some Norm I don't ask

play36:56

anything about the

play36:58

norm so that's the the

play37:01

iteration it's it's sort of success is

play37:05

averages I'm averaging the previous

play37:07

iterate and the image except that the

play37:10

image I cannot compute I can compute it

play37:13

up to a noise

play37:15

un uh what I'm standing assumption the

play37:18

alpha n are averaging sequences so it's

play37:21

a sequence of scalers with this property

play37:24

that sum of Al Al 1us Al is Infinity you

play37:29

may have seen this uh many times in

play37:31

different context this this condition

play37:34

comes up and uh the noise I'm assuming

play37:38

it's a mark different sequence in L2 so

play37:40

it's it's essentially centered variable

play37:44

with zero expectation and with bounded

play37:47

variance with finite variance sorry so I

play37:50

will denote Theta n the variance of the

play37:53

noise this will pay a role you have to

play37:55

control that one otherwise you have in

play37:58

trouble and they're assumed to be finite

play38:02

but they could possibly grow with that

play38:06

which is essential to study

play38:09

the uh cerning iteration in the kar

play38:12

iteration you will have the thean grows

play38:14

like Logan you can establish this type

play38:18

of of growth so I'm assuming the noise

play38:22

is finite has finite variance but the

play38:25

variance May grow

play38:28

um so let me just give you a very rough

play38:31

idea how how we can proceed suppose that

play38:35

forget about the noise and you have this

play38:37

sort of successive averag so how can you

play38:41

prove that how can you estimate the

play38:44

residual of your fix point which is a

play38:46

measure the natural measure in this case

play38:48

of you would stop the iteration as soon

play38:51

as this is small how can you estimate

play38:54

that uh well if you had contraction that

play38:57

would be easy that's a

play38:59

b typical trick this this decays like

play39:05

aetric series but here is you just have

play39:07

non

play39:08

expansivity so what do we do we observe

play39:11

that exen is an average of the previous

play39:13

iteration and the previous image I can

play39:16

backt track I can say xn minus one is

play39:18

itself an average of the previous images

play39:21

and so forth and so and finally you get

play39:23

that xn is some sort of average along

play39:27

your

play39:28

PA iterations of the object and the PS

play39:32

are prescribed by the alphas and it's

play39:36

very explicit

play39:38

now suppose that we can estimate the

play39:41

distance from XM to XEL for the moment

play39:43

suppose that I have an estimate

play39:45

dmn then what happens with xn minus dxn

play39:49

I

play39:50

replace right the the value of TN in

play39:53

terms of of this Pi then I do triangle

play39:56

inequality

play39:58

uh I use the expansivity of T I use my

play40:01

bound so I get a Bound for xn minus 3 of

play40:07

the question is so how can we find an

play40:09

estimate of the distance between two

play40:11

itats M and

play40:13

N well we do the same and here's what

play40:17

where optimization comes into play so

play40:19

we'll do a matching you you see XM is an

play40:23

average of the of some vectors XM is

play40:26

also an average of some Vector let's

play40:28

match those weights those

play40:32

probabilities so I take

play40:36

uh variable z n so I I assume that PM is

play40:41

the blue

play40:43

um the blue weights are the pi i m the

play40:48

red is the red mass and I will try to

play40:51

match those by doing a uh an optimal

play40:55

transport problem so solving a linear

play40:56

problem

play40:58

because you see XM minus XM is just this

play41:02

difference if I take

play41:05

z uh with those good properties that

play41:08

send not send the transport from PI M to

play41:12

Pi n i can express everything in terms

play41:14

of a single sum and

play41:17

now let's do triangle inequality

play41:21

here so if I do trality I use no

play41:24

expansivity I get Z J the previous

play41:28

iterates if I assume that I already

play41:30

computed bounds for the previous iterate

play41:33

I have a new Bound for m

play41:38

andn so the idea is to do this

play41:41

recursively to use the previous bound D

play41:44

IUS one J minus one to constru the next

play41:46

bound MN and then proceeding

play41:51

uh

play41:53

inductively so I Define my dmn as the

play41:57

optimal transport problem for sending Pi

play42:00

M to Pi n and what is the cost the

play42:04

transportation cost will be the

play42:05

distances that I estimated

play42:10

previously so it seems a bit involve all

play42:14

of this

play42:15

because it it has one advantage is that

play42:19

you see the operator T disappeared plays

play42:21

no role this thing is just an iteration

play42:24

of real

play42:26

numbers

play42:27

you take the d j minus is a double

play42:31

induction on M and N but you can build

play42:34

that and this has a lot of properties

play42:36

this dmn turns out to be a distance it

play42:39

turns out there is a greedy algorithm to

play42:42

compute them you don't need to solve the

play42:43

linear program I can give you a greedy

play42:46

procedure to compute them and more

play42:49

importantly you see we went through all

play42:52

of these to find these bounds and these

play42:54

bounds provide you an estimate finally

play42:57

for the residual you get at the end

play43:02

iteration and this dmn you can compute

play43:05

it a priority they only depends on the

play43:08

on the alphas that you

play43:10

chose given your Alpha you compute your

play43:12

D with the idea your computer B RN and

play43:16

you're done you have your

play43:18

estate independent of the operator T

play43:21

independent of the space independent of

play43:24

the dimension it it could even work in

play43:26

infinite dimension

play43:28

so the bounds you obtain here are very

play43:29

robust they work for all operators and

play43:33

you might say okay that's that's too

play43:35

much it it means that we we

play43:38

overestimated everything well no I can

play43:41

construct you an operator which

play43:43

satisfies each and every inequality as

play43:45

an

play43:46

equality where XM XM minus xn in Norm is

play43:51

exactly equal to

play43:53

dmn and uh your residual over here the

play43:57

RN so the norm of xn minus t of xn is

play44:01

equal to this

play44:03

RN there is an operator which satisfies

play44:06

each for every n m it satisfies

play44:08

everything with

play44:10

equality that doesn't mean if you give

play44:13

me a specific

play44:15

operator maybe you can do better but it

play44:19

does mean that uh the technique in the

play44:22

worst case is the best you can expect

play44:24

this

play44:26

Valance

play44:28

uh maybe I missed an assumption or

play44:31

something I I mean I've seen like the

play44:32

contractive mapping stuff before but

play44:34

this is all I've never seen this um

play44:37

I like is there some assumption on like

play44:40

the boundedness of the space so that way

play44:43

the non-expensive mapping converges it

play44:47

comes through I come through okay so

play44:51

here is exactly this uh this issue for

play44:54

the moment I'm assuming that my operator

play44:56

from RN to r or Rd to

play44:59

Rd um so here's the result which there

play45:05

is an original results in Bon Brook and

play45:08

then we completed that uh in 2013 so a

play45:13

special case was treated in Bayon Brook

play45:15

who stated this as a as a conjecture and

play45:18

we proved it in 2013 so let me let's let

play45:22

me Define Kappa equal two times the

play45:24

diesel from x0 to 15 this already

play45:27

requires that I'm assuming that t has a

play45:29

fix Point okay so all my iterations will

play45:33

leave in the ball Center at X zero and

play45:37

with this G and and in fact the operator

play45:39

T is confined to that ball alternatively

play45:42

I could take Kappa if if your operator T

play45:45

is operating from a convex bounded set

play45:48

into itself I could take Kappa just the

play45:50

diameter yeah of that set okay so that's

play45:54

I think that yeah I mean I was something

play45:56

like you know like X to X plus one is a

play45:58

non-expensive mapping that is that

play45:59

doesn't that will go to Infinity yeah no

play46:02

this already ensures that you have a

play46:04

fixed point and here I'm assuming the

play46:06

priority and if it is bounded it should

play46:09

have so take Kaa this then let's define

play46:14

to n as the sum of Al Alpha K 1us Alpha

play46:17

K and by assumption this is going to

play46:20

infinity and theine sigma of Y this

play46:23

function the minimum forget about the

play46:25

minimum is one sare root of Pi

play46:29

y

play46:32

then by analyzing this recursive optimal

play46:35

transport problem you can

play46:37

prove explicitly more explicit bound you

play46:41

can get that this Norm is less or equal

play46:45

the the residual after n iterations is

play46:48

less or equal than the Kappa the two

play46:50

times the distance to the fixed Point

play46:53

divid by square root of Pi

play46:55

to

play46:58

and you could think that P is a white P

play47:02

appears here it's a

play47:05

mysterious what I can tell you is that

play47:07

you cannot improve is the best you can

play47:09

expect for that bound if you give me

play47:12

anything smaller than one/ sare root of

play47:15

Pi I can build you an example that

play47:17

violates this

play47:18

equality this was any Norm or

play47:21

ukan and the Pi still weird I mean you

play47:24

can see the pi showing up but when you

play47:28

it comes out from the analysis when the

play47:30

analysis is a bit involved because you

play47:32

reinterpret your Z the the optimal

play47:35

transport well

play47:37

first you take your optimal transport

play47:39

you bound it by a suboptimal transport

play47:42

very special one this suboptimal

play47:44

transport you re interpret it as

play47:47

a as a transition probability for a mark

play47:50

of chain which has nothing to do with

play47:52

the mar decision a mark of chain in the

play47:55

is z in

play47:57

square and when you Analyze This you

play48:01

boil down to uh to the gamess

play48:04

ru you go you ask the probabilities of

play48:07

what estimate you can give for the Glam

play48:09

ruin you end up with a function which

play48:13

is Gus hyper geometric function you have

play48:16

this bounds in terms of this special

play48:18

function and that special function gives

play48:20

you rise to one square root of Pi so

play48:23

it's not about normal distribution

play48:25

that's what one would expect no no no

play48:27

it's not that it's a bit more involved

play48:30

it comes from a probabilistic analysis

play48:33

which was not obvious at the beginning

play48:35

in I mean you have this recursive

play48:37

optimal

play48:38

transport you you take something which

play48:40

is suboptimal that you reinterpret as a

play48:43

mark of chain gives you gous ruin and

play48:46

eventually gives you a very explicit

play48:49

bounce and what's interesting that you

play48:51

can only proove that but except you did

play48:54

a lot of

play48:55

majorization but the

play48:57

majorization somehow are not uh over

play49:02

stated this this bound is tight you

play49:04

cannot improve this we we proved with

play49:06

Mario

play49:09

Bravo okay in particular you can have if

play49:13

goes to Infinity then the residual goes

play49:15

to zero of course one square root and

play49:17

and it tells you what what the speed is

play49:19

one/ square root of T you already got

play49:23

but this of course was for for the

play49:25

deterministic case there's no noise here

play49:29

so what about if I introduce

play49:32

noise so first you can say let's let's

play49:36

introduce some error n but suppose that

play49:39

I can control this error it's not a it's

play49:42

not a stochastic noise but it's some

play49:45

let's say Precision I compute the

play49:48

operator T and I have an error that I

play49:52

it's under my

play49:53

control well it turns out that this

play49:57

bound that I have here you

play50:00

see this Bound in in this

play50:02

equation it will still be there but

play50:05

there should be some

play50:07

extra uh account for the for the errors

play50:11

and here's the ugly expression so you

play50:14

have your Sigma to as before and here

play50:17

you have the errors of course if the

play50:19

error is zero if the e k and plus one

play50:22

are zero is the same as before otherwise

play50:25

you have here sort of convolution in in

play50:28

the ter in between it's ugly don't I

play50:30

don't expect you to to interpret or to

play50:34

retain this but you have some some B

play50:37

that you can control except that this

play50:39

requires to be controlled for instance

play50:43

if I take en such that the sum is

play50:46

somehow controlled then I can prove that

play50:49

the sequence still

play50:52

converges

play50:54

uh so okay here are some examples of

play50:59

speeds of convergence for different

play51:01

controls of the

play51:03

errors

play51:06

um my only Point here don't look at the

play51:08

equations maybe if you're interested you

play51:10

can look it later my point is this

play51:13

bounce really can be used to and

play51:15

translated into uh not only rates of

play51:18

convergence but it can give you explicit

play51:21

bounds this is less or equal than log to

play51:24

n/ root of 2 N multiply by this

play51:27

Con so you can really exploit this very

play51:30

precisely

play51:32

so okay so here is the basic tool a mar

play51:35

of change with rewards but I will skip

play51:37

that I'm running out of time so let me

play51:40

skip

play51:41

this and uh so let's go back to

play51:44

stochastic am and eventually come back

play51:47

to stochastic ma of decision process how

play51:50

all of this applies well B iteration is

play51:52

of this form now un is an error but

play51:56

which is not controlled is a stochastic

play51:59

term if I want to apply my previous bond

play52:04

that we obtain previously well it would

play52:06

require that the noise goes to zero

play52:08

almost surely or something like this and

play52:12

moreover that this sum of alpha K Norm

play52:14

UK is finite but since the norm the sum

play52:17

of alpha K is divergent This requires

play52:20

that UK converges to zero fast enough

play52:23

and unfortunately UK is not under my

play52:25

control

play52:27

is a stochastic if if I have a normal

play52:30

noise well I cannot expect something

play52:33

like this not even in

play52:36

expectation uh it's very restrictive

play52:39

this uh so the trick is to modify skm

play52:44

and interpret an average

play52:48

iteration um for minutes so let me just

play52:52

skip there is some trick to replace un

play52:55

by a sort of of average noise and then

play52:58

you you you have much better chances

play53:02

that this average noise comes to

play53:06

zero and everything is controlled deps

play53:09

of the

play53:10

variance and

play53:12

uh finally you can get convergence under

play53:17

some control on the variances but maybe

play53:20

it's more meaningful you get explicit

play53:22

error

play53:24

bounds uh as before in terms of

play53:27

controlling the variances that's that

play53:30

becomes a bit technical but it's the

play53:31

same the same ideas that you just adapt

play53:35

them over and over so if you go to Q

play53:38

learning so here's

play53:40

the I don't need to recall but this was

play53:43

the setting your you have a operator age

play53:47

which is non expansive blah blah blah uh

play53:50

here's the iteration that we propos it's

play53:52

exactly the form of the km and and by

play53:57

using this uh methodology developed in

play54:00

terms of optimal transport you get

play54:03

this so take alpha

play54:07

n you're averaging as a power then

play54:12

almost

play54:13

surely F of

play54:15

qm remember the F was replacing the R

play54:18

Bar well in fact it will convert to R

play54:20

Bar also the trajectory Q will converge

play54:23

to the optimal solution in nor Infinity

play54:27

well every Norm is equivalent so doesn't

play54:29

matter Q converts to this qar which is

play54:32

the unique solution of f qar equal R

play54:35

Bar and moreover there is an explicit

play54:38

constant this you have to look at the

play54:41

paper such that the fix Point error is

play54:44

bounded by Kaa a/ square root of to

play54:48

n this gives you the rates of

play54:55

convergence uh

play54:57

um so I think I can I can stop

play55:00

there um if you're interested we can

play55:04

later discuss how this applies also to

play55:06

stochastic grad descent and how it

play55:10

recovers uh even pretty recent results

play55:15

from last year and but he recovers it

play55:19

from a technique which doesn't have to

play55:21

do with optimization or anything it's

play55:23

just fix pointed

play55:25

ratios

play55:28

uh and for your question once you have

play55:31

the qar or any good enough approximation

play55:35

you derive the optimal policy by saying

play55:37

for each state I look at what is the

play55:40

optimal

play55:42

action what remains to do is H what you

play55:45

said what do I do when I I don't have

play55:47

the possibility to

play55:50

resample and I have only one uh One path

play55:55

of the r

play55:57

we are working on

play55:59

that with with Mario and with zoto Bruno

play56:04

French

play56:06

colleague he's coming to

play56:09

Santiago next week a couple weeks and we

play56:13

hope to finish that so that's we expect

play56:16

to have news within you know next months

play56:20

so the end of this year maybe we can say

play56:23

something we do not expect this already

play56:27

slow so we do not expect that this

play56:30

online will be faster for sure

play56:34

not and uh but then there are many

play56:37

variants but this is one uh possibility

play56:40

there are many variants with modate

play56:42

reduction so there's a there's still a

play56:45

lot to to

play56:47

do uh but the take-home message for me

play56:51

would be that

play56:53

um since we're talking about

play56:56

all of this is basically concerned with

play56:59

Computing independent of the iterations

play57:01

that you do you're trying to compute a

play57:03

fix point of a nonexpansive map and

play57:05

nonexpansive is with respect to a bad

play57:08

Norm the norm

play57:10

Infinity

play57:11

so

play57:13

uh when you think of that there is this

play57:16

technique based on optimal transport

play57:17

that might be useful that's a take home

play57:19

method keep in mind that you can

play57:21

estimate this

play57:23

by and the mathematic is very nice I

play57:26

learn a lot of combinatorics and uh

play57:29

special functions and

play57:31

probability

play57:33

uh that's it thank you very much thank

play57:36

Robert oh thank you let's have a copy

play57:39

[Music]

play57:52

yes

