# Fixed-point Error Bounds for Mean-payoff Markov Decision Processes

### Summary

TLDR: Professor Roberto Cominetti, who has worked on optimization, transportation problems, and game theory, presents his latest results on the control of Markov decision processes. He investigates iterative schemes for solving this problem, aiming to provide explicit error bounds on the convergence and speed of convergence of a Q-learning algorithm. He also describes his results on stochastic gradient descent and on general stochastic iterative processes.

### Takeaways

- 🎓 Professor Roberto Cominetti is an expert working on problems in optimization, transportation, and game theory.
- 📚 This talk focuses on the subject of a recently published paper.
- 🔍 The aim of the talk is to solve an optimal-control problem for Markov decision processes.
- 🔄 It focuses on an iterative scheme and its relation to Krasnoselskii-Mann iterations in the search for a solution.
- 🤖 The Q-learning algorithm is discussed, with the goal of providing its speed of convergence and finite-time error bounds.
- 📈 A new method for solving the optimization problem is proposed, and its theoretical background and applications are explained.
- 🌐 It is shown that online learning and Q-learning can be applied even without prior knowledge of the Markov process.
- 📊 The talk combines theoretical analysis with numerical illustrations, using graphs and charts to deepen understanding.
- 🚀 Professor Cominetti points out the potential of new approaches to optimization problems and suggests directions for future research.
- 📝 The content of the talk will be a useful resource for researchers in optimization, reinforcement learning, and online learning.
- 🌟 The talk showcases Professor Cominetti's expertise and experience and offers the audience an opportunity to learn.
- 🔗 At the end of the talk there is time for questions, and participants can ask questions to deepen their understanding.

### Q & A

### Which university is Professor Roberto Cominetti a professor at?

-Professor Roberto Cominetti is a professor at the University of Chile.

### What area of optimization problems does Professor Cominetti work on?

-Professor Cominetti works on transportation optimization problems.

### In which journal was the paper Professor Cominetti presented published?

-The paper, joint work with Mario Bravo in Santiago, was published in the SIAM Journal on Control and Optimization.

### What is the Q-learning algorithm?

-Q-learning is an algorithm for maximizing the long-run reward in a Markov decision process (MDP).

### What is the purpose of the Q-learning procedure Professor Cominetti proposed?

-The purpose is to show that the procedure converges to the solution of the MDP, and to provide its speed of convergence and finite-time error bounds.

### What special feature of the problem did Professor Cominetti describe?

-The special feature is that the probabilities, the rewards, and the transition matrix P are not known in advance, so an optimal policy must be learned adaptively while interacting with the system.

### In the example Professor Cominetti described, with two states and two actions, what is the optimal long-run average reward?

-With two states and two actions, the optimal long-run average reward is 71/8.

### What does the term "nonexpansive," mentioned by Professor Cominetti, mean?

-"Nonexpansive" means that the operator does not increase distances: the norm of the difference between the images of two vectors is at most the norm of the difference between the vectors themselves.

### What role does α_n play in the Q-learning procedure Professor Cominetti proposed?

-α_n acts as a learning rate, controlling the weight given to the Q-values of the previous iteration versus the update based on the new sample.

### What does the "optimal value" described by Professor Cominetti seek?

-The "optimal value" problem seeks a policy that selects the optimal action in each state.

### What rate of convergence does the Q-learning procedure proposed in the talk have?

-The Q-learning procedure has an error decaying on the order of 1/sqrt(T), so it converges as T grows large.

### Outlines

### 📚 Introduction of the research and overview of the optimization problem

The first paragraph of the video script introduces Professor Roberto Cominetti, who works on transportation optimization problems and shares his expertise during a two-week stay. His research concerns control and optimization, was recently published in a journal, and focuses on the control of Markov decision processes. The script details the content and purpose of his talk and the expected results.

### 🎯 Markov decision processes and reward functions

The second paragraph explains the concepts of Markov decision processes and reward functions. The process aims at selecting optimal actions based on rewards, which depend on the current state, the action taken, and the final state reached. The goal is to find a stationary policy that optimizes the long-run average reward.

### 🔄 Iterative schemes and an introduction to Q-learning

The third paragraph presents a particular iterative scheme and the Q-learning algorithm. The scheme is a method for solving the optimization problem, and Q-learning is an algorithm for finding the optimal solution of a Markov decision process. The script explains how these techniques work and how they are applied to find the optimal solution.

### 📈 Q-factors and the search for the optimal solution

The fourth paragraph provides details on Q-factors and the search for the optimal solution. Q-factors represent the relative value of each action in each state and are used to find the optimal solution. The script explains how Q-factors are computed and how they relate to the optimal solution.

### 🤔 Exploration and optimization challenges

The fifth paragraph discusses the importance of exploration in the optimization process and the challenges it raises. The script covers how rare events and high-reward states are approached, and how an optimal policy handles these situations effectively.

### 🌟 Optimization and its applications

The final paragraph describes the applications of the optimization techniques and their results. The script details the general approach to the optimization problem and how the technique is used, with particular attention to applications in the context of online learning.


### Keywords

### 💡Optimization

### 💡Transportation problems

### 💡Game theory

### 💡Markov decision processes

### 💡Control

### 💡Iterative schemes

### 💡Q-learning

### 💡Nonexpansive maps

### 💡Fixed point iterations

### 💡Stochastic gradient descent

### 💡Error bounds

### Highlights

Roberto Cominetti, a professor at the University of Chile, discusses optimization and transportation problems, games, and control of Markov decision processes.

Cominetti's recent paper in the SIAM Journal on Control and Optimization focuses on the control of Markov decision processes.

The talk will cover three main topics, including an iterative scheme for mean-payoff Markov decision processes (MDPs) and its relation to Krasnoselskii-Mann iterations.

Q-learning, an algorithm for mean payoff MDPs, will be analyzed in terms of its convergence and speed of convergence.

Finite time error bounds for Q-learning will be provided, offering an estimate of the expected error after a certain number of iterations.

The talk will also touch on stochastic gradient descent and stochastic iterations with bounded variances.

A key challenge is addressing the problem of not knowing the transition probabilities and rewards in advance in Markov decision processes.

The goal is to find a policy that optimizes the long-run expected average payoff in a stochastic system described by a Markov chain.

The speaker will present a small example with two states and two actions to illustrate the optimization of long-run average payoff.

The concept of relative value for each state and its role in solving the fixed point equation of the problem will be discussed.

The talk will explore the use of Q-factors and the role of a normalization function in fixing the constant in the fixed point equation.

The speaker will delve into the technical assumptions and the properties of the operator involved in the fixed point problem.

The issue of non-uniqueness in the solution of the fixed point equation and its implications will be addressed.

The talk will present a method for adaptively learning an optimal policy in Markov decision processes without prior knowledge of probabilities and rewards.

The speaker will discuss the convergence of the Q-learning procedure and the conditions under which it converges to the solution of an MDP.

The talk will also cover the extension of these concepts to online learning scenarios where resampling is not possible.

The speaker will provide insights into the application of these methods in various fields such as maintenance in telecommunications and control of queuing networks.

### Transcripts

Well, hi everyone, thank you for coming. Today we have Roberto Cominetti; he's a professor at the University of Chile, and he has worked on optimization, transportation problems, and games. We're happy to have him. He's staying for two weeks, until this Saturday, so he'll be leaving soon, but of course if you want to contact him he's more than happy to answer questions now or later on.

Thank you very much, first to Daniel and all the team for the very nice reception here; last night, my dinner, was great. I am planning to talk about this topic, which is the subject of a recent paper: it just came out a couple of months ago in the SIAM Journal on Control and Optimization, and it has to do with the control of Markov decision processes. I should say that this is joint work with a colleague in Santiago, Mario Bravo. The paper is available; if anyone is interested, I can give you the reprint.

So this will be my program. Basically, I will focus on the first three topics and go quickly over the fourth. I will start by describing roughly the problem I'm facing, which is very basic. I'm not pretending to do a survey of Markov decision processes; that would be a huge topic. I will just give you the narrower problem that we're interested in, and I will explain why this case has some peculiarities that make it a bit challenging.

To address this problem, I will look at one particular iterative scheme for mean-payoff MDPs and establish a connection with Krasnoselskii-Mann iterations, which are basically fixed-point iterations for maps that are only nonexpansive: there are no contractions here, only nonexpansive maps, and it turns out that solving the mean-payoff Markov decision process requires exactly this type of problem. Once I survey what we know about these Krasnoselskii-Mann iterations and the techniques we have to analyze them, I will go back to Q-learning, which is the algorithm for mean-payoff MDPs, and explain what we obtain.

In terms of the results we expect, there are two of them. On the one hand, we will propose, or rather analyze, this Q-learning procedure, and we would like to prove that it converges to the solution of the MDP. Beyond that, we would like to know the speed of convergence of the iterates toward the limit and, even more importantly for us, to provide finite-time error bounds: after k iterations of the algorithm, I would like to have an estimate of the expected error, and hopefully one that is as explicit as possible. Then there is a byproduct of the Krasnoselskii-Mann iterations: I will describe what it gives you when you study stochastic gradient descent and, more generally, stochastic iterations with bounded variances, but this depends on how much time we have left. By the way, if you have any questions, please feel free to interrupt at any time.

Let me start by setting the stage; this is a sort of introduction for what comes later about mean payoff. Imagine that you have a stochastic system, a system that moves stochastically and is described by a Markov chain. You have some transition probabilities and a finite set of states, which we will denote S. When moving from i to j (i being the initial state, j the final state) you have a certain probability, but on top of that you can control these probabilities through an action a; everything will be assumed finite in this talk. So you have a way to decide: if I'm at state i, what is the best action that will lead me to a next state j which is in some sense desirable? Desirability is measured by the reward, which in general is a function of the initial state, the action you're taking, and the final state you reach. It could depend on all of these or only on one of them; in the example it will depend only on the final state j, the state that you attain.

Once you have that, if you're given an initial state, say deterministic so that you know exactly where your system starts from, this is i_0, and if you prescribe a sequence of actions a_t, then you have a Markov chain that evolves according to this transition probability: at stage t+1 your new state i_{t+1} is selected according to the conditional probabilities over the states, which depend on your current state i_t and your action.

What is the goal? I would like to find a stationary policy. A policy is a map that, given a state, tells you what to do in that state; it assigns an action to each state. I would like to choose this policy mu so that, by systematically choosing the action a_t as a function of the current state i_t, this control minimizes the long-run expected average payoff.
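Written out, the objective just described is the long-run expected average payoff. A plausible formalization consistent with the talk (the exact notation for the stage reward is our assumption) is:

```latex
% Long-run expected average payoff of a stationary policy \mu from initial state i_0.
R(\mu, i_0) \;=\; \lim_{T \to \infty} \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T-1} r\bigl(i_t,\, a_{t+1},\, i_{t+1}\bigr) \,\middle|\, i_0,\; a_{t+1} = \mu(i_t) \right]
```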

So you see, once I have controls, I have a stream of payoffs, one for each stage, and there are different ways to evaluate a stream of payoffs. One possibility is to use discounting: future rewards are discounted at a certain rate to bring them back to the present, which makes a lot of sense when you're talking about money. But in some applications, especially systems that run continuously, suppose that r_t is some measurement of the system, something like maintenance, where you only want things to run smoothly in the long run. This type of objective does not discount; instead you consider the long-term average, 1/T times the sum: you look at the long-term mean payoff per period, and of course this is stochastic, so you take the expectation and then look at the limit as T gets large. Again: you add up your payoffs, compute the average payoff per period, take the expected value, and in the long run that is what we want to optimize.

I'm basically following the classical reference here, which is Puterman, where a lot of applications are discussed: maintenance, telecommunications, control of queueing networks. There are many, many applications. I will not attempt any general examples; I would rather present a very small motivating example.

The minimal example: suppose that I have only two states and two actions, and suppose moreover that the rewards depend only on the final state. At state one my reward is 1, so while I'm staying at state one I get a reward of 1; at state two I get a reward of 10. Of course, if I'm maximizing the average payoff, I would like to be at state two as often as possible.

Now, the transition probability depends on your action. If your action is a1 and you're standing at state one, you stay there with probability 0.3 and move to the good state with probability 0.7. If you play action a2, you stay at state one with high probability and move to the good state with probability only 0.1. So it's clear that at state one I should favor action a1. Vice versa, at state two I want to stay in that state: with action a1 I stay only with probability 0.3, while with action a2 the probability is much higher. So the optimal policy is: play a1 at state one and a2 at state two. If you work out the long-term average reward, you get that the optimal long-run average value (and it can be proved that this is the optimum) is 71/8 in this example. So these are the types of problems we're looking at.
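As a sanity check, here is a minimal sketch (not from the talk) that enumerates the four stationary deterministic policies of this example and evaluates their long-run average reward through the stationary distribution of the induced chain. The transition entries not stated explicitly in the talk (the 0.9 probabilities of staying under a2) are assumptions, chosen to be consistent with the probabilities that are stated and with the optimal value 71/8.

```python
import itertools
import numpy as np

# P[a][i][j] = probability of moving from state i to state j under action a.
# Entries marked "assumed" are not stated explicitly in the talk.
P = {
    "a1": np.array([[0.3, 0.7],    # state 1: stay 0.3, move 0.7 (as in the talk)
                    [0.7, 0.3]]),  # state 2: stay only with probability 0.3
    "a2": np.array([[0.9, 0.1],    # state 1: stay 0.9 (assumed), move 0.1
                    [0.1, 0.9]]),  # state 2: stay 0.9 (assumed, "much higher")
}
r = np.array([1.0, 10.0])  # reward depends only on the state you land in

def average_reward(policy):
    """Long-run average reward of a stationary deterministic policy.

    policy = (action at state 1, action at state 2). Builds the induced
    chain, solves pi = pi @ P for the stationary distribution, and returns
    the expected reward sum_j pi_j * r_j.
    """
    Pmu = np.vstack([P[policy[i]][i] for i in range(2)])
    # Stationary distribution: solve pi (P - I) = 0 together with sum(pi) = 1.
    A = np.vstack([Pmu.T - np.eye(2), np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi @ r

for pol in itertools.product(["a1", "a2"], repeat=2):
    print(pol, average_reward(pol))
# The best policy is (a1, a2), with value 71/8 = 8.875.
```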

This was very easy, but now imagine that in the same example I don't show you the probabilities, and I don't show you the payoffs; I don't show you anything. The only thing I give you is that when you try an action, I sample the next state from the true distribution and show it to you, but I don't tell you with which probability I chose it. So you cannot compute this objective function; you can only observe the next state and the reward you got at that stage. You observe your current state, you observe your action (because you choose it), and then you observe the next state and the reward, but I'm hiding everything else. So I don't know P or r in advance, but I can possibly learn along the way, and that's the question: can we come up with a method that adaptively learns an optimal policy? That's the main issue. So, I will start going a bit quicker now.

issue so um I will start going a bit uh

quicker there's some technical

assumption here which is believe me

quite mild I'm assuming that every time

I get a a station I provide I prescribe

a stationary

policy uh while the the uh induced Mark

of change is uni chain that means that

there is a basically there is a unique

stationary

distribution uh well in that case the

large R mu i z remember is just the

long-term expected average

payoff this uh value doesn't depend on

initial State and that's quite expected

right I'm just looking at the long run

it doesn't matter there is a unique um

limit uh

stationary um law

distribution so the initial State

shouldn't play any role so from that

point of view i z is IR Rel relevant but

of course it depends on what policy mu

I'm taking

because because the stationary

distribution will depend

on well it turns out that the optimal

Well, it turns out that the optimal value r-bar, which is the optimum of R(mu) over all policies mu, and the optimal policy itself can be obtained by solving a Bellman equation, and here is the fixed-point formulation of the problem. I'm looking for a vector V indexed by states, called the relative value of each state, which satisfies a set of equations with V appearing on both sides: V(i) is the maximum over actions of the expected value, taken with the transition probability once I fix a, of the reward plus V(j) minus r-bar. So I should solve these equations, as an equation in V.
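In symbols, the average-reward Bellman (fixed-point) equation described here reads:

```latex
% Relative value equation for the mean-payoff MDP;
% \bar{r} denotes the optimal long-run average payoff.
V(i) \;=\; \max_{a} \sum_{j \in S} p(j \mid i, a)\bigl[\, r(i, a, j) + V(j) - \bar{r} \,\bigr], \qquad i \in S.
```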

Except that, already to state this problem, I should know r-bar, the optimal value, which unfortunately is not known; we will come back to that. But if by any chance we know r-bar, then we only have to solve this problem, and this is a fixed-point equation, V = H(V), where what you see on the right-hand side is an operator acting on the vector V. This operator is monotone: if I increase V, it gives me something bigger; and if I add a constant, the constant comes out. These two properties imply that the operator is nonexpansive in the L-infinity norm, the sup norm. So you need to compute a fixed point of a nonexpansive map with respect to a very specific norm, the sup norm.

The solution is never unique, because if you add a constant on both sides, that is, add a constant to V, you get another solution. But basically that is the only source of non-uniqueness; apart from that, this equation has a unique solution up to an additive constant.

The problem is that I don't know r-bar a priori. I could try to guess what r-bar is, solve for V, and then adjust, but that doesn't work, because it turns out that this equation has a solution if and only if r-bar is the optimal value: if you miss by epsilon, the equation has no solution. So you really have to plug in the exact optimal value r-bar, which is not known exactly, and that is a big issue. You have a fixed-point problem for an operator that you cannot compute, because you don't know r-bar, you don't know P, and you don't know r, so you know nothing, and you want to compute a fixed point, and you really have to plug in the exact r-bar, otherwise there is no solution. The operator does have a nice property, it is nonexpansive, but nonexpansive in a nasty norm: it is not a Euclidean norm, and most of fixed-point theory works for good, Euclidean norms. That is not the case here; you have the sup norm.

Well, let me rewrite this equation by introducing Q-factors: the whole expression inside (the blue expression) we call the Q-factor. Then I can express V(i) in terms of the Q-factors and rewrite the Bellman equation in terms of the Q-factors; the only change is that the max operator moves inside, since I'm replacing V(j) by its expression in terms of the Q-factors. That is another way of writing the same equation, and again it has a unique fixed point up to a constant, provided that r-bar is exactly the optimal value; otherwise there is no fixed point.

This is the form we will address. So here is what I just said: we seek a fixed point of a map H which is just an expectation. You see, I'm taking the sum over all possible future states according to the probability P, so I'm taking the expectation, with respect to the next state, of the immediate reward plus some sort of future reward, except that I have to subtract this r-bar. This map is nonexpansive, as I said, in the sup norm, and it has a unique fixed point up to a constant, but it might be difficult or even impossible to evaluate because, for one thing, the r-bar is typically unknown.
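In Q-factor form, the same fixed-point equation becomes:

```latex
% Bellman equation in terms of Q-factors, where V(j) = \max_b Q(j, b).
Q(i, a) \;=\; \sum_{j \in S} p(j \mid i, a)\Bigl[\, r(i, a, j) + \max_{b} Q(j, b) \Bigr] \;-\; \bar{r}, \qquad (i, a) \in S \times A.
```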

As a fix, what I will do in the iterations, instead of computing a fixed point with the wrong r-bar, is to replace r-bar by a function of the state. This is not our idea; this trick was proposed back in the 1980s. We replace r-bar with a function f(Q) (I will tell you which function you should choose there) and consider this new operator H*, which I can now evaluate, except that I still don't know how to compute the expectation, nor do I know the reward. So I replace the expectation by a sequence of samples and try to do something with that; that is the only information we get, stage after stage.

If you do that, then here is the Q-learning scheme proposed by Borkar. Suppose that I have Q_{n-1}, an estimate of the fixed point I'm looking for. I observe the next state and the reward, and I replace the expectation by this sampled version of the operator. Then, by a simple averaging, (1 - alpha_n) times the previous iterate plus alpha_n times this part, which is just a sample of the operator (there is no expectation there, just a sample), I do the averaging and take this Q_n as my new estimate of the fixed point.
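Concretely, the update described is, for each state-action pair (i, a), with j the sampled next state and r the observed reward:

```latex
% RVI Q-learning update; j \sim p(\cdot \mid i, a) is the sampled next state.
Q_n(i, a) \;=\; (1 - \alpha_n)\, Q_{n-1}(i, a) \;+\; \alpha_n \Bigl[\, r + \max_{b} Q_{n-1}(j, b) - f(Q_{n-1}) \Bigr].
```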

This iteration, where Q_n is a matrix indexed by state i and action a, is what is called tabular RVI Q-learning: for every i and a, I update this value, taking independent samples. What I require from f, for technical reasons, is very simple: it must be homogeneous with respect to adding a constant. If f has that property, then this H* has a unique fixed point, unique outright, not just up to a constant: somehow the f(Q) term fixes the constant. In fact, the fixed point will be the only element of the fixed-point set of H (which was unique up to a constant) such that f(Q*) gives exactly the optimal value. What I lose is that H* is no longer nonexpansive, precisely because I added this function f(Q), but this will not be a problem. Examples of such functions: take the average of the matrix, or take the maximum; any normalization you can think of works, and you can think of f as a normalization function.

Going back to the example, if you take f to be the maximum, the fixed point is over here, and if you look at the maximum it is 71/8, which was the optimal value, and this is the Q*: that is how you fix the constant.
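The following is a minimal runnable sketch of the tabular RVI Q-learning iteration just described, run on the two-state example (with the same assumed transition entries as in the earlier snippet), using f(Q) = max and alpha_n = 1/(n+1) as in the talk. It is an illustration, not the authors' code, and since the rate is roughly 1/sqrt(log n), the printed value is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same two-state, two-action example; the 0.9 entries are assumptions.
P = np.array([[[0.3, 0.7], [0.9, 0.1]],   # P[i][a][j], rows for state 1
              [[0.7, 0.3], [0.1, 0.9]]])  # rows for state 2
r = np.array([1.0, 10.0])                  # reward of the landing state

Q = np.zeros((2, 2))        # Q[i, a], initial estimate of the Q-factors
f = lambda Q: Q.max()       # normalization function, f(Q + c) = f(Q) + c

for n in range(1, 200_000):
    alpha = 1.0 / (n + 1)
    Qnew = Q.copy()
    # Tabular scheme: every (i, a) pair is sampled independently each round.
    for i in range(2):
        for a in range(2):
            j = rng.choice(2, p=P[i, a])          # sampled next state
            target = r[j] + Q[j].max() - f(Q)     # sample of the operator H*
            Qnew[i, a] = (1 - alpha) * Q[i, a] + alpha * target
    Q = Qnew

print("f(Q_n) =", f(Q), "(optimal long-run average is 71/8 =", 71 / 8, ")")
print("greedy policy:", ["a1" if Q[i, 0] >= Q[i, 1] else "a2" for i in range(2)])
```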

So here is an example: on this example I run RVI Q-learning, the iteration over there, which you can simulate, so here are some numerical illustrations. You have to choose the update parameter alpha_n; here, for instance, alpha_n = 1/(n+1), and here are 10 sample paths. I'm plotting f(Q_n), and as expected you see it concentrates around r-bar, which is 71/8. You see that all trajectories converge. In terms of expected rate, if I look at the fixed-point residual, the error for the true operator H (which I cannot compute), this expected value scales in this case like one over the square root of log n. That gives you the rate of convergence, maybe a bit faster, since you see this product is still decreasing, but I can tell you that the error roughly decays as 1/sqrt(log n). That is not very fast, but it is what it is, and it seems to be pretty much tight, in this example at least.

[Audience] Maybe I just missed it, but how do you handle exploration? Are you doing exploration, or are you assuming something about every policy? Is this convergence for a fixed policy, or are you optimizing the policy as you go?

[Speaker] I'm just doing this iteration over there.

[Audience] So you are picking the best action each time?

[Speaker] I'm taking the best action when I compute the maximum, yes, but it's a sample maximum. It's not a policy: I take Q_{n-1}, my previous estimate of the Q-factors, I observe the new state, and I evaluate a posteriori what the best action would be starting from the next period on. Here is the current reward r, which I get immediately, and then I adjust for the next state, taking the best action.

[Audience] Are you assuming that all states are visited with positive probability under all policies?

[Speaker] Not necessarily. I only assume that any fixed stationary policy induces a unichain Markov chain, with a unique stationary distribution. There could be states that are transient for a given policy, and different policies could have different transient states.

[Audience] I guess if you start out with a policy that never wants to visit some state, and you are observing good rewards, then under this strategy are you going to explore the whole space?

[Speaker] There is no policy here; I'm not fixing a policy. I look at the sampled next state you arrive at and take the best action you would take assuming your Q-factors are given by Q_{n-1}. Eventually, if the Q_n converge to Q*, the argmax there gives you the optimal policy; the optimal policy is obtained once the Q_n have converged. If I give you the fixed point Q*, then any mu that maximizes Q(i, a) over a is an optimal policy, but in this iteration I don't need to specify any policy.

[Audience] Sorry, I don't mean to dwell on this, but I just want to make sure I understand what's happening. Suppose you had a state-action pair with a very rare high payoff, and usually the payoff was bad: zero, negative, whatever. You sampled it a few times, saw that it was always bad, and ended up with a policy under which that state is transient. Then you may never observe the very high payoff, because you didn't visit the state enough times. If the optimal policy does need that big rare payoff, are you guaranteed to get there?

[Speaker] The theorem tells you that if you choose alpha adequately, the iteration will converge to the optimal solution.

Now, if you have a state which usually gives you bad payoffs, then when you update this Q_{n-1} you will usually update with the maximum of the previous one, and if the payoffs are bad, you're averaging with bad payoffs. When, with small probability, you get a super high payoff, that is, when the system jumps over there, you will receive this payoff, and it will balance things out and raise the Q_n.

[Audience] What ensures that we're going to continue returning to the state that had the bad payoff?

[Speaker] Q is a matrix: you're evaluating every state and every action, every time. The iteration forces you to pick that bad state, because you need to evaluate Q(i, a) for every state i and all actions a; you're sampling every possible transition every time.

[Audience] Oh, so it's not one sample path? Sorry, I thought it was one sample path.

[Speaker] No; the one-sample-path version is what we're currently working on, that is online learning. Here it is tabular: for every state i and action a you simulate what would happen, and then you adapt.

[Audience] I had seen this before in another form and thought we were doing the other thing. What's the intuition for what Q is?

[Speaker] It's not the policy; it's a sort of value, but a relative value. You know that the optimal value doesn't depend on the initial state, whereas these Q-factors capture the small advantage of starting at state i. So it's a sort of first-order approximation, a first-order sensitivity with respect to the initial state, even though in the long run the initial state doesn't have any relevance. This gives you a sort of first-order expansion, and that is why you get this r-bar over there. I have a final slide where I try to provide some intuition, but it's purely technical; I don't have a good one, because Q is a matrix.

[Audience] If I have the ability to sample as much as I want, is this procedure any better than just sampling, then forming the LP that solves the long-run average cost problem, and solving the LP? And if I were to do that, what strategy should I use to sample?

[Speaker] So the question is: why shouldn't I just sample? There are several things you can do. If you're allowed to sample repeatedly, you can estimate your probabilities by doing a lot of sampling, estimate your P a priori, and then just solve for the equilibrium. As far as I know there is not much difference between the two; it's basically the same. Here you simply don't need to solve any equations; you do a sort of update, as if you were solving the linear equations by an iterative scheme, and as the process evolves, your probability estimates adjust themselves, so it's not really different.

The small advantage that I see is that this can be more easily adapted, or rather extended (and it's not trivial), to the case you're interested in, which is the online case, where I don't have the right to resample: I'm at state i, I'm only allowed to take one action, and I observe what happens with that action; I cannot observe what would have happened had I played a different action. Here I'm allowed to sample, for each state i, every possible action. When you control your system online you're not allowed to do that; you cannot go back and say, what if yesterday I had done something else? This is more challenging; we're working on that right now, but the techniques already give us more or less half of the results.

OK, so I guess I'm running out of time. No? We have at least 15 more minutes; OK, I will still try to speed up. Here is another rule, a power-law choice of alpha_n. In this case the sample paths are a bit more noisy, and the rate seems to go like one over the tenth root of n: really slow. Which is best, 1/sqrt(log n) or one over the tenth root of n? You can tell me both are terrible, and I would agree, but in fact the power law eventually beats the log; it will beat it, but you need on the order of 10^9 iterations for the tenth root of n to beat sqrt(log n). So having a polynomial rate is nice, but it only helps for n very large; no one will run that long, so it's pretty useless. Well, here's the spoiler: all of this, including the tenth root over there and the sqrt(log n), can be proved formally, with explicit error bounds.

So I should probably cut through the Krasnoselskii-Mann iterations, but let me quickly give you a rough idea of how we address this problem. I take a more general problem, which is to compute, or rather approximate, a fixed point of an operator T on a finite-dimensional space, where the only assumption is that T is nonexpansive with respect to some norm, and I'm assuming that T is only computable up to some random noise. So I have this stochastic iteration; the deterministic Krasnoselskii-Mann iteration is what you get when you take u_n = 0.

If you're interested in stochastic gradient descent, the operator T will be the identity minus 2/L times the gradient of your objective function. I'm assuming you have a convex function whose gradient is L-Lipschitz, which is more or less the standard setting for SGD. Then SGD is exactly this stochastic Krasnoselskii-Mann iteration with the very specific operator I - (2/L) grad f, and in that case the norm is the Euclidean norm: you get nonexpansivity with respect to the Euclidean norm. In our case the operator T will be this H whose fixed point we want to compute, and here I'm only assuming nonexpansivity with respect to some norm; I don't ask anything about the norm.
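As a small illustration of this correspondence, here is a sketch (ours, not from the talk) showing that, for a smooth convex quadratic, gradient descent with step alpha * (2/L) is literally the Krasnoselskii-Mann iteration x_{n+1} = (1 - alpha_n) x_n + alpha_n T(x_n) with T = I - (2/L) grad f:

```python
import numpy as np

# Convex quadratic f(x) = 0.5 * x^T A x, gradient A x; L = largest eigenvalue.
A = np.diag([1.0, 4.0])
L = 4.0
grad = lambda x: A @ x
T = lambda x: x - (2.0 / L) * grad(x)    # nonexpansive in the Euclidean norm

x = np.array([5.0, -3.0])
for n in range(1, 1000):
    alpha = 0.5                          # any averaging sequence works here
    x = (1 - alpha) * x + alpha * T(x)   # KM step == gradient step of size alpha*2/L

print("x_n =", x, " residual ||x - T(x)|| =", np.linalg.norm(x - T(x)))
```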

So that's the iteration: it's a sort of successive averaging. I'm averaging the previous iterate and its image, except that I cannot compute the image exactly; I can compute it only up to a noise u_n. My standing assumption is that the alpha_n form an averaging sequence: a sequence of scalars with the property that the sum of alpha_n (1 - alpha_n) is infinite; you may have seen this condition come up many times in different contexts. As for the noise, I'm assuming it is a martingale difference sequence in L2, so it is essentially centered, with zero conditional expectation, and with finite variance. I will denote by theta_n the variance of the noise; this will play a role, and you have to control it, otherwise you're in trouble. The variances are assumed finite, but they may possibly grow with n, which is essential for studying the Q-learning iteration: there you can establish that theta_n grows like log n. So I'm assuming the noise has finite variance, but the variance may grow.

Let me give you a very rough idea of how we proceed. Forget about the noise for a moment, so you have this sort of successive averaging. How can you estimate the residual of your fixed point, which is the natural measure in this case? You would stop the iteration as soon as it is small, so how can you estimate it? Well, if you had a contraction it would be easy; with the typical trick, the residual decays like a geometric series. But here you only have nonexpansivity. So what do we do? We observe that x_n is an average of the previous iterate and the previous image, and I can backtrack: x_{n-1} is itself an average of the previous images, and so forth. Finally you get that x_n is a weighted average, along your path, of the images of the past iterates, and the weights are prescribed by the alphas; they are very explicit.
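Unrolling the recursion x_n = (1 - alpha_n) x_{n-1} + alpha_n T(x_{n-1}) gives this explicit convex-combination form (the weight notation pi_k^n is ours):

```latex
% x_n as a convex combination of x_0 and the past images T(x_{k-1}).
x_n \;=\; \pi_0^n\, x_0 \;+\; \sum_{k=1}^{n} \pi_k^n\, T(x_{k-1}),
\qquad
\pi_k^n = \alpha_k \prod_{j=k+1}^{n} (1 - \alpha_j), \quad \pi_0^n = \prod_{j=1}^{n} (1 - \alpha_j).
```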

Now suppose that we can estimate the distance from x_m to x_n; for the moment, suppose I have an estimate d_{m,n}. Then what happens with the residual of x_n? I replace x_n by its value in terms of these weights pi, I do a triangle inequality, I use the nonexpansivity of T, I use my bound, and I get a bound for the residual of x_n. The question is then: how can we find an estimate of the distance between two iterates x_m and x_n?

Well, we do the same, and here is where optimization comes into play: we do a matching. You see, x_m is a weighted average of some vectors, and x_n is also a weighted average of some vectors; let's match those weights, those probabilities. So I take variables z (say the blue weights are the pi_i^m and the red ones are the red mass), and I try to match them by solving an optimal transport problem, that is, a linear program. Because x_m minus x_n is just this difference, if I take a z with the right marginals, a transport plan sending pi^m to pi^n, I can express everything in terms of a single sum. Now do the triangle inequality there: using the nonexpansivity of T, I get the distances between previous iterates, so if I have already computed bounds for the previous iterates, I obtain a new bound for the pair m and n. So the idea is to do this recursively: use the previous bounds d_{i-1,j-1} to construct the next bound d_{m,n}, and proceed inductively.

So I define my d_{m,n} as the value of the optimal transport problem for sending pi^m to pi^n, where the transportation costs are the distances estimated previously. All of this may seem a bit involved, but it has one big advantage: you see that the operator T has disappeared, it plays no role. This is just an iteration on real numbers, a double induction on m and n built from the d_{i-1,j-1}, and you can compute it. It has a lot of properties: this d_{m,n} turns out to be a distance, and there is a greedy algorithm to compute it, so you don't need to solve the linear program; I can give you a greedy procedure. More importantly, we went through all of this to find these bounds, and the bounds finally provide an estimate for the residual at the end of the iteration. And the d_{m,n} can be computed a priori: they depend only on the alphas you chose. Given your alpha, you compute your d, with that you compute your bound R_n, and you're done: you have your estimate, independent of the operator T, independent of the space, independent of the dimension. It could even work in infinite dimension.

So the bounds you obtain here are very robust: they hold for all such operators. You might say that's too much, that it means we overestimated everything. Well, no: I can construct an operator which satisfies each and every inequality as an equality, where the norm of x_m minus x_n is exactly equal to d_{m,n}, and the residual, the norm of x_n minus T(x_n), is exactly equal to this R_n. There is an operator which, for every n and m, satisfies everything with equality. That doesn't mean you cannot do better for a specific operator, but it does mean that, in the worst case, this technique is the best you can expect.

[Audience] Maybe I missed an assumption. I've seen the contractive-mapping results before, but I've never seen this. Is there some assumption on the boundedness of the space, so that the nonexpansive iteration converges?

[Speaker] It comes, it comes. Here is exactly this issue. For the moment I'm assuming that my operator maps R^d to R^d. So here is the result: there is an original result by Baillon and Bruck, and then we completed it; a special case was treated by Baillon and Bruck, who stated the general case as a conjecture, and we proved it in 2013. Let me define kappa equal to two times the distance from x_0 to the fixed-point set of T. This already requires assuming that T has a fixed point, and then all my iterates live in the ball centered at x_0 with this radius; in fact the operator T is confined to that ball. Alternatively, if your operator T maps a convex bounded set into itself, I could take kappa to be just the diameter of that set.

[Audience] I was thinking of something like the map x to x + 1, which is a nonexpansive mapping whose iterates go to infinity.

[Speaker] Yes; this assumption already ensures that you have a fixed point. Here I'm assuming it a priori, and if the domain is bounded, a fixed point exists. So take that kappa.

Now define tau_n as the sum of alpha_k (1 - alpha_k), which by assumption goes to infinity, and define the function sigma(y) (forget about the minimum in its definition) as essentially 1/sqrt(pi y). Then, by analyzing this recursive optimal transport problem, you can prove a very explicit bound: the residual after n iterations is less than or equal to kappa, the two times the distance to the fixed-point set, divided by the square root of pi times tau_n. You could think that the pi appearing here is mysterious. What I can tell you is that you cannot improve it: it is the best constant you can expect in this bound. If you give me anything smaller than 1/sqrt(pi), I can build an example that violates the inequality.
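A small sketch of how explicit this is: given only the sequence alpha_n, one can tabulate tau_n and the residual guarantee kappa / sqrt(pi * tau_n) before ever running the iteration. (kappa is whatever bound you have on twice the distance from x_0 to the fixed-point set; the value used below is an arbitrary placeholder.)

```python
import math

def residual_bounds(alphas, kappa):
    """A priori residual bounds kappa / sqrt(pi * tau_n) for the KM iteration."""
    tau, bounds = 0.0, []
    for a in alphas:
        tau += a * (1.0 - a)           # tau_n = sum of alpha_k * (1 - alpha_k)
        bounds.append(kappa / math.sqrt(math.pi * tau))
    return bounds

n = 10_000
alphas = [1.0 / (k + 1) for k in range(1, n)]  # alpha_k = 1/(k+1): tau_n ~ log n
print(residual_bounds(alphas, kappa=1.0)[-1])  # guaranteed residual after n steps
```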

[Audience] Was this for any norm, or Euclidean? And the pi is still weird; I mean, you can see the pi showing up, but where does it come from?

[Speaker] It comes out of the analysis, and the analysis is a bit involved. You reinterpret your z, the optimal transport: first you take your optimal transport and bound it by a suboptimal transport, a very special one; this suboptimal transport you reinterpret as the transition probability of a Markov chain, which has nothing to do with the Markov decision process, a Markov chain whose states are the z in the square. When you analyze this, you boil down to the gambler's ruin: you ask what estimate you can give for the gambler's ruin probabilities, and you end up with a Gauss hypergeometric function. You get the bounds in terms of this special function, and that special function gives rise to the 1/sqrt(pi).

[Audience] So it's not about the normal distribution, which is what one would expect?

[Speaker] No, no, it's not that; it's a bit more involved. It comes from a probabilistic analysis which was not obvious at the beginning.

I mean, you have this recursive optimal transport; you take something suboptimal, which you reinterpret as a Markov chain; that gives you the gambler's ruin, and eventually a very explicit bound. What's interesting is that you can only prove this through a lot of majorizations, but somehow the majorizations are not overstated: this bound is tight, you cannot improve it. We proved this with Mario Bravo. In particular, if tau_n goes to infinity, then the residual goes to zero, of course, and it tells you what the speed is: the one over square root of tau_n you already saw. But this, of course, was the deterministic case; there is no noise here.

So what about introducing noise? First, let's introduce some error e_n, but suppose that I can control this error: it's not a stochastic noise but, say, the precision with which I compute the operator T, and it's under my control. Well, it turns out that the bound I had in this equation will still be there, but with some extra term accounting for the errors, and here is the ugly expression. You have your sigma of tau as before, and here you have the errors: of course, if the errors e_{k+1} are zero, it's the same as before; otherwise you get a sort of convolution term in between. It's ugly, and I don't expect you to interpret or retain it, but you have a bound that you can control, except that it does require control: for instance, if I take e_n such that their sum is suitably controlled, then I can prove that the sequence still converges.

OK, here are some examples of speeds of convergence for different controls of the errors. My only point here (don't look at the equations; if you're interested you can look at them later) is that these bounds really can be translated not only into rates of convergence but into explicit bounds: this quantity is less than or equal to roughly log(2n) over the square root of 2n, multiplied by an explicit constant. So you can really exploit this very precisely. Here is the basic tool, a Markov chain with rewards, but I will skip that; I'm running out of time, so let me skip this.

So let's go back to the stochastic Krasnoselskii-Mann iteration and, eventually, to stochastic Markov decision processes; how does all of this apply? Well, the iteration has this form, where now u_n is an error which is not controlled; it is a stochastic term. If I wanted to apply the bound we obtained previously, it would require the noise to go to zero almost surely, or something like that, and moreover that the sum of alpha_k times the norm of u_k be finite. But since the sum of the alpha_k is divergent, this requires u_k to converge to zero fast enough, and unfortunately u_k is not under my control; it is stochastic. If I have, say, Gaussian noise, I cannot expect something like this, not even in expectation; it's very restrictive. So the trick is to modify the stochastic KM iteration and interpret it as an averaged iteration. Let me skip the details: there is a trick to replace u_n by a sort of averaged noise, and then you have much better chances that this averaged noise goes to zero, and everything is controlled in terms of the variance.

Finally, you can get convergence under some control on the variances, but maybe more meaningful is that you get explicit error bounds, as before, in terms of the control of the variances. That becomes a bit technical, but it's the same ideas, adapted over and over. So if you go back to Q-learning, here is the setting: you have an operator H which is nonexpansive, and so on, and here is the iteration we propose, exactly of stochastic KM form. By using this methodology developed in terms of optimal transport, you get the following. Take alpha_n averaging, as a power law; then, almost surely, f(Q_n) (remember, the f was replacing the r-bar) will in fact converge to r-bar, and the trajectory Q_n will converge to the optimal solution in the sup norm (well, in finite dimension every norm is equivalent, so it doesn't matter): Q_n converges to the Q* which is the unique solution with f(Q*) = r-bar. Moreover, there is an explicit constant kappa (for this you have to look at the paper) such that the fixed-point error is bounded by kappa over the square root of tau_n. This gives you the rates of convergence.

So I think I can stop there. If you're interested, we can later discuss how this applies also to stochastic gradient descent, and how it recovers even pretty recent results from last year, but recovers them with a technique that has nothing to do with optimization; it's just fixed-point iterations.

And, for your question: once you have Q*, or any good enough approximation, you derive the optimal policy by looking, for each state, at the optimal action. What remains to be done is what you said: what do I do when I don't have the possibility to resample and I have only one path of the process? We are working on that, with Mario and with a French colleague, Bruno, who is coming to Santiago in a couple of weeks, and we hope to finish it, so we expect to have news within the next months; by the end of this year maybe we can say something. This is already slow, so we do not expect the online version to be faster, for sure not. And then there are many variants, for instance with variance reduction, so there is still a lot to do.

The take-home message for me would be this: all of it is basically concerned with computing, whatever iterations you use, a fixed point of a nonexpansive map, where nonexpansive is with respect to a bad norm, the sup norm. When you think of that, there is this technique based on optimal transport that might be useful; that's a take-home method, keep in mind that you can estimate these iterations with it. And the mathematics is very nice: I learned a lot of combinatorics, special functions, and probability.

That's it, thank you very much.

[Host] Thank you, Roberto. Let's have a coffee.
