Fixed-point Error Bounds for Mean-payoff Markov Decision Processes
Summary
TLDR: The talk introduces Professor Roberto Cominetti's research on optimization, transportation problems, and game theory, and presents his recent results on the control of Markov decision processes. He studies iterative schemes for this problem, with the goal of providing explicit error bounds on the convergence and the speed of convergence of the Q-learning algorithm. He also discusses related results on stochastic gradient descent and general stochastic iterative processes.
Takeaways
- 🎓 Professor Roberto Cominetti is an expert working on problems in optimization, transportation, and game theory.
- 📚 The talk focuses on the subject of a recently published journal paper.
- 🔍 The goal of the talk is to address a control-optimization problem for Markov decision processes.
- 🔄 It focuses on an iterative scheme and its relation to Krasnoselskii-Mann iterations as the route to a solution.
- 🤖 The Q-learning algorithm is discussed, with the aim of providing its speed of convergence and finite-time error bounds.
- 📈 A method for the underlying optimization problem is presented, together with its theoretical background and applications.
- 🌐 Even without prior knowledge of the Markov process (probabilities and rewards), Q-learning and online learning can still be applied.
- 📊 The talk combines theoretical analysis with numerical illustrations, using plots to aid understanding.
- 🚀 Professor Cominetti points to new approaches to the optimization problem and suggests directions for future research.
- 📝 The content is a useful reference for researchers in optimization, reinforcement learning, and online learning.
- 🌟 The talk reflects the speaker's expertise and experience and offers the audience an opportunity to learn.
- 🔗 At the end of the talk there is time for questions, allowing participants to deepen their understanding.
Q & A
At which university is Professor Roberto Cominetti a professor?
-Professor Roberto Cominetti is a professor at the University of Chile.
What area of optimization does Professor Cominetti work on?
-Professor Cominetti works on transportation optimization problems, among others.
In which journal was the paper presented in this talk published?
-The paper, joint work with a colleague in Santiago (Mario Bravo), was published in the SIAM Journal on Control and Optimization.
What is the Q-learning algorithm?
-Q-learning is an algorithm for optimizing the long-run reward in a Markov decision process (MDP).
What is the aim of the Q-learning procedure analyzed by Professor Cominetti?
-The aim is to show that the procedure converges to the solution of the MDP and to provide its speed of convergence and finite-time error bounds.
What special feature of the problem does he describe?
-The transition probabilities, the rewards, and the transition matrix P are not known in advance, so an optimal policy has to be learned adaptively while interacting with the system.
In the example with two states and two actions, what is the optimal long-run average reward?
-With two states and two actions, the optimal long-run average reward is 71/8.
What does the term "nonexpansive" mean?
-"Nonexpansive" means that the operator does not increase distances: the norm of the difference of the images is at most the norm of the difference of the inputs.
What role does α_n play in the proposed Q-learning procedure?
-α_n acts as a learning rate: it controls the weight given to the previous iterate Q_{n-1} versus the new sample-based update.
What is sought when computing the "optimal value"?
-A policy that selects the best action in each state.
What kind of convergence rate does the proposed Q-learning procedure have?
-The fixed-point error is bounded by a constant over the square root of τ_T, so the iterates converge, slowly, as T grows (for example, roughly like 1/sqrt(log T) with step sizes 1/(n+1)).
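A minimal sketch of the single update summarized in the answers above, assuming a tabular Q indexed by (state, action) and the normalization f(Q) = max Q used later in the talk; the function name and signature are mine, not from the paper:

```python
import numpy as np

def rvi_q_update(Q, i, a, j, r, alpha, f=np.max):
    """One averaging step of mean-payoff (RVI-style) Q-learning:
    blend the previous estimate Q[i, a] with a sampled version of the
    Bellman operator, r - f(Q) + max_b Q[j, b], using learning rate alpha."""
    target = r - f(Q) + Q[j].max()
    Q[i, a] = (1 - alpha) * Q[i, a] + alpha * target
    return Q
```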
Outlines
📚 Introduction of the speaker and overview of the optimization problem
The first part of the video introduces Professor Roberto Cominetti, who works on transportation optimization problems and is visiting for two weeks to share his expertise. His research concerns a recent paper on control and optimization, focused on the control of Markov decision processes. The script describes the contents and goals of the talk and the expected results.
🎯 Markov decision processes and the reward function
The second part explains Markov decision processes and the reward function. The process aims at choosing actions on the basis of rewards, which depend on the current state, the action taken, and the state reached. The goal is to find a stationary policy that optimizes the long-run average payoff.
🔄 Iteration schemes and introduction of Q-learning
The third part presents a specific iteration scheme and the Q-learning algorithm. The scheme is a fixed-point method for the mean-payoff problem, and Q-learning is the algorithm used to approach the optimal solution of the Markov decision process. The script explains how these techniques work and how they are applied to find the optimal solution.
📈 Q-factors and the search for the optimal solution
The fourth part gives details on the Q-factors. The Q-factors represent the relative value of each action in each state and are used to characterize the optimal solution. The script explains how the Q-factors are computed and how they relate to the optimal solution.
🤔 Exploration and challenges of the optimization
The fifth part discusses the role of exploration in the optimization process and the difficulties it raises. The script covers how rare, high-reward states are handled and how the optimal policy deals with such situations.
🌟 Optimization and its applications
The last part describes applications of the optimization techniques and the resulting bounds. The script outlines the general approach, how the techniques can be used, and in particular their extension to the online-learning setting.
Keywords
💡Optimization
💡Transportation problems
💡Game theory
💡Markov decision processes
💡Control
💡Iterative schemes
💡Q-learning
💡Nonexpansive maps
💡Fixed point iterations
💡Stochastic gradient descent
💡Error bounds
Highlights
Roberto Cominetti, a professor at the University of Chile, discusses optimization and transportation problems, games, and the control of Markov decision processes.
Cominetti's recent paper, published in the SIAM Journal on Control and Optimization, deals with the control of Markov decision processes.
The talk covers three main topics, including an iterative scheme for mean-payoff Markov decision processes (MDPs) and its relation to Krasnoselskii-Mann iterations.
Q-learning, an algorithm for mean payoff MDPs, will be analyzed in terms of its convergence and speed of convergence.
Finite time error bounds for Q-learning will be provided, offering an estimate of the expected error after a certain number of iterations.
The talk will also touch on stochastic gradient descent and stochastic iterations with bounded variances.
A key challenge is addressing the problem of not knowing the transition probabilities and rewards in advance in Markov decision processes.
The goal is to find a policy that minimizes the long-run expected average payoff in a stochastic system described by a Markov chain.
The speaker will present a small example with two states and two actions to illustrate the optimization of long-run average payoff.
The concept of relative value for each state and its role in solving the fixed point equation of the problem will be discussed.
The talk will explore the use of Q-factors and the role of a normalization function in fixing the constant in the fixed point equation.
The speaker will delve into the technical assumptions and the properties of the operator involved in the fixed point problem.
The issue of non-uniqueness in the solution of the fixed point equation and its implications will be addressed.
The talk will present a method for adaptively learning an optimal policy in Markov decision processes without prior knowledge of probabilities and rewards.
The speaker will discuss the convergence of the Q-learning procedure and the conditions under which it converges to the solution of an MDP.
The talk will also cover the extension of these concepts to online learning scenarios where resampling is not possible.
The speaker will provide insights into the application of these methods in various fields such as maintenance in telecommunications and control of queuing networks.
Transcripts
[Music]
Well, hi everyone, thank you for coming. Today we have Roberto Cominetti. He's a
professor at the University of [inaudible], and he has
worked on optimization, transportation
problems,
games, and, well, we're happy to have
him. He's staying for two weeks, until
this Saturday, so he'll be leaving soon,
but of course if you want to contact him
he's more than happy to answer questions
now or later on.
So thank you very much, well, first to Daniel and all the
team for the very
nice reception here, and last night was
great, the dinner.
So I am planning to talk about this
topic, which is the subject of a
recent paper; it just came out a
couple of months ago in the SIAM Journal on
Control and Optimization, and it has to do with
the control of Markov decision
processes. My plan... oh, I should say
that this is joint work with a colleague in
Santiago, Mario
Bravo. The paper is available; if
anyone is interested I can give you
the
reprint. So that will be my program:
basically I will focus on the three
first topics and I'll go quickly over
the fourth. I will start by describing
roughly what the problem I'm facing is. This is very
basic; I'm not pretending to do a
survey of Markov decision processes, that
would be a huge topic, so I will just
give you the more narrow problem
that we're interested in, and I will
explain why this case has some peculiarities that make it a
bit
challenging. Then, to address
this problem, I will look at one particular iterative
scheme for mean-payoff
MDPs and establish a connection with
the Krasnoselskii-Mann iterations, which are
basically fixed-point
iterations for maps that are only
nonexpansive (they are not
contractions, only nonexpansive), and it
turns out that the solution of the Markov
decision process in mean payoff requires
exactly this type of
problem. Once I survey what we know
about these Krasnoselskii-Mann iterations and
the techniques we have to analyze them, I
will go back to Q-learning, which is
the algorithm for mean-payoff
MDPs, and explain what we obtain.
So in terms of results that we expect to
have, there are two of them. On the one
hand, we will propose, or we will
analyze, this Q-learning procedure; we
would like to prove that this procedure
converges to the solution of the mean-payoff
MDP. But beyond that we would like to
know what the speed of convergence of
the iterates toward the limit is, and even
more importantly for us will be to
provide finite-time error bounds: after
k iterations of the algorithm
I would like to have an
estimate of the expected
error, and hopefully one that is as explicit as
possible. And then there is a byproduct
of the Krasnoselskii-Mann iterations: I will describe what
it gives you when you study stochastic
gradient descent and, more generally, the
case of stochastic iterations with
bounded variances, but this depends on
how much time we have left. By the way,
any questions, please feel free to
interrupt at any
time. So let me start by setting the
stage; this is a sort of introduction
for what comes later about mean payoff. So
imagine that you have a stochastic
system, a system that moves
stochastically and is described by a
Markov chain. You have some
transition probabilities and a finite
set of states, which we will denote
s and uh when moving from I to J so I
will be the initial State J will
represent the final State you have
certain probability but on top of that
you can control these probabilities by
this action a everything will be assumed
finite in this talk so you have a way to
decide if I'm at State I what is the
best action that will lead me to a next
state J which is in some sense um
desirable and desirability is controlled
by this
reward in general is a function of the
initial state of the action you're
taking and the final state that you
reach it could depend on everything or
depend only on one of those the example
will only depend on the final State the
state that you attain
J.
So once you have that, if you're
given an initial state (let's say it's
deterministic: you know exactly where
your system starts from, this is i
zero), and if you prescribe a
sequence of actions a_t, well, you have a
Markov chain that will evolve according
to this transition probability. The next state i_{t+1} (t will be
the epoch, or time), at stage t+1,
will be selected
according to these conditional
probabilities over the set of
states, which depend on your current
state i_t and your
state it and your
action what is the goal well I would
like to prescribe or to find out a
stationary policy what is a policy is a
map that given a state tells you what to
do in that state so it assigns to each
state an action and I would like to
choose this uh policy mu so that this
control by choosing systematically at
just as a function of the current state
it to minimize the long run expected
average
payoff so you see once I have controls I
have a stream of payoffs one for each
stage and how do there are different
ways to evaluate a stream of payoff one
possibility is to use discounting so I
discount with a certain rate future uh
rewards are discounted to bring
it uh which makes a lot of sense when
you're talking about money but in some
applications uh specially system that
run
continuously um suppose that you know R
oft is uh some measurement of the um of
the of a
system which is uh which are
um um
doing Main
well maybe maintenance you only want the
things to be running smoothly in the
long run well this type of object is not
discounting the the objective function
but what you do is you consider the
long-term average one / T the sum of the
you look at the
longterm mean payoff over per period and
of course this is stochastic so you take
the
expectation and then you look at at the
limmit when T is
large again you you add up you compute
your average payoff per uh period you
take the expected value in the long run
that's
uh what we want to optimize and
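In symbols, the quantity being described is something like the following (a reconstruction from the verbal description; the exact indexing of the reward and whether the limit is a lim or a limsup are not specified in the talk):

```latex
R_{\mu}(i_0) \;=\; \lim_{T\to\infty}\; \mathbb{E}\!\left[\frac{1}{T}\sum_{t=1}^{T} r\bigl(i_{t-1},\,a_t,\,i_t\bigr)\right],
\qquad a_t = \mu(i_{t-1}),
```

and the goal is to optimize this long-run expected average payoff over stationary policies mu.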
Okay, I'm basically following here the
classical reference, which is
Puterman, where he discusses a lot
of applications in maintenance, in
telecommunications, or
control of queueing networks. There are
many, many applications. I will not give
you any general examples, I wouldn't be
able to; I would rather present a very small
motivating
example the minimal example so suppose
that I have only two states and two
actions
and suppose moreover that the rewards
depend only on the final state so I have
two states on state
one my reward is one so when I'm I'm
staying oops I'm staying at this state
one here I get a reward of one if I'm at
state two I get a reward of 10 of course
if I'm maximizing the average payoff I
would like to be at state two as often
as
possible. Now, the transition
probability depends on your action. If
your action is
A1 and you're standing at one, well, you
stay there with probability 0.3 and you
move to the good state with probability
0.7. If you play action two, well, you stay
with high probability at state
one and move to the good state with probability only 0.1.
So it's clear that if I'm at
state one I should favor action
A1. And vice versa: at state two I want
to stay at that state; with action one I
stay only with probability 0.3, with
action two it is much
higher. So the optimal policy would be:
play A1 at one, play A2 at two. And if you
look at the long-term average reward and
so on and so forth, you get that the
value, the optimal long-run average (and
it can be proved that
this is the optimal), is 71 over 8 in this
example. Okay, so we are looking
at that type of problem.
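As a quick sanity check of the numbers quoted here, the sketch below evaluates the long-run average reward of the policy (A1 at state 1, A2 at state 2). The 0.9 staying probability for action A2 at state 2 is not stated explicitly in the talk; it is an assumed value chosen so that the result matches the quoted optimum of 71/8.

```python
import numpy as np

# Rewards depend only on the state reached: r(1) = 1, r(2) = 10.
rewards = np.array([1.0, 10.0])

# Transition matrix of the chain induced by the policy (A1 at state 1, A2 at state 2).
P = np.array([
    [0.3, 0.7],   # state 1, action A1: stay w.p. 0.3, move to state 2 w.p. 0.7
    [0.1, 0.9],   # state 2, action A2: assumed to stay w.p. 0.9 (not stated in the talk)
])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()

# Long-run average reward = stationary expectation of the reward of the occupied state.
avg_reward = float(pi @ rewards)
print(pi, avg_reward)   # roughly [0.125, 0.875] and 8.875 = 71/8
```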
So this was very easy. But now
imagine that in the same example I don't
show you the
probabilities, and I don't show you the
payoffs; I don't show you anything. The
only thing I give you is that when you
try one action, I sample from the true
distribution the next
state, and I show you the next state, but
I don't tell you with which probability
I chose it. So you cannot
compute this objective function;
you can only observe what the next
state is and what reward you
got in that stage. So you observe your
current state, you observe your action,
because you chose one, and then you
observe the next
state and the reward,
but I'm hiding everything else. The
question is: I don't know P or r in
advance, but I can possibly learn along
the way, so can we
come up with
a method that
adaptively learns an optimal policy?
That's the main
issue. So I will start going a bit
quicker. There is some technical
assumption here which is, believe me,
quite mild: I'm assuming that every time
I prescribe a stationary
policy, the induced Markov
chain is unichain, which means that
there is basically a unique
stationary
distribution. Well, in that case this
R mu of i zero (remember, it is just the
long-term expected average
payoff) doesn't depend on the
initial state, and that's quite expected,
right? I'm just looking at the long run,
so it doesn't matter: there is a unique
limiting stationary
distribution, and the initial state
shouldn't play any role. So from that
point of view i zero is irrelevant, but
of course the value depends on what policy mu
I'm taking,
because the stationary
distribution will depend
on it. Well, it turns out that the optimal
value R bar, which is the minimum of R mu
over all mu, and the optimal policy can
be obtained by
solving this type of Bellman equation,
and here is the fixed-point formulation of the
problem. In this problem
I'm looking for a vector V indexed by
states (it's called the relative
value) for every
state, which satisfies this set of
equations where you have V on the
right-hand side and V comes over
here. So V of i is just the maximum
over all actions of the expected value
(I'm taking the expected value with the
transition probability when I fix a)
of the reward plus V of j minus r
bar. So I should solve these equations
as an equation in
V. Except that already to state this
problem I should know R bar, the optimal
value, which is not known,
unfortunately; we will come back to
that. But if by any chance we know R
bar, then we only have to solve this
problem, and this is a fixed-point equation:
V equals some transform of V, which I will
call H. What you see on the right-hand
side is an
operator which acts on the vector
V, and this operator, you can see that it
is monotone: if I increase V it will give
me something bigger. And if I add a
constant, the constant comes out; such operators are called isotone. This implies that
this operator is nonexpansive in the
L-infinity norm, the norm
infinity. So you need to
compute a fixed point of a nonexpansive
map with respect to a very specific norm,
which is the norm
infinity. The solution is never
unique, because if you add a constant to V
on both sides you get another solution, but
basically that's the only source of
non-uniqueness: apart from that, this equation
has a unique solution up to a constant.
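Written out, the equation being described is the following (reconstructed from the verbal description; the reward indexing r(i,a,j) is an assumption):

```latex
V(i) \;=\; \max_{a}\; \sum_{j\in S} p(j \mid i,a)\,\bigl[\, r(i,a,j) + V(j) - \bar r \,\bigr],
\qquad i \in S,
```

which is the fixed-point equation V = H(V) for the operator H defined by the right-hand side.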
The problem is that R bar
I don't know a priori. I could try
to adjust it, right? Well, that
doesn't work. I could try to guess what R
bar
is, solve for V, and then try to
adjust; well, that doesn't work, because it
turns out that if this equation
has a solution then R bar must be
the optimal value. If you miss by
epsilon, this equation has no solution;
it's an if and only if. So you really have
to plug in there the exact optimal
value R bar, which is not known
exactly. That's a big issue. So you
have
a fixed-point problem for an operator that you
cannot compute, because you don't know R bar
and you don't know
P and you don't know r, so you know
nothing, and you want to
compute a fixed point, and you really
have to plug in there the exact R bar,
otherwise there is no
solution. And the operator has some
nice properties, it is nonexpansive, but still
it is nonexpansive in a nasty norm; it's not a Euclidean
norm, and most of fixed-point theory
works for good norms, for Euclidean
norms. Here that is not the case: you have the norm
infinity. Well, let me rewrite this
equation by calling this the Q-factor: all
the expression that is inside the
blue box, let's call it the Q-factor.
Then I can express V(i) in terms of the Q-factors
and rewrite this Bellman
equation in terms of the Q-factors.
The only change is that the max
operator moved inside: I'm replacing
V(j) by its expression in terms of the Q-factors. So that's another writing of the
same equation, and again this has a
unique fixed point up to a constant,
provided that R bar is exactly the
optimal value; otherwise there is no fixed
point. So this is the form we will
address. So here is what I just said:
we want a fixed point of a map H, which is
just an expectation. You see, I'm taking
the sum over all possible future states
according to the probability P, so I'm taking
the expectation, with respect to the next
state, of
the immediate reward plus some sort of future
reward, except that I have to remove this R bar
here. This map is nonexpansive, as I said, in the norm
infinity; it has a unique fixed point up to
a constant; but it might be difficult or
even impossible to evaluate, because, one, R bar is typically
unknown. To fix that, what I will do
in the iterations, instead of
computing a fixed point
for the wrong R bar, is
replace R bar by a function of the
state. This is not our idea; it
was proposed already in the 80s, to do
this trick. So we replace R bar with a
function f of Q (I will tell you what
function you should choose there) and
consider this new operator H star. Now I can
evaluate it, except that I don't know how
to compute the expectation, nor the
reward,
so I replace the expectation by a
sequence of
samples and try to do something with
that; that's the only information
we get, stage after stage. Well, if you do
that, then here is what the Q-learning procedure
proposed by
Borkar does. You say: let's suppose that I have
Q_{n-1}, an estimate of the fixed
point I'm looking for.
I observe the next state and the
reward, and I replace the expectation by
this sampling of the
operator, and then, by a simple
averaging, 1 minus alpha_n times the previous
iterate plus alpha_n times this part (which is
just a sample of the operator; there is
no expectation there, just a sample), I do
the averaging and I take this Q_n as my
new estimate of the fixed
point. In this iteration, you know, Q_n
is a
matrix which is indexed by state i and
action
a. So this is what is called
tabular RVI Q-learning: it says that for
every i and a I
update this value, so I take
independent samples. And what I'm
requiring from f, for technical reasons,
is very simple: it is simply that it is
homogeneous with respect to adding a
constant. If I do that, then this H star
now has a unique fixed
point, unique, not unique up to a constant,
so somehow the f of Q fixes the
constant. It has a unique fixed point,
and in fact it will be the only element
of Fix H (which was unique up to a
constant) such that f of Q star gives
me exactly the optimal
value. What I lose is that this operator
is no longer
nonexpansive, just because I added this
function f of Q, but this will not be a
problem. Examples of such functions: well,
take the average of the matrix,
or take the maximum; you know, any
normalization you can think of, think of f as a
normalization
function. So in this case, going back to
the example, if you take the maximum, the
fixed point is over
here, and if you look where the maximum
is, it is 71 over 8, which was the optimal value,
71 over
8, and this is the Q star; you fix the
constant with
that. So here is an example. On this
example I run RVI Q-learning, this
iteration, RVI, which is there; you can
simulate and run it, so here are some
numerical
illustrations. Well, you have to choose
your update parameter alpha_n; here it is, for instance, 1 over n plus
one, and here are 10 sample
paths. I'm plotting here f of Q_n and,
as expected, you see it concentrates
around R bar, which is 71 over
8; you see that all trajectories
converge. And in terms of expected
rate, if I look
at the fixed-point residual, the
error for the true operator H,
which I cannot
compute, still, this expected
value scales in this case like one over the
square root of log n. That gives you the
rate of convergence; maybe a bit faster,
because you see this product is still
decreasing, but I can tell you that your
error is roughly decaying as one over the
square root of log n. That is not very fast,
but it is what it
is, and it seems to be pretty much tight,
in this example at least. How about a
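A sketch of the tabular procedure described above, run on the two-state example with f = max and alpha_n = 1/(n+1). As before, the 0.9/0.1 row for (state 2, action A2) is an assumed value, and this is a reconstruction of the iteration from the talk, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

rewards = np.array([1.0, 10.0])        # reward depends only on the state reached
# P[i, a, j]: probability of moving from state i to state j under action a.
P = np.array([
    [[0.3, 0.7], [0.9, 0.1]],          # from state 1 under A1, A2
    [[0.7, 0.3], [0.1, 0.9]],          # from state 2 under A1, A2 (A2 row assumed)
])

def f(Q):
    return Q.max()                     # normalization fixing the additive constant

Q = np.zeros((2, 2))                   # Q-factors indexed by (state, action)
for n in range(1, 100_001):
    alpha = 1.0 / (n + 1)              # averaging step size alpha_n = 1/(n+1)
    Q_new = Q.copy()
    # Tabular RVI Q-learning: update every (i, a) pair with an independent sample.
    for i in range(2):
        for a in range(2):
            j = rng.choice(2, p=P[i, a])                 # sampled next state
            target = rewards[j] - f(Q) + Q[j].max()      # sampled operator minus f(Q)
            Q_new[i, a] = (1 - alpha) * Q[i, a] + alpha * target
    Q = Q_new

print("f(Q_n) =", f(Q))                    # should concentrate around 71/8 = 8.875 (slowly)
print("greedy policy:", Q.argmax(axis=1))  # expected: A1 in state 1, A2 in state 2
```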
question? Yeah, maybe I just missed it, wasn't paying
attention, but how do you handle, do
you need to do exploration, are you doing
exploration, or are you
assuming that, like, every policy... do I
need to do any
exploration, like, any exploration? So, I
mean, is this convergence, like,
convergence for a fixed policy, or are
you optimizing the policy as you go? I'm just
doing this
iteration over there okay so you are you
are picking the best action each I'm
taking the best action here yes when I
compute the maximum yes uh sample
maximum but it's a sample maximum yes so
it's not a policy I I take qn minus one
was my previous estimate of the optimal
policy of the Q
factors I
observe the new
state and I do a posterior what what
would be the the best
action starting from The Next Period on
so here is the current r i which is the
reward I get immediately and then I
adjust my policy for the next
State uh I take the best policy you are
are are you assuming that all states are
visited with positive probability for
all policies not necessarily I only
assume that if you give me any policy
any fixed stationary policy, the Markov
chain that results is unichain, it
has a unique stationary distribution, so there could be
states that are transient for that
policy yes and different policies could
have different transient states yeah I
mean I
guess if you start out with policy that
like never wants to visit some
State uh and then like you observing
good rewards like are you going to like
under this strategy are you going to um
like explore the whole
Space there is no policy
here um I'm not fixing the policy yeah I
look at the sampled state s(i, a, n), so the next
state you arrive and you take the best
action that you would take assuming that
your Q-factors are given by Q_{n-1}.
Eventually, if the Q converges to a
Q star, the argmax there will give you
the optimal policy the optimal policy
will come once you have converged, once
your Q_n converges. So if
I give you the Q star, this fixed
point, then by taking any u that
maximizes Q(i, u), or Q(i, a) maximized
over
a, this is the optimal policy. But
in this iteration I don't need to
specify any policy, so it's not a policy. I
guess my here's what I'm sorry maybe I
don't mean to dwell on this but I just
want to make sure I understand what's
happening so if you had a a state action
pair where there was a very rare High
payoff and usually the payoff was like
bad zero negative whatever and you
sampled it a few times and then uh you
saw that it was always bad to be here
and then you ended up in a policy that
like if there was a feasible policy that
where that state was
transient uh then like you may never
even observe the very high payoff
because you didn't visit it enough times
how do you and but if the optimal policy
needs to if the optimal policy does need
that big rare payoff are you guaranteed
to I mean that that it seems like you're
saying that you should you should get
there you should get there yeah the the
theorem tells you that if you choose
Alpha adequately it will converge to the
optimal solution
now if you have a state which usually
gives you bad
payoffs then this qn minus one when you
update it will usually update well with
the maximum of the previous one but if
they're bad, you're averaging with bad
payoffs, which
are usually very bad.
you say with with small probability you
get a super high payoff well when that
happen happens when when the system
jumps over
there then uh you will get uh this
payoff and will balance up and it will
raise the qn and will raise
the I I what ensures that we're going to
continue returning to the state that had
the bad
payoff so Q is a matrix you're
evaluating every state and every yes
every state and every action every time
yeah uh so you're forcing the iteration
is saying: I'm going to pick that bad state,
because you need to evaluate Q(i, a), i
being the bad state, with all the
actions so you're sampling every time
you're sampling every possible
transition yes oh you're not it's not
one sample path oh okay no sorry sorry I
thought it was one sample path no
I that's that's what we're currently
working on okay it's online learning
that. No, here it is tabular: for every i
you simulate your next state, for every i
and action you simulate what would happen,
and then you adapt. It's a
table. I had this idea, like I'd
seen this before and like another and I
thought that we were doing the other
thing what's the intuition of what Q is
That's because it's not the
policy, it's a sort of value, but it's a
relative value, because you know that
the optimal value doesn't depend on the
initial state, whereas these Q-
factors are sort of what small
advantage you get at state i, to
start with. So it's a sort of first-
order approximation of the optimal
policy, a first-order sensitivity with
respect to the action... sorry, with
respect to the initial
state, although in the long run the initial
state doesn't have
any relevance. Well, this gives you
a sort of first-order expansion. Okay, and
that's why you get this R bar over there. So
I have a final slide where I try to to
provide some intuition but it's purely
technical I don't have a good uh no
weird because Matrix question like okay
if I have the ability need to sample Q
like I can call sampling Q as much as I
want like I mean is this procedure any
better than like me just sampling Q then
forming the lp that solves the long run
average cost problem and then solving
the lp like um mean is there and is
there what is like if I were to do that
what strategy should I use to sample
yeah so the question is why shouldn't I
sample I mean there are several things
that you can do if you're out to sample
repeatedly you can estimate your your
probabilities by just doing a lot of
sampling, and then estimate your P a
priori, and then just solve for the
equilibrium uh there is not a big
Advantage as far as I know there is not
much uh difference between the two y
it's
basically basically the same I mean here
it's just you don't need to solve any
equations uh but you just do a sort of
uh update it's like you were doing uh
you were solving the linear equations by
a sort of iterative scheme okay and then
you know as the process evolves your
probabilities are adjusting themselves
so it's not really
different the the small advantage that I
see is that this can be more easily
adapted or extended rather and it's not
trivial to the case that you're
interested in this is the online case
where I don't have the right so I'm at
State I I'm only allowed to take one
action and observe what happens with
that action I cannot observe what if I
had play a different action here I'm
allowed to sample, for each state i, every
possible action and see for every you
control your system online you can
you're not allowed to do that; you cannot
go back and say, okay, what if
yesterday I had done something else. No. Okay,
but this is more challenging. We're
working on that right now, but the
techniques more or less give us
half of the results
already okay so
um I guess I'm I'm I'm running already
out of time no I think that we have at
least 15 more minutes okay so yeah I
will try to speed up. So here is
another rule: I take alpha_n equal to 1 over (n+1)
raised to some power, a power law. The sample
paths are a bit more noisy, and the
rate seems to go like 1 over the 10th root
of n; really slow. What is better, the
square root of log n or this
root of n? You can tell me both are
terrible and I would agree, but in
fact the log is better than this
polynomial, I mean power law. In order
to observe that this beats the log
(eventually it will beat it) you need
around 10 to the 9
iterations to have this root
of n beat the square root of log n. So
having a polynomial rate is nice, but
it only kicks in for n very large; no
one will do that, so it's pretty useless.
Well, here's the spoiler: all of this,
including the power of n over there and the
square root of log n, all of this can be
proved formally and with explicit error
bounds.
So I should probably cut through the
Krasnoselskii-Mann iterations quickly, but let me
give you some rough idea of how
we address this problem. I
take a more general problem, which is to
compute, or to approximate, a fixed point
of an operator T operating on a finite-
dimensional space, and the only
assumption is that T is nonexpansive
with respect to some norm. I'm assuming that T
is only computable up to some random
noise, so I have this stochastic
iteration; the deterministic case is when you take u_n equal to
zero. If you're interested in
stochastic gradient descent, well, the
operator T will be the identity minus 2
over L times the gradient of your objective
function; I'm assuming that you have a
convex function whose gradient is L-Lipschitz,
which is more or less the standard setting for
SGD. Then SGD is exactly this
KM iteration with a very specific operator,
which is the identity minus 2 over L times the
gradient, and in that case the norm will
be the Euclidean norm; you get nonexpansivity with
respect to the Euclidean norm. In our case the
operator T will be this H for which you
want to compute the fixed point, and here I'm only
assuming that it is nonexpansive with
respect to some norm; I don't ask
anything about the
norm. So that's the
iteration; it's a sort of successive
averages: I'm averaging the previous
iterate and its image, except that the
image I cannot compute, I can compute it
up to a noise
u_n. My standing assumption: the
alpha_n are averaging sequences, so it's
a sequence of scalars with the property
that the sum of alpha_n times (1 minus alpha_n) is infinity. You
may have seen this many times; in
different contexts this condition
comes up. And the noise: I'm assuming
it's a martingale difference sequence in L2, so
it's essentially a centered variable
with zero expectation and with bounded
variance, with finite variance, sorry. I
will denote theta_n the variance of the
noise; this will play a role, you have to
control that one, otherwise you are in
trouble. The variances are assumed to be finite,
but they could possibly grow with n,
which is essential to study
the Q-learning iteration: in the Q-learning
iteration the theta_n grows
like log n, and you can establish this type
of growth. So I'm assuming the noise
has finite variance, but the
variance may grow.
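A minimal sketch of the iteration and assumptions just described. The specific operator T, step sizes, and noise distribution below are illustrative choices of mine, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

def T(x):
    # Illustrative nonexpansive map (Euclidean norm): a rotation composed with a shift.
    theta = 0.5
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ x + np.array([1.0, 0.0])

alphas = [1.0 / (n + 1) for n in range(1, 5001)]   # averaging sequence: sum alpha(1-alpha) diverges
sigma = 0.1                                        # noise standard deviation (bounded variance)

x = np.zeros(2)
for alpha in alphas:
    u = rng.normal(0.0, sigma, size=2)             # zero-mean noise with finite variance
    x = (1 - alpha) * x + alpha * (T(x) + u)       # stochastic Krasnoselskii-Mann step

print("residual ||x - T(x)|| =", np.linalg.norm(x - T(x)))
```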
So let me just give you a very rough
idea of how we can proceed. Suppose that,
forget about the noise, you have this
sort of successive averaging. How can you
estimate the residual at your fixed
point, which is the natural measure in this case?
You would stop the iteration as soon
as this is small; how can you estimate
it? Well, if you had a contraction, that
would be easy, that's a
typical trick: this decays like a
geometric series. But here you just have
non-
expansivity. So what do we do? We observe
that x_n is an average of the previous
iterate and its image. I can
backtrack: I can say that x_{n-1} is
itself an average of the previous images,
and so forth, and finally you get
that x_n is some sort of average, along
your
past iterations, of the images, and the
weights are prescribed by the alphas; it's
very explicit.
Now suppose that we can estimate the
distance from x_m to x_l; for the moment
suppose that I have an estimate
d_{m,n}. Then what happens with x_n minus T x_n? I
replace, right, the value of x_n in
terms of these weights pi, then I do the triangle
inequality,
I use the nonexpansivity of T, I use my
bounds, and I get a bound for x_n minus T of x_n.
The question is, how can we find an
estimate of the distance between two
iterates m and
n? Well, we do the same, and here is
where optimization comes into play. We
do a matching: you see, x_m is an
average of some vectors, x_n is also an
average of some vectors; let's
match those weights, those
probabilities. So I take
a variable z, and I assume that
the blue weights are the pi_i^m and
the red ones are the red mass, and I will try to
match those by solving an optimal
transport problem, a linear
program,
because you see x_m minus x_n is just this
difference. If I take
z with those good properties (it
transports pi^m to
pi^n), I can express everything in terms
of a single sum, and
now let's do the triangle inequality
here. So if I do the triangle inequality and use non-
expansivity, I get the previous
iterates, and if I assume that I already
computed bounds for the previous iterates,
I have a new bound for m
and n. So the idea is to do this
recursively: use the previous bounds
d_{i-1, j-1} to construct the next
bound d_{m,n}, and then proceed
inductively. So I define my d_{m,n} as the
optimal transport problem for sending pi^m
to pi^n, where the
transportation cost will be the
distances that I estimated
previously. It seems a bit involved, all
of this,
but it has one advantage:
you see, the operator T disappeared, it plays
no role; this thing is just an iteration
of real
numbers.
You take the d_{i-1, j-1}; it's a double
induction on m and n, but you can build
that, and this has a lot of properties:
this d_{m,n} turns out to be a distance; it
turns out there is a greedy algorithm to
compute them, you don't need to solve the
linear program, I can give you a greedy
procedure to compute them; and, more
importantly, you see, we went through all
of this to find these bounds, and these
bounds finally provide you an estimate
for the residual you get at the n-th
iteration. And these d_{m,n} you can compute
a priori: they only depend on
the alphas that you
chose. Given your alphas you compute your
d's, with the d's you compute the bounds R_n, and
you're done, you have your
estimate, independent of the operator T,
independent of the space, independent of
the dimension; it could even work in
infinite dimension.
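To make the "x_n is an explicit average of past images" step concrete, here is a small sketch computing those weights from the alphas. The formula is a straightforward unrolling of the averaging step; it is a reconstruction, not code from the paper:

```python
import numpy as np

def km_weights(alphas):
    """Weights (pi_0, ..., pi_n) such that, in the noiseless KM iteration
    x_k = (1 - a_k) x_{k-1} + a_k T(x_{k-1}),
    the n-th iterate is x_n = pi_0 * x_0 + sum_k pi_k * T(x_{k-1})."""
    n = len(alphas)
    pi = np.empty(n + 1)
    pi[0] = np.prod([1 - a for a in alphas])                          # weight on x_0
    for k in range(1, n + 1):
        pi[k] = alphas[k - 1] * np.prod([1 - a for a in alphas[k:]])  # weight on T(x_{k-1})
    return pi

alphas = [1.0 / (k + 1) for k in range(1, 11)]   # e.g. alpha_k = 1/(k+1), k = 1..10
pi = km_weights(alphas)
print(pi, pi.sum())   # the weights are explicit in the alphas and sum to 1
```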
So the bounds you obtain here are very
robust: they work for all operators, and
you might say, okay, that's too
much, it means that we
overestimated everything. Well, no: I can
construct an operator which
satisfies each and every inequality as
an
equality, where x_m minus x_n in norm is
exactly equal to
d_{m,n}, and your residual over here, the
R_n, the norm of x_n minus T of x_n, is
equal to this
R_n. There is an operator which, for every n and m, satisfies
everything with
equality. That doesn't mean that, if you give
me a specific
operator, you cannot do better; but it
does mean that the technique, in the
worst case, gives the best bound you can
expect.
Maybe I missed an assumption or
something; I mean, I've seen the
contractive mapping stuff before, but
I've never seen this:
is there some assumption on, like,
the boundedness of the space, so that
the nonexpansive mapping converges? It
comes through, I'll come to it. Okay, so
here is exactly this issue. For the
moment I'm assuming that my operator goes
from R^d to
R^d. So here is the result. There
is an original result of Baillon and Bruck, and
then we completed it in 2013: a
special case was treated by Baillon and Bruck,
who stated this as a conjecture, and
we proved it in 2013. So let
me define kappa equal to two times the
distance from x_0 to Fix T; this already
requires that I'm assuming that T has a
fixed point, okay, so all my iterations will
live in the ball centered at x_0 with
this radius, and in fact the operator
T is confined to that ball. Alternatively,
if your operator T
is operating from a convex bounded set
into itself, I could take kappa to be just the
diameter of that set, okay. That answers it, I
think. Yeah, I mean, I was thinking of something
like, you know, x maps to x plus one, which is a
nonexpansive mapping that
will go off to infinity. Yeah, no,
this already ensures that you have a
fixed point; here I'm assuming it a
priori, and if the set is bounded it will have one.
So take kappa like this, then let's define
tau_n as the sum of alpha_k times (1 minus alpha_k),
which by assumption goes to
infinity, and define sigma of y; this
function, the minimum... forget about the
minimum, it is one over the square root of pi times
y.
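In symbols, the bound about to be stated reads as follows (my transcription of the verbal statement, with kappa and tau_n as just defined):

```latex
\|x_n - T x_n\| \;\le\; \frac{\kappa}{\sqrt{\pi\,\tau_n}},
\qquad \kappa = 2\,\operatorname{dist}(x_0,\operatorname{Fix}T),
\qquad \tau_n = \sum_{k=1}^{n} \alpha_k (1-\alpha_k).
```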
Then, by analyzing this recursive optimal
transport problem, you can
prove a more explicit bound: you
can get that this norm, the
residual after n iterations, is
less than or equal to kappa, the two
times the distance to the fixed point,
divided by the square root of pi times
tau_n.
And you could wonder why pi
appears here; it's
mysterious. What I can tell you is that
you cannot improve it: it is the best you can
expect for that bound. If you give me
anything smaller than one over the square root of
pi, I can build you an example that
violates the
inequality. Was this for any norm, or
Euclidean? And the pi is still weird, I mean, you
it comes out from the analysis when the
analysis is a bit involved because you
reinterpret your Z the the optimal
transport well
first you take your optimal transport
you bound it by a suboptimal transport
very special one this suboptimal
transport you re interpret it as
a as a transition probability for a mark
of chain which has nothing to do with
the mar decision a mark of chain in the
is z in
square and when you Analyze This you
boil down to uh to the gamess
ru you go you ask the probabilities of
what estimate you can give for the Glam
ruin you end up with a function which
is Gus hyper geometric function you have
this bounds in terms of this special
function and that special function gives
you rise to one square root of Pi so
it's not about normal distribution
that's what one would expect no no no
it's not that it's a bit more involved
it comes from a probabilistic analysis
which was not obvious at the beginning
in I mean you have this recursive
optimal
transport you you take something which
is suboptimal that you reinterpret as a
mark of chain gives you gous ruin and
eventually gives you a very explicit
bounce and what's interesting that you
can only proove that but except you did
a lot of
majorization but the
majorization somehow are not uh over
stated this this bound is tight you
cannot improve this we we proved with
Mario
Bravo okay in particular you can have if
goes to Infinity then the residual goes
to zero of course one square root and
and it tells you what what the speed is
one/ square root of T you already got
but this of course was for for the
deterministic case there's no noise here
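A small numerical check of the bound just stated, reusing the rotation-plus-shift toy map from the earlier sketch; the operator and the constant step size are illustrative assumptions, not from the talk:

```python
import numpy as np

theta = 0.5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b = np.array([1.0, 0.0])
T = lambda x: R @ x + b                        # nonexpansive in the Euclidean norm
x_star = np.linalg.solve(np.eye(2) - R, b)     # unique fixed point of this particular map

x0 = np.zeros(2)
kappa = 2 * np.linalg.norm(x0 - x_star)        # kappa = 2 * dist(x_0, Fix T)

x, tau = x0.copy(), 0.0
for n in range(1, 2001):
    alpha = 0.5                                # constant 1/2 is an averaging sequence
    x = (1 - alpha) * x + alpha * T(x)         # noiseless Krasnoselskii-Mann step
    tau += alpha * (1 - alpha)                 # tau_n = sum of alpha_k (1 - alpha_k)
    residual = np.linalg.norm(x - T(x))
    bound = kappa / np.sqrt(np.pi * tau)       # the kappa / sqrt(pi * tau_n) bound

print("final residual:", residual, "<= bound:", bound)
```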
So what about if I introduce
noise? First you can say, let's
introduce some error e_n, but suppose that
I can control this error: it's
not a stochastic noise but some,
let's say, precision; I compute the
operator T and I have an error that is
under my
control. Well, it turns out that this
bound that I have here, you
see, this bound in this
equation, will still be there, but
there should be some
extra term to account for the errors,
and here's the ugly expression. You
have your sigma of tau as before, and here
you have the errors; of course, if the
errors e_{k+1} are zero it is the same as
before; otherwise you have here a sort of
convolution term in between. It's ugly, I
don't expect you to interpret or to
retain this, but you have some bound
that you can control, except that this
requires the errors to be controlled. For instance,
if I take e_n such that the sum is
somehow controlled, then I can prove that
the sequence still
converges. So okay, here are some examples of
speeds of convergence for different
controls of the
errors.
My only point here (don't look at the
equations; maybe, if you're interested, you
can look at them later), my point is that these
bounds can really be used and
translated not only into rates of
convergence; they can give you explicit
bounds: this is less than or equal to a log
factor over the square root of 2n, multiplied by this
constant. So you can really exploit this very
precisely.
So, okay, here is the basic tool, a Markov
chain with rewards, but I will skip
that, I'm running out of time, so let me
skip
this. So let's go back to the
stochastic KM iteration, and eventually come back
to the stochastic Markov decision process and how
all of this applies. Well, the iteration is
of this form, where now u_n is an error
which is not controlled; it is a stochastic
term. If I want to apply the bound
that we obtained previously, well, it would
require that the noise goes to zero
almost surely, or something like this, and
moreover that this sum of alpha_k times the norm of
u_k is finite. But since the sum
of alpha_k is divergent, this requires
that u_k converges to zero fast enough,
and unfortunately u_k is not under my
control;
it is stochastic. If I have a normal
noise, well, I cannot expect something
like this, not even in
expectation; it's very restrictive.
So the trick is to modify the KM iteration
and interpret it as an averaged
iteration. Let me just
skip the details: there is some trick to replace u_n
by a sort of averaged noise, and then
you have much better chances
that this averaged noise goes to
zero, and everything is controlled in terms
of the
variance. And
finally, you can get convergence under
some control on the variances; but, maybe
more meaningfully, you get explicit
error
bounds, as before, in terms of
controlling the variances. That
becomes a bit technical, but it's the
same ideas that you just adapt
over and over. So if you go to Q-learning, here it is;
I don't need to recall it, but this was
the setting: you have an operator H
which is nonexpansive, blah blah blah;
here is the iteration that we propose, it's
exactly of the form of the KM iteration, and by
using this methodology developed in
terms of optimal transport you get
this: take alpha_n,
you're averaging with a power law; then, almost
surely, f of
Q_n (remember, the f was replacing the R
bar), well, in fact it will converge to R
bar; also the trajectory Q_n will converge
to the optimal solution in norm infinity
(well, every norm is equivalent, so it doesn't
matter): Q_n converges to this Q star, which is
the unique solution with f of Q star equal to R
bar. And moreover there is an explicit
constant (this you have to look up in the
paper) such that the fixed-point error is
bounded by kappa over the square root of tau_n.
This gives you the rate of
convergence.
So I think I can stop
there. If you're interested, we can
later discuss how this also applies to
stochastic gradient descent and how it
recovers even pretty recent results
from last year, but it recovers them
from a technique which doesn't have anything to
do with optimization; it's
just fixed-point
iterations.
And, for your question: once you have
the Q star, or any good enough approximation,
you derive the optimal policy by saying,
for each state, I look at what the
optimal action is.
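In code, the policy-extraction step just described is a one-liner over the tabular Q-factors (continuing the earlier sketches):

```python
import numpy as np

def greedy_policy(Q):
    # Q is the tabular matrix of Q-factors, indexed by (state, action).
    # In each state, pick an action attaining the maximum Q-factor.
    return Q.argmax(axis=1)

# With the Q computed by the RVI Q-learning sketch above, this should give
# action A1 in state 1 and action A2 in state 2 for the two-state example.
```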
What remains to be done is, as you
said: what do I do when I don't have
the possibility to
resample and I have only one sample path
of the process? We are working on
that, with Mario and with Bruno Ziliotto, a French
colleague; he's coming to
Santiago in a couple of weeks, and we
hope to finish that, so we expect
to have news within, you know, the next months;
by the end of this year maybe we can say
something. This is already
slow, so we do not expect that the
online version will be faster, for sure
not. But then there are many
variants; this is one possibility,
and there are many variants with variance
reduction, so there's still a
lot to
do. But the take-home message for me
would be that,
since we're talking about it,
all of this is basically concerned with
computing, independently of the iterations
that you do, a
fixed point of a nonexpansive map, where
nonexpansive is with respect to a bad
norm, the norm
infinity.
So,
when you think of that, there is this
technique based on optimal transport
that might be useful. That's the take-home
message: keep in mind that you can
estimate this.
And the mathematics is very nice; I
learned a lot of combinatorics and
special functions and
probability.
That's it, thank you very much. Thank you,
Roberto. Oh, thank you. Let's have a copy.
[Music]
yes