Giulio Biroli - Generative AI and Diffusion Models: a Statistical Physics Analysis
Summary
TLDR: This talk presents recent research on diffusion models, exploring the progress of generative AI and its physical interpretation. Diffusion models are the state of the art for image and text generation, which makes them a particularly interesting topic. The talk explains in detail how diffusion models work, in particular the backward process and its physical properties, and discusses how the dimension of the data and the number of data points affect the model's performance. It also touches on the theoretical background of diffusion models and mathematical results on their accuracy, and closes with several important open problems that researchers are working on.
Takeaways
- 🌟 Diffusion models are very interesting, and many insights from physics apply to them.
- 🔍 Diffusion models are a major breakthrough in generative AI, producing state-of-the-art results in image and text generation.
- 📈 Diffusion models gradually turn an image into white noise, and generate new images by learning to turn white noise back into an image.
- 🔄 The diffusion process consists of a forward process and its time-reversed backward process, and the relevant time scales matter.
- 🎯 Approximating the score function plays a central role in training diffusion models.
- 📚 In theory, a diffusion model with a sufficiently well-approximated score function can approximate the data distribution well.
- 📈 The balance between the data dimension and the number of data points is crucial, and the model's behavior differs across time scales.
- 🧠 When the data distribution has unevenly weighted classes, a diffusion model may fail to capture the correct weights.
- 🔧 Training diffusion models involves the relation between data dimension and number of samples, as well as the choice of approximation class.
- 🌐 Diffusion models also apply to high-dimensional data, where the interplay between dimension and number of samples is key to solving hard problems.
- 🚀 Research on diffusion models keeps evolving, tackling more complex problems and real applications.
Q & A
What are diffusion models?
-Diffusion models are a method used for image and text generation, and they are behind many of the recent breakthrough results in generative AI.
Why have diffusion models attracted so much attention?
-Because they can generate high-quality images and text, a capability recognized as valuable in many fields.
How do diffusion models work?
-They start from a noisy image and progressively reduce the noise, eventually producing a clean image.
What is mathematically elegant about diffusion models?
-Their simplicity: they are relatively easy to understand and implement, yet they produce effective results.
What is needed to train a diffusion model?
-One needs to understand the time-reversal process and learn the score function; typically, deep learning techniques are used to approximate the score function from data.
How is the quality of images generated by diffusion models evaluated?
-Usually visually, and also in terms of sharpness, resolution, and realism.
What are the challenges in using diffusion models?
-Data dimensionality, the amount of data, and proper training. In particular, processing high-dimensional data is computationally expensive and requires good approximation techniques.
What are the application areas of diffusion models?
-Image generation, text generation, speech synthesis, data restoration and augmentation, as well as creative fields such as art and the discovery of new materials.
What impact is expected as research on diffusion models progresses?
-Higher-quality image and text generation, better handling of missing or noisy data, and contributions to societal challenges such as medicine and rapid disaster response.
How do diffusion models compare to other generative models?
-They stand out for high-quality image and audio generation, require relatively little pre- and post-processing, and have a simple structure that keeps computational cost manageable.
What should one pay attention to when implementing a diffusion model?
-Choosing an appropriate dataset, designing a suitable network architecture, and tuning the learning rate and other parameters. It is also important to use appropriate regularization to avoid overfitting and underfitting.
Outlines
🤖 Introduction to diffusion models
In this section the speaker introduces diffusion models and their interesting properties, emphasizing that there is much to explore, especially from a physics perspective. Diffusion models are a recent breakthrough in generative AI, achieving the best results in image and text generation. The speaker explains how diffusion models work, focusing on their simplicity and elegance.
🧠 The mathematics and training of diffusion models
This section covers the mathematical background and training process of diffusion models in detail. The speaker introduces the concept of time reversal from physics and explains how diffusion models recover images from white noise. He also discusses how the probability distribution of the generated data is approximated, and how deep learning techniques are used to find the parameters that minimize the training loss.
📈 Theoretical results and the role of dimension
This section presents theoretical results on diffusion models and the role of the data dimension. Based on recent work, the speaker shows results bounding how close the model's distribution is to the data distribution. However, these bounds leave the role of the data dimension hidden in constants, and the trade-off between dimension and the amount of data needed remains an important open problem.
🌐 The competition between dimension and number of data
This section explains the competition between the data dimension and the number of data points. The speaker shows how diffusion models behave on high-dimensional data, and discusses, through the relation between the number of samples and the dimension, how much data is needed and in which regime the model is accurate. This analysis provides an important perspective for understanding and improving diffusion models.
🧬 Data distribution and symmetry breaking
This section discusses data distributions and symmetry breaking in detail. Using a simple model, the Ising (Curie-Weiss) model, the speaker illustrates symmetry breaking between different data configurations. He considers a system with two slightly imbalanced states and explores how the diffusion model recovers them. This analysis provides a deeper understanding of the generative process of diffusion models and its limitations.
🕒 Conclusions on the competition between time and dimension
In this final section, conclusions are drawn about the interplay between time and data dimension. The speaker discusses two distinct time regimes, each related to the generative process and to symmetry breaking, and revisits the competition between dimension and number of data to characterize the accuracy and limitations of diffusion models. He closes with future research directions and practical applications.
Keywords
💡diffusion models
💡generative AI
💡score function
💡stochastic dynamics
💡machine learning
💡high-dimensional data
💡equilibrium thermodynamics
💡time reversal
💡approximation
💡deep nets
💡theoretical results
Highlights
Diffusion models are a breakthrough in generative AI, particularly in image and text generation.
Diffusion models transform images into white noise and learn to reverse the process, generating new images from noise.
The beauty of diffusion models lies in their simplicity, which is also a key factor in their success.
The role of the score function in diffusion models is crucial as it allows the model to learn how to go backward in time.
Machine learning techniques are used to approximate the score function, enabling the generation of new data.
The study explores the role of dimension and the number of data in diffusion models, particularly in high dimensions.
Theoretical results on diffusion models are scarce, indicating a vast area for potential research and discovery.
The concept of effective temperature is introduced, relating to the noise added to the system during the diffusion process.
The importance of the early stages of the backward process in capturing the correct probability distribution is discussed.
The potential role of symmetry breaking in diffusion models and its impact on the generation process is highlighted.
The study finds that the number of data and dimension have a competitive relationship in the performance of diffusion models.
The role of approximation classes and network architecture in the performance of diffusion models is questioned.
The potential applications of diffusion models in areas such as image completion and copyright issues are mentioned.
The importance of stopping the diffusion process at the right time to avoid collapsing back to the original data is discussed.
The study concludes that diffusion models can be considered good depending on what is being observed or asked of the model.
Future research directions include understanding the role of the exact score and the competition between the number of data and dimension.
Transcripts
All right, okay. So today I would like to tell you about the very recent work that I've done in collaboration with Mézard on diffusion models. I find diffusion models very, very interesting; I think there is a lot to do, especially from physics. So what I want to do today is give you an introduction. I will also go through the details of the math, because it's very simple — this is also the beauty of it — and then I will tell you what we did. What should I click to change the slide? All right, okay.
So I think generative AI is really one of the breakthroughs that have been achieved in the last years, and clearly there are impressive results in image and text generation. Diffusion models are the method used in the state of the art for image generation; before, people used GANs, but starting from 2020 diffusion models are really the ones that are used. So here I just give you an example of images generated by diffusion models, and we will see what these diffusion models are. I'm sure that many of you have played with DALL·E or its equivalent from Google; diffusion models are used in those text-to-image applications, and I created these pictures just yesterday for you, using one of them, just to show you what they can do.
All right, so what I want to do next is really explain to you what diffusion models are, and diffusion models are something very simple. Let me introduce a little bit of notation. You start with a set of images; let's call them a^μ, so each one is a vector in dimension n, and μ runs from one to P, so P will be the number of data. What the model does is the following: you start a Langevin equation at time t equal to zero, where the equation is initialized to the value corresponding to the image, and then you run a very simple Langevin equation, one of the simplest possible. I'm sorry, I know there are mathematicians here, and this dx, I know, is not well defined, but informally there is no problem, as we do in physics. When you run this equation, starting from this initial condition, at long times it converges to independent, identically distributed Gaussians with mean zero and variance one for each component of the vector.
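The forward noising step can be sketched numerically. This is a minimal sketch, assuming the standard Ornstein-Uhlenbeck form dx = −x dt + √2 dB often used in this literature (the talk does not spell out the equation, so the exact normalization is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000                          # dimension of the "image" vector
a = rng.uniform(-3, 3, size=n)    # stand-in for one data vector a^mu

# Euler-Maruyama discretization of dx = -x dt + sqrt(2) dB,
# whose stationary law is N(0, 1) independently in each component.
dt, T = 1e-3, 10.0
x = a.copy()
for _ in range(int(T / dt)):
    x += -x * dt + np.sqrt(2 * dt) * rng.standard_normal(n)

# At long times the image is forgotten: mean close to 0, variance close to 1.
print(round(float(x.mean()), 2), round(float(x.var()), 2))
```

Each component forgets the initial image exponentially fast and relaxes to a standard Gaussian, which is the "white noise" the speaker refers to.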
Well, visually, what you do is start from a nice image, then you run this equation, and you transform the image into white noise. You also know exactly what the probability law of x at time t is: if you know the probability of the data, let's call it P₀(a), it's very simple — you just integrate this equation, and what you find is that you convolve the initial distribution with a Gaussian. So you know what the probability P_t(x) at a given time is, and what this Gaussian does is put noise on the image. Now, what these diffusion models learn is how to go backward in time. You see, you transform the images into white noise; well, if you learn how to go back, so to transform white noise into an image, then if you want to generate a new image it's easy: you just draw a white noise, you let it go backward, and you get a new image. It looks miraculous, but this is really what these models are doing. Now let me show you a little bit how they do this.
So the idea is the following: how to go backward in time? In physics we learn about time reversal; this time reversal is a little bit different from what we do in physics, but it is a generic property of stochastic equations. If you know what P_t(x) is — and we do know what P_t(x) is — then you take the gradient of its logarithm with respect to x; this is what is called the score function. And now, if you use this score function and you define a Langevin equation which is just the backward-in-time Langevin equation, what this equation does, when you integrate it, is really go backward: it transforms white noise into the original distribution P₀(a), and this is an exact result. Now, the problem is that in general you don't know the score function; but if you knew the score function, this would allow you to go back in time and transform white noise into the data. So the idea is that you can use machine learning techniques to try to learn this score function, and if you learn it well enough, then you can generate new data. At least for me, who worked a long time on stochastic dynamics and Langevin equations, if you think about it, it just means that you start from white noise and you learn what are the forces that act on this kind of particle — forces that transform the white noise, that make it move — so that this white noise becomes, I don't know, the face of a man or the picture of a dog.
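As a toy illustration of the exact time reversal — my own sketch, not code from the talk — take 1D Gaussian data, assume the forward equation dx = −x dt + √2 dB, and use the reverse-time drift of the standard backward SDE, x + 2·score:

```python
import numpy as np

rng = np.random.default_rng(1)

sigma0 = 0.5                      # std of the 1D "data" law P0 = N(0, sigma0^2)

def var_t(t):
    # variance of P_t under the forward process dx = -x dt + sqrt(2) dB
    return sigma0**2 * np.exp(-2 * t) + 1 - np.exp(-2 * t)

def score(x, t):
    # exact score: d/dx log P_t(x) for the Gaussian P_t
    return -x / var_t(t)

T, dt = 5.0, 1e-3
x = rng.standard_normal(20_000)   # start from white noise, approximately P_T
t = T
while t > dt:                     # backward-in-time Langevin equation
    x += (x + 2 * score(x, t)) * dt + np.sqrt(2 * dt) * rng.standard_normal(x.size)
    t -= dt

print(round(float(x.std()), 3))   # close to sigma0 = 0.5
```

Because the score here is exact, the backward run transports white noise onto the data law N(0, σ₀²); with a learned score, the same dynamics is only as good as the approximation.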
this idea of go back over time and using
model then in 2015 in a paper by
physicist pH the boundary with with Mach
learning was still physicist and if you
see actually the title of the paper was
different thermodynamics and you will
see we discuss also uh some idea about
of equilibrium thermodynamics at the end
so while this was done in 2015 and
somehow remain silent or until 202 20
and then 2020 really picked up the lot I
think because people learn actually how
to learn the score so in this paper I
mean they proposed the idea but then
starting from 2012 they they learn how
toar they understood how to learn this
score from data and and then it started
to be used everywhere so well for a nice
review you can see this this review by
Yan which is I 2022 2023 so let me tell
you now how you learn
The idea is the following. You have to learn this function — these are the forces that you have to use — and this function is a function in high dimension. If you have data, in principle this is a regression problem, in which you parameterize the score, typically with a deep-net function, and then this function should be as close as possible to the true force F. Let me for the moment define the population loss: so for the moment there are no data; there is the exact P_t(x), the probability distribution at time t, which is the noised version of the data, and what you do is construct a parameterization s_θ of this function in terms of parameters θ, and you want to minimize this loss; typically one uses some kind of deep net of this form. Now you can work a little bit around this population loss: you just develop the square, so you have one term which is s_θ squared, then you have the cross product, and then you have F squared — but F squared has no parameters in it, so you don't care about it. If you look at the cross term, well, the score is the gradient of log P, which is the gradient of P divided by P; this P and the P of the measure cancel, so what you get is ∫ dx s_θ(x) · ∇P_t(x). And what is ∇P_t? If you take P_t(x), which is a convolution with a Gaussian, the gradient with respect to x acts on the Gaussian, so you just pull down a factor of the form −(x − a e^{−t}). So at the end you end up with this kind of loss, which is the expectation over the whole stochastic process, over x and a, of the square of s_θ(x) plus the term with x − a e^{−t}. And now it is very easy to turn this into a problem of empirical risk minimization, because once you have the data, for each image you run the Langevin dynamics, so you have one trajectory — you could run multiple trajectories, but let's keep it simple — so for each image you run one trajectory, and what you have to do, for each data point and each trajectory, is just minimize this empirical loss: find the parameters θ that minimize it, and if you find the parameters, then you find an approximation of the score. You do it for a fixed number of times — you discretize the Langevin dynamics in small steps — and once you have it, you can run your backward Langevin dynamics and you can generate; in principle, very simple. Of course there is a discretization here, and then there is also the question of how well these things really work.
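The passage from the population loss to the empirical one can be made concrete in the 1D Gaussian case, where a linear ansatz makes the minimization a closed-form least-squares problem. This is a sketch under my own conventions, with noising x = a e^{−t} + √(1 − e^{−2t}) ξ:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data: P0 = N(0, sigma0^2); forward noising x = a e^{-t} + sqrt(1 - e^{-2t}) xi.
sigma0, t, P = 2.0, 0.7, 50_000
a = sigma0 * rng.standard_normal(P)
delta = 1 - np.exp(-2 * t)
x = a * np.exp(-t) + np.sqrt(delta) * rng.standard_normal(P)

# Denoising score matching with a linear ansatz s_w(x) = -w x:
# minimize the empirical loss sum_mu ( s_w(x_mu) + (x_mu - a_mu e^{-t}) / delta )^2,
# which is ordinary least squares in w and solvable in closed form.
target = (x - a * np.exp(-t)) / delta
w_hat = np.sum(x * target) / np.sum(x * x)

# The exact score of P_t = N(0, sigma0^2 e^{-2t} + delta) has slope 1 / var_t.
w_exact = 1 / (sigma0**2 * np.exp(-2 * t) + delta)
print(round(float(w_hat), 3), round(float(w_exact), 3))
```

The fitted slope converges to the exact one as the number of samples P grows; the high-dimensional version of exactly this estimation problem is what the rest of the talk analyzes.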
What I want to do now, before telling you our result, is to tell you what is known from the theoretical point of view, and actually what is interesting is that it is very, very little at this time. Many physicists who work on these machine learning problems start their introduction by saying that it's a very popular problem but very little is known, especially for deep nets; well, there is a little more than that — there is a long tradition in math and computer science on this — but in this particular case there is really very little known, meaning that even in math and computer science there are not many works. The state-of-the-art theoretical results have been obtained by young researchers, and these are the kinds of results they have found. Very informally, what they tell you is that if you find an approximation such that s_θ(x) is close enough to the true score, and if the distribution of the data is regular enough, then you can find positive constants such that the probability distribution that you get from the model is close enough to the distribution of the data — close enough in total variation distance, for those of you who know it; otherwise, just think of it as a distance. And you see that this bound contains different factors: big T is the time for which you run the Langevin dynamics, because, well, you cannot run it for an infinite time — you start from the data, you run the dynamics, at a certain point you stop, and then you go backward in time. So this tells you that you have to go for a long enough time; but there is more: you see that when you run long enough in time you may have a problem, because one of the terms can blow up. You can keep it under control if you take an approximation which is good enough, and if you make additional hypotheses on the data distribution, then this term becomes a power law, so it's not as bad as it looks.

Okay, so this is the kind of result that they have, and if you look at this result there is clearly one, or maybe two, elephants in the room. The first thing is that we all know that one big issue, what makes the problem difficult, is that the data are very high dimensional — which is what was discussed earlier — and then there is the curse of dimensionality, and you should discuss how many parameters you need; and here, well, the dimension is not present: it is hidden in all these constants. So there are really many important questions that are open, and where we can help to make progress. Let me tell you some, in no particular order. The first, and I think the most important one, is: what is the role of the dimension of the data in all this business? And if you try to understand this, clearly it is also related to how many data you need — so it's both the role of the dimension of the data and how much data you need to have a good diffusion model. I will also try to tell you what that means: one can ask what a good diffusion model is, and it's not a trivial question — a diffusion model produces an approximation of a probability distribution, these are probability distributions in high dimension, and what a good approximation is in high dimension is not a trivial question. And then I think there is another question, which I will not address at all, which is: what is the role of the approximation class? Why do they use what they call a U-Net, some kind of deep net, to approximate the score? Of course one should ask what is the role of the number of parameters that are used, why the architecture is important — which are exactly the kinds of questions that were addressed in the previous lecture, but which one should address also in this context.

All right, so what I want to tell you in the following is what we did, which is a first attempt, a first study, of the role of the dimension and also of the number of data in the case of diffusion models. Since there is nothing on the role of the dimension, we can start with the simplest model. So what is the simplest model?
Gaussian data. Okay, Gaussian data, of course, is simple, but here we take it high dimensional, so this will be the first example that we study. So, high-dimensional Gaussian data: again the same notation, n is the dimension and P is the number of data, and now we consider the limit in which the dimension is very large and the number of data is very large, and each data vector a^μ is just a Gaussian vector with mean zero and a certain covariance C₀, where C₀ is an n×n matrix. Since n goes to infinity, we should say something about its properties, and the property will be that the density of eigenvalues of C₀ converges to some well-defined function.

All right, then let's apply the idea of the model. So again, I remind you: you take a certain number of data, then you run your Langevin equation. Now, in this case, since P₀ is Gaussian, P_t(x), the distribution of x at time t, is the convolution of the initial Gaussian with a Gaussian, and the convolution of two Gaussians is a Gaussian, so it means that P_t(x) is Gaussian. Which means that the exact score — the score is the gradient of the log of P_t, the log of a Gaussian is a quadratic function of x, you take the derivative — so the exact score is linear in x. So the exact score has this form, you can compute it exactly, and the matrix W(t) is just given by this expression; you can see that when t is zero the noise term goes away and you get just the inverse of C₀.

[Question: how do you define the dimension of the data — not of this model, of real data?] The dimension — I mean, for an image I just define it as the number of pixels, and then maybe there is the color; the most naive definition possible.

So when t is equal to zero, you see, W is just the inverse covariance of the data, and when t goes to infinity this term goes away, that one goes away too, and you just get the identity times the temperature; so the system is very simple — the score is very simple at long times. All right, so this is the analysis in the exact case, but of course we don't want to understand only the exact case: we want to understand what happens when we have a certain number of data and a certain dimension of
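For the Gaussian case the exact score matrix can be written down explicitly. A sketch, assuming the forward process dx = −x dt + √2 dB, for which P_t is Gaussian with covariance C₀ e^{−2t} + (1 − e^{−2t}) I:

```python
import numpy as np

rng = np.random.default_rng(3)

# Gaussian data in dimension n with covariance C0 (here a random SPD matrix).
n = 50
A = rng.standard_normal((n, n))
C0 = A @ A.T / n + np.eye(n)          # well-conditioned n x n covariance

def W(t):
    """Exact score matrix: score(x, t) = -W(t) x, where
    P_t = N(0, C0 e^{-2t} + (1 - e^{-2t}) I)  (convolution of two Gaussians)."""
    Ct = C0 * np.exp(-2 * t) + (1 - np.exp(-2 * t)) * np.eye(n)
    return np.linalg.inv(Ct)

# t = 0: W is the inverse covariance of the data; t -> infinity: W -> identity.
print(np.allclose(W(0.0), np.linalg.inv(C0)))
print(np.allclose(W(20.0), np.eye(n), atol=1e-6))
```

The two limits check the statements above: at t = 0 the score matrix is the inverse covariance of the data, and at long times it reduces to the identity.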
the data. So, what should you do in practice? And please, ask questions if something is not clear, because I think I have only a limited number of minutes left.

[Question: when they learn in practice, do they learn completely different networks for every time t?] No, it's the same network; the time t is like a variable of the network. Actually, they don't even do it time by time: they just draw the different times at random. [Question: ...] Yes, exactly. Well, sometimes — again, maybe we'll come back to this if there is time — you see, here I defined the loss at a given time; what they do is define a sum over the times, maybe with some weighting, so they weight the initial times more than the late times, and then they just minimize everything in one go. Okay, so in this case of Gaussian data, let's work out what they do in practice.
In this case we are lucky: we know the expression of the exact score — for images you don't know it, but in this case we do. So we can say: okay, I know that the score is linear, so I should use a linear score, and I should try to get from the data the matrix W that I have to use. How would you do it? Well, you take this linear ansatz, you plug it into the empirical loss, and then you use exactly the form I told you: this is linear in x, so you just develop, you differentiate with respect to W, and at the end, in this case, you get an exact expression for the empirical W that you get from the data. And the empirical W can be expressed in terms of matrices which are nothing else than empirical covariances of the process. For example, this matrix D: you sum over the data, so for each data point you have a trajectory, and what you are doing here is summing over the P trajectories the products x_i^μ x_j^μ, so this is nothing else than the empirical covariance at time t of the process; and M, similarly, is a memory matrix. So if you get D and you get M, you plug them in here, and this is your empirical estimation of W. Now, this is what you get empirically, and this is actually a lucky case, because you even know the exact score, so there is no problem of approximation class on top. The question is: if you are in high dimension, how much data do I need for this empirical W to be equal to the exact W?
Now you can start to see how the dimension can play a role — the question that you can ask is: what is the role of the dimension, how many data does one need to get a good diffusion model? — and the reason is the following. Imagine that you are in fixed dimension. If you are in fixed dimension and you want to estimate this matrix C in a good way, you just take P very large, and then, since the different pieces of the different trajectories are just independent random variables, you use the central limit theorem: this will converge to the correct covariance, and if it converges to the correct covariance, then you get back the true matrix. Now, the problem is when the dimension is large: then this is a very large matrix, and you know that for very large matrices one has to be careful. If you look at this, it is nothing else than a Wishart matrix, for people who like random matrices, and in the case of Wishart matrices it is known that if the dimension of the matrix is of the same order as P, the number of data, then this empirical matrix is quite different from the true one. So you can start to feel that the dimension will play a role, and that there will be a competition between the dimension and the number of data. So let's see: at the end, in this case, it is nothing else than a random matrix theory problem — the problem of understanding the competition between dimension and number of data.

[Question, partly inaudible, about whether the noise realizations come from the same process.] Yes — for the general process not necessarily, but for what I am going to tell you about the different regimes it's not important here; you're right that in principle one wants to study the backward diffusion, with its discretization.
All right, so what is this random matrix theory problem? I just want to give you the key idea that allows you to understand what's going on. Let's focus again on this matrix D. Let me tell you exactly what happens: if I consider just one element D_ij, then when P is large I know that at a certain point it will converge to the good value, C_t, which is the true covariance; and if P is not infinite there will be fluctuations. What is the order of these fluctuations? Just one over the square root of P. Now, what I can do is rescale this 1/√P: I write it with a square root of n in the numerator and a square root of n in the denominator — I hope it's clear, I just put a square root of n on top, and so there is a square root of n at the bottom — which means that I can write D_t as the matrix C_t that I am interested in, the one that I would like to have as an estimate, plus an error, and this error is √(n/P) times a matrix R. Now, the matrix R has the correct scaling in large dimension, in such a way that its density of eigenvalues is of order one: if you think, for example, of the usual random matrices, to get eigenvalues of order one you must have elements of order 1/√n.

So if I write it this way, this is now a problem that has been studied a lot in recent years for simple matrices: you take a deterministic matrix C_t, and R is a random matrix — in the simplest case a GOE matrix, which is not quite our case — then this is called a deformed matrix model, and there have been a lot of works, first from physics and then from math, that tell you the following: if P is much larger than n, the density of eigenvalues of D is exactly the same as the density of eigenvalues of C in the large-n limit, but the eigenvectors are completely different — unless P is much larger than n squared. If you have much, much more data than the dimension — than the dimension squared, in fact — then you really have convergence between D and C, meaning that the eigenvalues of D are the same as the eigenvalues of C up to subleading terms, and the eigenvectors are also oriented exactly in the right way. Okay, so this is what happens for the deformed GOE matrix; in our case, if you look at what this is, we are in a different situation, because this is more a Wishart matrix and it is a bit more complicated, but that's the idea: the matrix that we want to estimate is corrupted by a random matrix times something which is √(n/P), and we can use the same kind of argument that was used for the deformed model.

So, to cut a long story short, what you get in this case is that in generative diffusion in high dimension you find three regimes. What the diffusion model does is generate Gaussian data — this is clear, because it starts from Gaussian data, then the Langevin equation gives you Gaussian trajectories, then you go back again with Gaussian trajectories, so the output will be Gaussian, with mean zero and some covariance — and the question is how this covariance resembles the true covariance that you have at t equal to zero. And what you find is the following. If the number of data is much less than the dimension, then the model is clearly wrong: this covariance matrix has nothing to do with the true matrix, and the diffusion model just produces nonsense. Then there is the second regime, in which the number of data is much larger than the dimension but less than the dimension squared: then you get the same density of eigenvalues as the matrix at time zero, but different eigenvectors. What this means is that if you look at the generative process, at quantities like x·x/n — this quantity is nothing else than the trace, it is related to the trace of C, which is related to the integral of the spectral density of eigenvalues weighted by λ — then this will be correctly reproduced: the global strength of the fluctuations will be correctly reproduced, but if you want to know in which directions the fluctuations are, the eigenvectors, then it can be wrong. And the interesting thing, which is more for mathematicians and was suggested to us by an anonymous referee, is that you can look at the Wasserstein distance between these high-dimensional probability distributions — these distances have been discussed a lot in recent years, especially in connection with optimal transport — and what you find is that in this regime the Wasserstein distance between the high-dimensional generated process and the true one vanishes. Now let's look at the third regime, in which the number of data is very large, larger than the dimension squared: in this case the generative process is really good — eigenvectors and eigenvalues are good, so the directions are correct — and in this case not only does the Wasserstein distance vanish, but also the total variation distance, which is a very demanding distance between probability distributions in high dimension, vanishes as well.
So I think what is interesting here is that, first, it's a very simple model, but you start to see how the dimension plays a role and how it is in competition with the number of data; and you can also see that when you say a generative model is good — well, it depends on what you are looking for: if you are looking at this kind of variable, then it's good, but if you are looking really at the directions of the fluctuations, then this regime is not.

[Question: and the time doesn't play any role? You stop before forgetting completely the initial condition?] Yes — in this case we didn't look at the time; that's exactly what I'm going to do in the next slide. You're right: here we just took the time very large. But that's an important issue.

[Question: even if the eigenvectors are wrong, if you look at a random direction, don't you already have a good variance?] You have to pick a direction, yes. If you take a random direction — a random direction is more or less orthogonal to all the eigenvectors, so it's tricky: if you pick a direction at random, then you will be fine, I agree; but if you want to know what the important directions are... [Question: ...] I don't think so; I mean, one eigenvector is really a completely delocalized combination of all the other ones, just because these eigenvectors are very sensitive to perturbations, so they change completely.
All right, so let's move on: this was the simple case I mentioned, but we wanted to go a little bit beyond it; we wanted to add a distribution of data which is structured. In order to do this, what kind of probability distributions in high dimension can one have? Well, the ones that come from physics — especially the ones you study in statistical physics that have a phase transition. As models you can have many things in mind: Ising models, lattice gases, liquids — let's say any standard physics distribution in which the system has a phase transition in high dimension. And we want to be in the low-temperature case, in which the symmetry is broken. What this means is that we give to the diffusion model different configurations: some — if we are considering a magnetic system — will be positively magnetized, some will be negatively magnetized; we transform everything into white noise, and then the system should be able to generate some configurations which are positive and some configurations which are negative. And of course, in our heads there was also the idea that the different basins could correspond to different classes: if I have a dog and a cat, and I transform everything into white noise, then the model has to be able to separate them when generating. Just to give you a very simple one-dimensional example: you have two peaks, you apply the diffusion and they are transformed into just one, and the system has to be able to go back and break the symmetry. So this is the diffusion model, and what we ask is how the system does this, but now in very high dimension. Okay.
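The one-dimensional two-peak picture can be simulated directly. Below is my own sketch, not code from the talk: data sitting at ±1 with unequal weights 1/3 and 2/3, the exact score of the noised mixture, and the backward dynamics — stopped shortly before t = 0 — to check whether the generated samples break the symmetry with the right weights:

```python
import numpy as np

rng = np.random.default_rng(5)

# 1D caricature of broken symmetry: data are +1 with prob. 1/3, -1 with prob. 2/3.
w_plus, m = 1 / 3, 1.0

def score(x, t):
    # exact score of P_t = w+ N(m e^{-t}, D) + w- N(-m e^{-t}, D), D = 1 - e^{-2t},
    # i.e. the two data points noised by dx = -x dt + sqrt(2) dB
    D = 1 - np.exp(-2 * t)
    mu = m * np.exp(-t)
    gp = w_plus * np.exp(-((x - mu) ** 2) / (2 * D))
    gm = (1 - w_plus) * np.exp(-((x + mu) ** 2) / (2 * D))
    return (-(x - mu) * gp - (x + mu) * gm) / (D * (gp + gm))

T, dt, t_stop = 6.0, 2e-3, 1e-2
x = rng.standard_normal(10_000)            # white noise
t = T
while t > t_stop:                          # backward generative dynamics
    x += (x + 2 * score(x, t)) * dt + np.sqrt(2 * dt) * rng.standard_normal(x.size)
    t -= dt

# Fraction of samples in the "+" state: close to the true weight 1/3.
print(round(float(np.mean(x > 0)), 2))
```

Stopping at a small but nonzero time matters: running all the way to t = 0 makes the drift collapse each sample exactly onto a data point, which echoes the remark in the highlights about stopping the process at the right time.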
So again, in the tradition of physics, the first thing we can do is look at the simplest model ever, and the simplest model of a phase transition is the Curie-Weiss model. We take Ising spins, which are plus or minus one, and the model is fully connected: everyone interacts with everyone else, with interaction of order one over N. If the temperature is low enough there are two states, one positive with magnetization +m and one negative with -m. Now, the thing that we want to play with: we put in a magnetic field which is very small — of order one over N — because what we want is that the weights of the two states, the plus and the minus, are not one half and one half as in the usual case, but, say, one third and two thirds. What we have in mind is the following: imagine that when I train a diffusion model there are, I don't know, more images of one class than of the other class — is the diffusion model actually able to capture this kind of information, that is, the statistical weights of the different classes? Because somehow our intuition was that maybe you get the images right — you always see cats and dogs — but the weights with which you get the cats and the dogs may be wrong, or at least more difficult to get, and we wanted to see whether this could be the case. This is why we take this Hamiltonian, N being the dimension as always. And then the other thing we want to know is how, in this exactly solvable case, the diffusion model generates the symmetry breaking.
Oops. It is simple enough to be completely solved. In this case we don't play the same game as before, in the sense that for the moment we don't work with a fixed number of data: we just look at the exact score, that is, at how a diffusion model that works perfectly behaves. In this case we can compute the score exactly. I won't give you its expression, but I'll just zoom in on what happens at the beginning of the backward process — you start with white noise and run the diffusion process backward — and tell you what kind of Langevin equation we have and how it is able to reconstruct the images.
Well, first, at the beginning of the backward process the relevant variable is the sum over i of the x_i: if you divide it by N you get a variable — the magnetization — which is of order one, and for this variable you can write down an evolution which is simple enough: a Langevin equation in a potential V. You can compute the potential exactly, and there are clearly two different regimes, depending on whether the factor √N e^{-t} is small or large: it is very small at the beginning of the backward process, where t is large, and becomes very large at the end, because t goes to zero.
So what you see is that at the very beginning of the backward process the potential you get is an inverted parabola. Since there is a field h, it is not centered around zero but is actually an inverted parabola centered somewhat to the right, in this case. What happens now is that when you start the process, the magnetization will be somewhere near the top; because of the noise it will start to diffuse, and if it falls on one side, say this side, it keeps going down in that direction, and on the other side in the other direction. Since the parabola is slightly shifted to the right — because h is positive — we will have more trajectories that go to the left than trajectories that go to the right. This is how the system starts to develop different weights. And now, when √N e^{-t} is much larger than one, this is the potential that you get, and the first thing you can see is that h is not there anymore: there is no more information in the backward diffusion process about the weights. What happens is that the symmetry is manifest: some trajectories are going down this way, some are going down that way, and the barrier in between has become very, very large, so a trajectory is never going to go back. The symmetry has been broken, and the parameter controlling this symmetry breaking is precisely this exponential factor, √N e^{-t} — so here we really see how the symmetry is broken.
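This crossover can be checked in a stripped-down calculation. The sketch below is my reconstruction, not the exact Curie-Weiss computation: it takes as data two symmetric states at ±√N along the magnetization direction (mimicking the ±m states after projection, with h set to zero) and computes the potential of the backward Langevin drift for the projected variable; its curvature at the origin flips sign when √N e^{-t} crosses order one:

```python
import numpy as np

N = 10_000                                   # dimension: states sit at +/- sqrt(N)

def log_p(m, t):
    """log-density (up to a constant) of the diffused projected variable when
    the data are two symmetric states at +/- sqrt(N); forward process is OU."""
    mu  = np.sqrt(N) * np.exp(-t)
    var = 1 - np.exp(-2*t)
    return -(m**2 + mu**2) / (2*var) + np.logaddexp(m*mu/var, -m*mu/var)

def U(m, t):
    """Potential of the backward Langevin drift  b = m + 2 (log p_t)' = -U'."""
    return -m**2/2 - 2*log_p(m, t)

def curvature_at_origin(t, eps=1e-4):
    # second finite difference of U at m = 0
    return (U(eps, t) - 2*U(0.0, t) + U(-eps, t)) / eps**2

t_s = 0.5*np.log(N)          # sqrt(N) e^{-t} crosses 1 at t_s = (1/2) log N
print(curvature_at_origin(t_s + 2.0))   # positive: single confining well
print(curvature_at_origin(t_s - 2.0))   # negative: inverted top of a double well
```

The sign flip around t ≈ ½ log N is exactly the single-well-to-double-well transition described above.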
And we can also show that, with this kind of equation for this potential and the exact score, the system reconstructs the weights exactly. But what is important, you see, is that to reconstruct the weights in a good way the model has to work very well at the very beginning of the backward process: that is when everything is decided in terms of symmetry breaking and of the weights of the different classes in the generative process. So this is what happens for this model. Can one be more general than this? The answer is yes: you can actually get this result in any kind of statistical physics model; you don't need this particular model. And the nice thing is that there is a general way to do it.
It is the following. Here is the way in which I write the backward process: it is just a Langevin equation, and here is the score. Now, the very interesting thing about the score — and this is also why I think all of this is related to thermodynamics, including the ideas mentioned earlier — is that if you look at what it is, there is a simple piece which you can do exactly, and then what remains is the derivative of a free energy with respect to an external magnetic field. This is the free energy of the distribution at the beginning: if you know the free energy as a function of the external magnetic field, then you know the score. What you have to do is take the derivative of the free energy and evaluate it at an external magnetic field which is just proportional to x times e^{-t}. Of course, in general you don't know this free energy; but if you do, then you know the equation that allows you to go backward.
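The identity can be written out explicitly. This is a reconstruction assuming the standard Ornstein-Uhlenbeck forward process $x_t = x_0 e^{-t} + \sqrt{1-e^{-2t}}\,\eta$, with $\Delta_t = 1 - e^{-2t}$:

$$P_t(x) \propto \int \mathrm{d}x_0\, P_0(x_0)\, e^{-\lVert x - x_0 e^{-t}\rVert^2 / (2\Delta_t)} .$$

Expanding the square separates a trivial Gaussian piece from a field-dependent partition function,

$$\log P_t(x) = -\frac{\lVert x\rVert^2}{2\Delta_t} + \log Z[h] + \mathrm{const}, \qquad Z[h] = \int \mathrm{d}x_0\, P_0(x_0)\, e^{-e^{-2t}\lVert x_0\rVert^2/(2\Delta_t)\, +\, h\cdot x_0},$$

evaluated at the field $h = x\, e^{-t}/\Delta_t$. Hence the score is

$$s(x,t) = \nabla_x \log P_t(x) = -\frac{x}{\Delta_t} + \frac{e^{-t}}{\Delta_t}\, \frac{\partial \log Z[h]}{\partial h}\bigg|_{h = x e^{-t}/\Delta_t},$$

that is, a derivative of (minus) a free energy with respect to the external field, as stated.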
Now, why is this helpful? Because if we study a case in which we have symmetry breaking, with weights which are slightly different between one phase and the other, and we are at the very beginning of the backward process, then the external magnetic field is very small, since e^{-t} is tiny. So we can write the partition function as a sum over the different states, and since the external magnetic field is very small it acts exactly like a symmetry-breaking field: you get a sum over the states α, each contributing through its weight and its magnetization. This is something you can always do, and it holds at the beginning of the process. Plugging this expression in, you get a general expression for the score at the very beginning of the backward process, and again you find a generalization of what I told you before: it is a sum over α of the weights w_α, which you want to reconstruct, times an exponential of μ_α · x, the projection of the vector x onto the direction of the magnetization of state α, and then there is this factor which is of order √N e^{-t}. So, cutting a long story short, it is similar to what we had before: you will see that the weights matter, and you really reconstruct them while √N e^{-t} is at most of order one; when this factor becomes very large the symmetry is broken, the system has committed, and you have lost any information about the weights.
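In formulas — again my reconstruction, using the notation of the free-energy identity above — the small-field expansion over the pure states reads

$$Z[h] \simeq \sum_\alpha w_\alpha\, e^{\,h\cdot \mu_\alpha}, \qquad h = \frac{x\, e^{-t}}{\Delta_t},$$

where $\mu_\alpha = \langle x_0\rangle_\alpha$ is the magnetization of state $\alpha$ and $w_\alpha$ its weight, so that

$$s(x,t) \simeq -\frac{x}{\Delta_t} + \frac{e^{-t}}{\Delta_t}\, \frac{\sum_\alpha w_\alpha\, \mu_\alpha\, e^{\,\mu_\alpha\cdot x\, e^{-t}/\Delta_t}}{\sum_\alpha w_\alpha\, e^{\,\mu_\alpha\cdot x\, e^{-t}/\Delta_t}} .$$

Since $\lVert\mu_\alpha\rVert \sim m\sqrt{N}$ while the components of $x$ are of order one, the argument of the exponentials is of order $\sqrt{N}\, e^{-t}$, which is the factor controlling the symmetry breaking.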
In this example — think of the very low-dimensional picture, with just one dimension — what it means is that at the very beginning the system undergoes this phase transition along the paths: the system commits, some trajectories go this way, others go that way, and once the region which corresponds to the commitment is crossed, the system never goes back — you continue on this side or on that side. So this clearly tells us, as someone was asking, that it is very important where you stop the Langevin equation in the forward process: if you don't run it far enough you are in trouble — if you stop it here, it is probably going to be very bad for the generation process — and the time at which this happens depends on the dimension of the data.
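Making the time scale explicit — a sketch under the same assumptions as above — the symmetry breaking, and hence the loss of the weight information, happens when the controlling factor crosses order one:

$$\sqrt{N}\, e^{-t_s} \sim 1 \quad\Longrightarrow\quad t_s \simeq \tfrac{1}{2}\,\log N,$$

so the forward process has to be run for a time growing logarithmically with the dimension, and the score has to be accurate around $t_s$, which is where the class weights are imprinted.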
So, the question is: what about images — does all this work for images? For images we don't know the probability distribution, but one can do experiments, and some experiments have actually been done. What was done is to take a dataset with ten classes and consider two of them: in principle each class is represented with the same weight, but you can construct an artificial data set in which one class is represented, I don't know, a little bit more than the other, and then try to see whether the diffusion model is able, first, to generate the images correctly and, second, to get the weights correctly. And the answer — without doing any analysis in terms of dimension — is that the system is clearly able to reconstruct images which are not bad, but that to get the weights correctly you have to go to very long times and be extremely precise in the estimation of the score.
Another piece of numerical evidence comes from a very recent paper, from June 2023. What they do — you cannot see it well here, but it's not important — is exactly what I told you: they run the Langevin process for some time and then go back. On one axis there is a value — a metric used to quantify how close the probability distribution of the generator is to the distribution of the input — which is low if the system is working well and high if it is not. On the x-axis, which maybe you cannot see, is the time at which they stop: here is a very long time, here a very short time. And what they say is that there is a certain time such that, if you go below it, the system is going to be very bad — you stopped too early — while going beyond it does not change much. Here is an example: this is the quality of the image when you stop here, this when you stop there. It is indirect evidence — not direct evidence — that there is a specific time scale at which things really change. And then, because they were also triggered by the idea of symmetry breaking, they computed — here you have two classes — the potential along the direction which is a linear interpolation between one image and the other: the potential in the backward process is in principle a function of many, many variables, but you can always look at it along one direction. What they claim is that indeed the time scale that they find is the time scale over which this potential goes from two wells to one well — in this case, remember, what I showed you before was the potential, but inverted.
[Exchange with the audience, partly inaudible, about the metric going down again at long times.] Yes — only for very long times; if you go to very long times it goes down. Do we know why it goes down there? I don't know — it might be related to some kind of overfitting, but in reality I don't know; I know that there are problems related to that limit. Actually, I think that part is when you stop at a very early time, because the time axis is inverted — here is late time, here is early time — and the drop there is probably just noise, I'm sorry. They claim that this is indeed the potential I showed you before, but inverted; and in their case it is not just one dimension — things are much more complicated — but they claim to observe the kind of symmetry breaking that I told you about. In reality I'm not sure: it is what you can do numerically, and I'm not sure it happens exactly at the same time that we have here, but it is some evidence that they have.
have all right so now I can conclude so
what I wanted to do today is to present
the first study of high dimensional
diffusion model and the two message May
is one is that clearly we see that there
is a two different time regimes is the
time regime related to I think imry
breaking and probably IM the formation
of different classes and here it's
important if you don't to get the
weights right so getting the weight of
the different classes would be for
example very important for all
discussion thess in machine learning and
then actually once the system is
committed has committed and then then
just each one of class or each phase if
you have SC physics model so we were
able to study the simp case which is
dimensional G the competition between
number of data
and and dimension and they also discuss
what does it mean that the diffusion
model is good well we have SE that well
it depends what you ask are some some
some regim in which diffusion model can
be considered to be good if you look at
some observables but then if you want to
be extremely then you have have a very
large number of of data now there are I
Let me just give three perspectives — three things we are working on — but I think there are many, many more. The first: in all my talk I always knew the exact score, and in the Gaussian case I then used empirical risk minimization to get an approximation of it; but for images you don't know the exact score, so there is clearly a discussion about what happens if you use an approximation which is maybe not good enough, and about why the approximations that have been used — the kind of architecture I showed you, a deep network which goes down and then up again — work well. So there is a discussion to have, and things to think about: what are the good approximation classes, what are the good architectures, and why are they good given the data that we have — which is what was discussed before. Second, I think it would be nice in this context to discuss the competition between the number of data and the dimension in cases which are more difficult than just the Gaussian. The easiest would be to repeat the analysis for simple distributions: in the Curie-Weiss case I just used the exact score, but of course I can play the same game — I can take an empirical risk minimization that fits the parameters and ask how much data I need at large dimension for the diffusion model to work well — and then one can go a little bit beyond, to models which are better suited for statistical physics. And then one can actually go towards realistic applications.
For example, there is inpainting, which is the case in which you know just a part of the image and you want your diffusion model to create an image which matches well with that part — it is more a conditional generation problem, something that can be discussed, and there are very nice relationships with statistical physics.
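One common recipe for diffusion-based inpainting is a replacement-style scheme, sketched here on a toy dataset with an exact score — this is an illustration of the idea, not necessarily the approach the speaker has in mind: run the backward SDE while clamping the known pixels, at every step, to their forward-noised observed values:

```python
import numpy as np

rng = np.random.default_rng(1)
T, dt, d = 5.0, 1e-3, 10
# two training "images": the all-(+1) and all-(-1) pixel vectors
mus = np.stack([np.ones(d), -np.ones(d)])

def score(x, t):
    """Exact score of the OU-diffused empirical distribution of `mus`."""
    var = 1 - np.exp(-2*t)
    diff = x[:, None, :] - mus[None, :, :]*np.exp(-t)   # (n, 2, d)
    ll = -(diff**2).sum(-1) / (2*var)                   # (n, 2)
    g = np.exp(ll - ll.max(1, keepdims=True))
    g /= g.sum(1, keepdims=True)                        # responsibilities
    return (g[..., None] * (-diff/var)).sum(1)          # (n, d)

obs = np.ones(d - 1)              # observed pixels 0..d-2 (all equal to +1)
x = rng.standard_normal((1000, d))
t = T
while t > 1e-2:
    x += (x + 2*score(x, t))*dt + np.sqrt(2*dt)*rng.standard_normal(x.shape)
    t -= dt
    # inpainting step: clamp the known pixels to their forward-noised values
    x[:, :-1] = obs*np.exp(-t) + np.sqrt(1 - np.exp(-2*t))*rng.standard_normal((len(x), d - 1))

print("inpainted pixel, mean:", x[:, -1].mean())        # close to +1
```

The clamped pixels steer the responsibilities toward the consistent mode, so the free pixel gets filled in consistently — and note the danger discussed next: with very few data, "matching the known part" can degenerate into reproducing a training image.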
And then there is the copyright problem that was discussed. Here, clearly, for applications — if you talk to people from DeepMind — what they really don't want is that your diffusion model recreates an image that has been used for training, because then you have a huge problem with copyright. And it's true that, if you think about it, if the only thing that you know is the empirical distribution of the data, it could be that the original distribution is just a sum of delta functions exactly localized on the data; so if your model were perfect in that sense, it would reconstruct the data from which it was trained. But it doesn't — well, actually, in some cases it does. So understanding when it is going to do that and when it is not is, I think, a very interesting problem, on which people are really very, very interested.
Right — can you interpret the P(x) that you get as an effective model at finite temperature? So it's like adding temperature to the model; and then, when you say that the time is not enough to recover the weights, is it because you are still below the critical temperature, so you would have to go up to the paramagnetic phase in order to recover the whole distribution correctly?

No, I don't think it can actually be seen as a temperature, because in a sense only at times which scale like log N does the system leave the broken phase. Yes, we are putting noise on top of it, but you are not putting the noise in the way you would do it with temperature — I think it's a different kind of noise. There might actually be an interpretation in terms of coarse-graining, more than temperature: if you add noise which depends on the scale, then adding the noise would be more like a coarse-graining. So in this case I would think it's more like a coarse-graining than a heating.
In any case, the critical time that you need in order to make this work — when you have, for example, a simple mixture of two Gaussians, does it correspond to the time at which the two Gaussians merge? Can we relate the optimal time to the distribution? [Partly inaudible.]
Because what you do here is that you learn the diffusion model on images, and then you want to be able to say: okay, I give you just a small piece of an image, and you should generate a new image which matches well with this small piece — but you don't want to reconstruct the training image which had this piece inside. So again, you don't want to reproduce the data: you want a new generative model, if you like, but one that produces images that match well with this given part.
So — you told us about the time scale related to the dimension, at which trajectories commit; may there also be a time scale to do with when the diffusion has blurred the data points together?

Yes, I think so. The time at which you have a blurred distribution — I think that time scale is more related to the copyright problem: it is the time at which you realize that your distribution just makes an average over the data points, and it is a small time, so the two are really different.

Thanks. Besides the mathematics, can you give us some intuition on why the first stages of the backward process are so crucial to guess the right probability distribution? Because when you measure the score, I get the feeling that if you measure for long enough, the end of the forward process — which is the beginning of the backward one — is just asymptotically Gaussian. How does it contain so much information, such that when you go backwards you would miss it?

I think the reason is the following: if you think in terms of the data — say a Gaussian process — there are some directions in which you have very strong correlations, and under the noise these are the ones that will disappear last. I think it is also what you see with images: you start to see the overall form of the image first, and then all the details arrive. So the big-picture structure has a very strong component of the signal, and it is the one that disappears last — hence the one that the backward process reconstructs first.
Does this have something to do with the fact that you mentioned at the beginning — that when you learn the score with the network, you put more weight on the last times, or on the first?

No, actually — they don't think about this. What they do is play with the weighting at small times, near the data, and not at the other end, and the reason is that they don't want to have the copy problem: they don't want to collapse back onto the original data. At the other end, instead, they don't do anything special — but in principle one might be interested in getting the beginning of the backward process right.

Yes — but then, for example, I think there was no paper that asked the question. Typically you see that the system is good because the images are good, and they don't think, for example, about getting the weights right. If you have in mind that you want to get the weights right, then you want to put a lot of precision at the beginning.