What are Diffusion Models?
Summary
TLDR: Diffusion models are an emerging class of generative models. They simulate a process in which Gaussian noise is gradually added to a clean image until only pure noise remains, and then learn to reverse that process, removing the noise step by step to recover an image. These models perform strongly in image generation and in conditional settings, in some tasks even surpassing generative adversarial networks (GANs). The video explains the basic mechanics of diffusion models, including the forward diffusion process, the reverse denoising process, and how the model is trained by optimizing a variational lower bound. It also covers how diffusion models can be applied to conditional generation and image inpainting.
Takeaways
- 🌟 Diffusion models are an emerging approach in generative modeling, with particularly notable progress in image generation.
- 🚀 A diffusion model gradually adds Gaussian noise until an image becomes a sample of pure noise, then learns a reverse process that recovers the original image.
- 🔄 The forward diffusion process turns data into noise step by step, while the reverse process tries to recover data from noise.
- 📈 The forward noising process is described by a Markov chain: the distribution at each step depends only on the sample from the previous step.
- 🎯 In the reverse process, the model learns denoising steps that guide a sample back onto the data manifold, producing a plausible sample.
- 🔧 During training, the objective is to maximize a lower bound (the variational lower bound) rather than directly maximizing the density the model assigns to the data.
- 🔄 Diffusion models can generate samples conditionally, for example on a class label or a text description.
- 🖼️ Diffusion models also perform well on image inpainting, where a model fine-tuned for the task fills in missing regions of an image.
- 📊 Compared with other generative models such as GANs, diffusion models achieve better performance on some tasks.
- 🛠️ Training a diffusion model relies on learning to undo the noise level associated with each step of the forward process.
- 🔗 Diffusion models are closely connected to the probability flow ODE, which approximates the log-likelihood via numerical integration.
Q & A
What is a diffusion model?
-A diffusion model is a generative model that learns a data distribution by simulating a process in which noise is gradually added to a clean sample until it becomes pure noise, and then reversing that process, removing the noise step by step to recover the original sample.
What have diffusion models achieved in image generation?
-Diffusion models have been remarkably successful in image generation, in some cases surpassing other generative models such as GANs. For example, recent diffusion models have outperformed GANs on perceptual quality metrics and shown impressive results in conditional settings such as text-to-image generation, inpainting, and image manipulation.
How is the forward process of a diffusion model defined?
-The forward process is defined as a Markov chain in which the distribution at each step depends only on the sample from the previous step. Noise is gradually added to the image until it approaches a pure-noise distribution.
Why do diffusion models use a forward process with small step sizes?
-Small step sizes make the reverse process easier to learn. With little noise added at each step, there is less ambiguity about where the chain came from, so the model can infer the previous state more accurately.
How does the reverse process of a diffusion model work?
-The reverse process is a learned, parameterized reverse Markov chain that removes noise step by step, with the goal of returning to the data distribution. It is trained by maximizing a variational lower bound rather than the likelihood directly.
What is the training objective of a diffusion model?
-The training objective is to maximize the variational lower bound (also known as the evidence lower bound), a lower bound on the marginal log-likelihood. It consists of a reconstruction term and a KL divergence term, which respectively encourage the model to maximize the expected density assigned to the data and to keep the approximate posterior close to the prior over the latent variables.
How do diffusion models perform conditional generation?
-A diffusion model can be conditioned by feeding a conditioning variable (such as a class label or a sentence description) as an additional input during training. At inference time, the model uses this conditioning information to generate samples specific to the condition.
How do diffusion models perform on image inpainting?
-Diffusion models have been successful at inpainting. A model fine-tuned specifically for the task can fill in missing regions of an image given the full surrounding context, avoiding edge artifacts.
What do diffusion models have in common with variational autoencoders (VAEs)?
-Like VAEs, diffusion models can be viewed as latent-variable generative models, and both use a variational lower bound as the training objective. However, the forward process of a diffusion model is typically fixed, and learning focuses entirely on the reverse process.
What does the continuous-time formulation of diffusion models give rise to?
-The continuous-time formulation gives rise to the so-called probability flow ODE, which allows the log-likelihood to be approximated via numerical integration, providing a density-estimation route beyond the variational lower bound.
How are diffusion models related to score matching models?
-Diffusion models are closely connected to score matching models. The score in a score matching model is equal, up to a scaling factor, to the noise predicted by a diffusion model, so denoising in a diffusion model can be viewed as approximately following the gradient of the data log-density.
Outlines
🌟 The basic idea of diffusion models
This segment introduces the basic principle of diffusion models: Gaussian noise is added to an image repeatedly until only a sample of pure noise remains, and the model's goal is to start from that pure-noise image and remove the noise step by step to recover a coherent image. This approach has been successful in image generation and surpasses other generative models, such as GANs, on some tasks. The video explores the basic mechanism of diffusion models and how they can be adapted to different generative settings.
📈 The forward and reverse diffusion processes
This part explains the forward and reverse processes in detail. The forward process pushes a sample off the data manifold by gradually adding noise, while the reverse process is trained to produce a trajectory back to the data manifold, yielding a plausible sample. The model is optimized by maximizing a variational lower bound consisting of a reconstruction term and KL divergence terms. The segment also discusses how a diffusion model can be viewed as a latent-variable generative model and how the training objective is set up.
🛠️ Training and implementing diffusion models
This segment covers training details, including how the reverse step is implemented, how the noise variables are handled, and how a reparameterization improves sample quality. The authors of DDPM set the reverse-step variances to time-specific constants and propose a simplified version of the variational lower bound that focuses training on the more challenging, noisier steps. It also discusses how to sample conditionally, for example on class labels or text descriptions, and how to handle conditional generation problems such as inpainting.
🚀 Progress and outlook for diffusion models
The final segment summarizes progress and future directions. Sampling from diffusion models is relatively slow, but ongoing research aims to speed it up. Diffusion models are also closely connected to the probability flow ODE, which approximates the log-likelihood via numerical integration, and to score matching models. Momentum behind diffusion models is strong, and their performance on image generation and density estimation benchmarks is exciting.
Keywords
💡Diffusion Models
💡Gaussian Noise
💡Generative Adversarial Networks (GANs)
💡Conditional Generation
💡Markov Chain
💡Variational Autoencoders (VAEs)
💡Variational Lower Bound
💡KL Divergence
💡Reparameterization
💡Autoencoder
💡Inpainting
💡Probability Flow
Highlights
Diffusion models are an emerging approach in generative modeling that has been notably successful in image generation.
On some tasks, diffusion models have already surpassed other generative models such as generative adversarial networks (GANs).
The basic idea is to simulate a process that gradually adds Gaussian noise, then learn to reverse that process, recovering a clean image from a noisy one.
The forward diffusion process is designed as a Markov chain in which the distribution at each step depends only on the sample from the previous step.
The reverse process is trained to produce a trajectory back to the data manifold, yielding a plausible sample.
The training objective is not to maximize the likelihood directly but to maximize a lower bound, the variational (or evidence) lower bound.
A diffusion model can be viewed as a latent-variable generative model: the forward process acts like an encoder producing latents from data, and the reverse process like a decoder reconstructing data from latents.
Only a single network needs to be trained, unlike a VAE, where two networks are trained jointly.
Each reverse step is parameterized as a Gaussian, which helps reduce variance during training.
Diffusion models support conditional sampling, for example generating images from class labels or text descriptions.
With special training, a diffusion model can guide its own sampling without relying on a second network.
Diffusion models have been successful at image inpainting, with better results obtained by fine-tuning a model specifically for that task.
Compared with other generative models, diffusion sampling is limited by a slow Markov chain, in contrast to the single forward pass of a GAN.
The continuous-time formulation of diffusion models gives rise to the probability flow ODE, which approximates the log-likelihood via numerical integration.
There is a close connection between denoising diffusion models and score matching models, which generate samples with a Markov chain guided by the gradient of the log-density.
Diffusion models are gaining momentum, and their prospects across many applications are exciting.
Transcripts
imagine we take an image and add a bit
of gaussian noise to it
then do this again
if we repeat this enough times
eventually we'll have an unrecognizable
picture of static a sample of pure noise
now what if we could figure out how to
undo this process
that is start from a noise image
gradually remove the noise and end up
with a coherent image
this is the basic idea behind diffusion
models an approach gaining traction in
generative modeling it has had success
particularly in the domain of image
generation and they are starting to
rival and in some cases surpass other
kinds of generative models you may be
familiar with on certain tasks
for example recent diffusion models have
outperformed generative adversarial
networks known as gans in perceptual
quality metrics and they've also shown
impressive performance in various
conditional settings such as converting
text descriptions to images
inpainting
and manipulation
in this video we'll try to understand
the basic mechanism behind diffusion
models and how they can be adapted to
different generative settings
we'll start with a sample from some
target data distribution like an image
from a training set
let's call this x0
now let's define a forward diffusion
process that gradually adds noise to the
image over big t time steps
our model will be tasked with starting
at x big t and undoing this noise
through what we'll call the reverse
process
the forward process which we'll denote
with q takes the form of a markov chain
where the distribution at a particular
time step only depends on the sample
from the immediately previous step
so we can write out the distribution of
corrupted samples conditioned on the
initial data point x0 as the product of
successive single step conditionals
in the case of continuous data each
transition is parameterized as a
diagonal gaussian
beta t here is the variance at a
particular time step t
typically these variances are treated as
hyperparameters and follow a fixed
schedule for a particular training run
beta generally increases with time and
is restricted to be between zero and one
meaning that this coefficient radical
one minus beta t
will likewise be non-zero but less than
one bringing the mean of each new
gaussian closer to zero
in the limit as t approaches infinity
q will approach a gaussian centered at
zero with identity covariance losing all
information about the original sample
in practice the total number of steps
big t is on the order of a thousand
using a large albeit finite number of
steps allows us to set the individual
variances beta t to be very small while
still approximately maintaining the same
limiting distribution
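To make the forward step concrete, here is a minimal PyTorch sketch (an illustration, not code from the video) of a single transition q(x_t | x_{t-1}) under an assumed linear beta schedule:

```python
import torch

T = 1000                                  # total number of forward steps ("big T")
betas = torch.linspace(1e-4, 0.02, T)     # assumed linear variance schedule, small at every step

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t]
    mean = torch.sqrt(1.0 - beta_t) * x_prev   # coefficient below 1 pulls the mean toward zero
    noise = torch.randn_like(x_prev)
    return mean + torch.sqrt(beta_t) * noise   # add Gaussian noise with variance beta_t
```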
but why do we want to use a small step
size what's the benefit
well it means that learning to undo the
steps of the forward process won't be
too difficult
let's consider a simple case in one
dimension
suppose we were given the distribution
of a forward process sample at time t
minus one and it resembled a mixture of
gaussians with two modes
we then observe x t
and want to infer the posterior
distribution over x t minus one
that is we'd like to determine where did
the chain likely come from in order to
arrive at x t
what was the previous step of the chain
if the noise step that is q of x t given
x t minus 1 is allowed to be large then
we will be quite uncertain about the
location of x t minus 1. who knows where
we jumped from
but if the forward noise step is
restricted to be small there is much
less ambiguity about x t minus 1.
we could then be justified in modeling
the posterior of the forward step that
is q of x t minus 1 given x t
with a unimodal gaussian
eliminating the contribution from the
mode to the right
and in fact it can be shown
theoretically that in the limit of
infinitesimal step sizes the true
reverse process will have the same
functional form as the forward process
so diffusion models leverage this
observation parameterizing each learned
reverse step to also be a unimodal
diagonal gaussian
aside from the sample at time t the
model also takes t as input in order to
account for the forward process variance
schedule different time steps are
associated with different noise levels
and the model can learn to undo these
individually
like the forward process the reverse
process is set up as a markov chain
and we can write out the joint
probability of a sequence of samples as
a product of conditionals and the
marginal probability of x big t
so what is p of x big t here exactly
well it's the same as q of x big t the
pure noise distribution
so at inference time in order to
actually generate a sample we start from
a gaussian and begin sampling from the
learned individual steps of the reverse
process p of x t minus 1 given x t until
producing an x0
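Sketched as code, inference is just a loop over the learned reverse steps; `model(x, t)` returning the reverse-step mean and the per-step standard deviations `sigmas` are assumed placeholders:

```python
import torch

@torch.no_grad()
def sample(model, sigmas, shape=(1, 3, 32, 32), T=1000):
    """Generate a sample by running the learned reverse chain from pure noise."""
    x = torch.randn(shape)                 # x_T ~ N(0, I), the pure noise distribution
    for t in reversed(range(T)):
        mean = model(x, t)                 # learned mean of p(x_{t-1} | x_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + sigmas[t] * noise       # no noise is added on the final step
    return x                               # the generated x_0
```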
okay great so we've defined these
forward and reverse diffusion processes
the forward process is designed to
essentially push a sample off the data
manifold turning it into noise and the
reverse process is trained to produce a
trajectory back to the data manifold
resulting in a reasonable sample
but what objective will we actually be
optimizing is it some maximum likelihood
objective where we directly maximize the
density assigned to x0 by the model
well not exactly
if we try to calculate p of x0 we see
that we have to marginalize over all the
possible trajectories all the ways we
could have arrived at x0 when starting
from a noise sample
this unfortunately is intractable
but it turns out we can maximize a lower
bound
to do this let's view x1 through x big t
as latent variables and x0 as an
observed variable
allowing us to interpret a diffusion
model as a kind of latent variable
generative model
if we think back to another latent
variable model you may be familiar with
variational autoencoders commonly known
as vaes
we might get a hint about our training
objective
as a quick reminder in a vae we have an
encoder that produces a distribution
over latents z given a data input x
and a decoder that reconstructs the data
by producing a distribution over data x
given a latent input z
so we can think of the forward process
in diffusion models as analogous to the
encoder producing latents from data
and the reverse process as analogous to
the decoder producing data from latents
now unlike a vae encoder the forward
process here is typically fixed
it's the reverse process that we focus
solely on learning
this means that only a single network
needs to be trained unlike in a vae
where two networks are trained jointly
so we can now borrow the basic training
objective used by vaes and a number of
other latent variable models
when we have a model with observations x
and latent variable z
we can derive what's called the
variational lower bound also known as
the evidence lower bound
a lower bound on the marginal log
likelihood p theta of x
we won't walk through the full
derivation here but the end result is a
likelihood term also known as a
reconstruction term subtracted by a kl
divergence term
the likelihood term encourages the model
to maximize the expected density
assigned to the data
while the kl divergence encourages the
approximate posterior q z given x to be
similar to the prior on the latent
variable p of z
as we saw earlier x0 will serve as the
observation in the diffusion model
framework while x1 through x big t will
take the place of the latent variable z
here
let's substitute these in
alright now let's simplify a bit
we can expand the kl divergence to
combine the two terms into a single
expectation
and finally we can refactor the chain
probabilities into their individual
steps
now there's a nice property of the
forward process q that we didn't
touch on earlier any arbitrary step of
the forward process can be sampled
directly in closed form
this is just because the sum of
independent gaussian steps is still a
gaussian
so at training time any term of this
objective can be obtained without having
to simulate an entire chain
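The closed form referred to here is the standard one: with alpha_bar_t the cumulative product of (1 - beta_t), x_t can be drawn in a single step. A small sketch under the same assumed schedule as above:

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)        # assumed schedule, as above
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly, without simulating the whole chain."""
    eps = torch.randn_like(x0)
    return torch.sqrt(alphas_bar[t]) * x0 + torch.sqrt(1.0 - alphas_bar[t]) * eps
```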
likewise we can optimize this objective
by randomly sampling pairs of x t minus
one and x t
and maximizing the conditional density
assigned by the reverse step to x t
minus one
however because different trajectories
may visit different samples at time t
minus one on the way to hitting xt
the setup can have high variance
limiting training efficiency
to help with this we can rearrange the
objective as follows
let's examine each component
p of x big t is fixed it's just the
start of the reverse process the pure
noise distribution
and as we saw earlier the whole forward
process q is also treated as fixed
so we just have to worry about these two
terms to the right
here we have a sum of kl divergences
each between a reverse step and a
forward process posterior conditioned on
x0
one can prove with bayes rule that when
we treat the original sample x0 as known
like it is during training these q terms
are actually just gaussians
since the reverse step is already
parameterized as a gaussian each kl
divergence now is simply comparing two
gaussians and can be evaluated in closed
form
this helps reduce variance in the
training process because instead of
aiming to reconstruct monte carlo
samples
the targets for the reverse step become
the true posteriors of the forward
process given x0
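For reference, those Gaussian targets q(x_{t-1} | x_t, x_0) have the standard closed-form mean and variance from the DDPM paper; a sketch, reusing the assumed schedule from the earlier snippets:

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def forward_posterior(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for t >= 1."""
    a_bar_t, a_bar_prev = alphas_bar[t], alphas_bar[t - 1]
    beta_t, alpha_t = betas[t], alphas[t]
    mean = (torch.sqrt(a_bar_prev) * beta_t / (1 - a_bar_t)) * x0 \
         + (torch.sqrt(alpha_t) * (1 - a_bar_prev) / (1 - a_bar_t)) * xt
    var = (1 - a_bar_prev) / (1 - a_bar_t) * beta_t   # the "tilde beta_t" posterior variance
    return mean, var
```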
there are a couple different ways we
could imagine implementing the reverse
step p theta in the paper denoising
diffusion probabilistic models
ddpm for short
the authors elect to set the reverse
process variances to time specific
constants as they found learning them to
lead to unstable training and lower
quality samples
so the reverse step network is solely
tasked with learning the means
they then suggest a reparameterization
that aims to have the network predict
the noise that was added rather than the
gaussian mean
first we can rewrite sampling from an
arbitrary forward step by using an
auxiliary noise variable epsilon
epsilon here has a constant distribution
independent of the forward time step t
and the reverse step model can be
designed to simply predict this epsilon
the authors also found that a simpler
version of the variational bound that
discards the term weights that appear in
the original bound led to better sample
quality
so compared to the original variational
lower bound their objective down-weights
steps that have very small noise at
early time steps of the forward process
allowing training to focus on more
challenging greater noise steps
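Concretely, the simplified objective comes down to a mean-squared error between the true and predicted noise. A hedged sketch of one training step, with `eps_model(x_t, t)` a hypothetical noise-prediction network:

```python
import torch
import torch.nn.functional as F

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def simple_loss(eps_model, x0):
    """DDPM's simplified objective: predict the noise added at a random step."""
    t = torch.randint(0, len(betas), (x0.shape[0],))              # one random time step per example
    eps = torch.randn_like(x0)                                    # the noise the network must predict
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)                       # broadcast over (C, H, W)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps  # closed-form forward sample
    return F.mse_loss(eps_model(x_t, t), eps)                     # unweighted MSE on the noise
```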
like other generative frameworks
diffusion models can be made to sample
conditionally given some variable of
interest like a class label or a
sentence description
one way to do this is to just feed the
conditioning variable y as an additional
input during training
in theory the model should learn to use
y as a helpful hint about what it should
be reconstructing in practice some work
has shown that further guiding the
diffusion process with a separate
classifier can help
in this setup we take a trained
classifier and push the reverse
diffusion process in the direction of
the gradient of the target label
probability with respect to the current
noise image
and we can do this not just with single
word labels but also with higher
dimensional text descriptions as well
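A rough sketch of such a classifier-guided reverse step, shifting the learned mean along the classifier gradient; `model`, `classifier`, and `guidance_scale` here are placeholders rather than a specific implementation:

```python
import torch

def guided_reverse_step(model, classifier, x_t, t, y, sigma_t, guidance_scale=1.0):
    """One reverse step nudged toward higher classifier probability for label y."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]         # d log p(y | x_t) / d x_t
    mean = model(x_t, t)                                      # learned mean of p(x_{t-1} | x_t)
    mean = mean + guidance_scale * (sigma_t ** 2) * grad      # push the mean toward the target label
    return mean + sigma_t * torch.randn_like(x_t)
```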
of course one drawback of this technique
is the reliance upon a second network
an alternative approach eliminates this
reliance instead using special training
of the diffusion model itself to guide
the sampling
in the paper classifier free diffusion
guidance the conditioning label y is set
to a null label with some probability
during training
then at inference time the reconstructed
samples are artificially pushed further
towards the y conditional direction and
away from the null label
even though no new information is being
given to the model they found this to
produce higher quality samples under
human evaluation compared to classifier
guidance
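As a minimal sketch, classifier-free guidance at inference time just mixes the conditional and null-label noise predictions; the null-label convention and `guidance_weight` value here are assumptions:

```python
import torch

NULL_LABEL = 1000   # a reserved index standing in for "no conditioning" (assumed convention)

def cfg_epsilon(eps_model, x_t, t, y, guidance_weight=2.0):
    """Combine conditional and unconditional noise predictions."""
    eps_cond = eps_model(x_t, t, y)                                  # prediction given label y
    eps_uncond = eps_model(x_t, t, torch.full_like(y, NULL_LABEL))   # prediction given the null label
    # extrapolate past the conditional prediction, away from the unconditional one
    return (1.0 + guidance_weight) * eps_cond - guidance_weight * eps_uncond
```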
inpainting is another conditional
generation problem where diffusion
models have had success
the naive way to perform in-painting
with diffusion models is to take a model
trained in the standard way and at
inference time replace known regions of
an image with a sample from the forward
process after each reverse step
now this works okay but can lead to edge
artifacts
the model is not being made aware of the
full surrounding context only a hazy
version of it
instead better results come from
fine-tuning a model specifically for
this task
we can randomly remove sections of
training images and have the model
attempt to fill them in conditioned on
the full clear context
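The naive inference-time approach described above can be sketched as follows, with `reverse_step` and `q_sample` hypothetical helpers along the lines of the earlier snippets; only the masked region is actually generated:

```python
import torch

@torch.no_grad()
def naive_inpaint(reverse_step, q_sample, x_known, mask, T=1000):
    """mask == 1 marks pixels to generate; mask == 0 marks known pixels to keep."""
    x = torch.randn_like(x_known)                       # start from pure noise
    for t in reversed(range(T)):
        x = reverse_step(x, t)                          # one learned denoising step
        if t > 0:
            x_known_noisy = q_sample(x_known, t - 1)    # known pixels at the matching noise level
            x = mask * x + (1 - mask) * x_known_noisy   # keep generated pixels only where unknown
        else:
            x = mask * x + (1 - mask) * x_known         # paste in the clean known pixels at the end
    return x
```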
we can compare diffusion models to some
other prominent generative models
for sampling tasks diffusion models are
somewhat limited by the slow markov
chain
this contrasts for example with gans
which can generate images in a single
forward pass
ongoing work aims to speed up sampling
in diffusion models
as we saw earlier diffusion models allow
us to calculate a variational lower
bound on the log likelihood similar to
vaes
in practice this lower bound can be
quite good and even competitive on
density estimation benchmarks which have
long been dominated by autoregressive
models
going beyond lower bounds a continuous
time formulation of diffusion models can
give rise to what's called a probability
flow ode
this enables approximating log
likelihood via numerical integration
there's a close connection between
denoising diffusion models and what are
called score matching models and often
these are now grouped together into a
single class of models score here refers
to the gradient of the log of the target
probability density with respect to the
data
a score network is trained to estimate
this value
then a markov chain is set up to
actually produce samples from the learned
distribution
guided by this gradient
well it turns out the score can actually
be shown to be equivalent to the noise
that's predicted in the denoising
diffusion objective up to a scaling
factor
so we can think of undoing the noise in
a diffusion model approximately as
trying to follow the gradient of the
data log density
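In symbols, the relation is score(x_t) = grad_x log q(x_t) ≈ -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t), i.e. the predicted noise negated and rescaled; a one-line sketch:

```python
import torch

def score_from_eps(eps_pred, alpha_bar_t):
    """Recover the approximate score from the predicted noise, up to the stated scaling."""
    return -eps_pred / torch.sqrt(1.0 - alpha_bar_t)
```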
diffusion models are really gaining
momentum and it's been exciting to see
their progress
check out the links in the description
to learn more
thanks for watching