How I Understand Diffusion Models
Summary
TLDR: This video script delves into diffusion models, pivotal for generating images from text, animating, and creating 3D models. It outlines the training process, emphasizing the Maximal Likelihood goal and Evidence Lower Bound (ELBO). The script explains the encoding and decoding mechanisms, highlighting the importance of the reparametrization trick. It also covers techniques for conditional generation, high-resolution image synthesis, and acceleration methods like DDIM and distillation. The script concludes with a call for questions, inviting viewer engagement.
Takeaways
- 🎨 Diffusion models are powerful generative models used for creating images, animations, videos, and 3D models from textual descriptions or by specifying conditions.
- 🧠 The core of diffusion models lies in training a denoising network that can iteratively refine noise into high-quality images, starting from a simple Gaussian distribution.
- 📈 Maximal Likelihood is the training goal, aiming to maximize the sample probability of the model, which is achieved by minimizing the Kullback–Leibler divergence between the model's output and the true data distribution.
- 🔍 The Evidence Lower Bound (ELBO) is a key concept in training, serving as a lower bound on the log-likelihood and is optimized by maximizing it during the training of diffusion models.
- 🌐 The encoding process in diffusion models is characterized by progressively adding noise over multiple steps, while the decoding process involves removing noise to generate the final output.
- 🔄 The training of diffusion models involves three main objectives: prior matching, reconstruction, and denoising matching, which ensure the model's output aligns with the Gaussian prior and reconstructs the original image accurately.
- 🔮 Three interpretations of the training process include predicting the clean image, the noise itself, or the score (gradient of the log-likelihood), each providing different insights into the model's operation.
- 📸 For conditional generation, such as creating images of cats wearing sunglasses from a text prompt, diffusion models can be guided using either a classifier or a classifier-free approach.
- 🖼️ High-resolution image generation can be achieved through various methods like cascade models, latent diffusion models, or end-to-end models, each with its own approach to scaling up image resolution.
- ⏱️ To accelerate the sampling process and make diffusion models more practical, techniques such as DDIM, distillation, and consistency models are used to reduce the number of required denoising steps.
Q & A
What are diffusion models known for?
-Diffusion models are known for their ability to generate high-quality images from text descriptions, animate people, create videos, and produce 3D models.
What are the four main topics covered in the video regarding diffusion models?
-The four main topics covered are Training, Guidance, Resolution, and Speed.
How does a generative model translate noise samples into high-quality images?
-A generative model aims to find parameters of a decoder that can transform a simple Gaussian distribution into a complex natural image distribution.
What is the training goal in terms of data distribution?
-The training goal is to maximize the sample probability, known as Maximal Likelihood, which involves approximating the true data distribution by collecting many data samples.
How is the log-likelihood value simplified in the context of diffusion models?
-The log-likelihood value is simplified by applying a log function to turn the product into a sum, which helps rewrite it as an expectation.
What is the Evidence Lower Bound (ELBO) and why is it significant?
-ELBO is a lower bound of the log-likelihood value and is significant because it measures the statistical evidence for the model. It is used to train models like Variational Auto-Encoders (VAEs) and diffusion models by maximizing this bound.
How does the encoding process in diffusion models differ from traditional encoding?
-In diffusion models, the encoding process adds noise to the image progressively over multiple steps, rather than encoding the observation in one step like in traditional models.
What are the three terms in the objective function for training a diffusion model?
-The three terms are: 1) Prior matching, ensuring the latent distribution is similar to the Gaussian distribution at the end of diffusion steps. 2) Reconstruction, similar to the VAE's reconstruction term. 3) Denoising matching, focusing on the difference between the noisy and less noisy images.
How does the reparametrization trick help in modeling noisy images?
-The reparametrization trick allows a random variable to be expressed as a deterministic function of a noise variable, which helps in modeling noisy images by representing them as a function of a clean image and a noise variable.
What is the role of the denoising network during the training of a diffusion model?
-The denoising network models the probability of a less noisy image given a noisy image and guides the generation process by predicting the direction to maximize the log probability, which is crucial for both unconditional and conditional generation.
How can high-resolution images be generated using diffusion models?
-High-resolution images can be generated using methods like cascade models, Latent Diffusion Models (LDM), or end-to-end diffusion models, each involving different approaches to upscaling or encoding images before decoding to the desired resolution.
What are some methods to accelerate the sampling process in diffusion models?
-Acceleration methods include using Denoising Diffusion Implicit Models (DDIM) for a deterministic generative process, distillation to reduce the number of sampling steps, and consistency models to ensure the model predicts the same origin for any point on the path.
Outlines
🧠 Understanding Diffusion Models
This paragraph introduces diffusion models, emphasizing their versatility in generating high-quality images, animations, videos, and 3D models from textual descriptions. The video aims to demystify how these models function by focusing on four key areas: training, guidance, resolution, and speed. It begins with a foundational concept, the Gaussian distribution, which is used to sample noise and transform it into complex natural image distributions. The training goal is to maximize the sample probability through Maximal Likelihood, which is simplified using logarithmic functions and expectations. The concept of Kullback–Leibler divergence is introduced to explain that maximizing likelihood minimizes the dissimilarity between the generative model's distribution and the true data distribution. The paragraph also discusses the challenges in computing log-likelihood and introduces the concept of an encoder to capture latent variable probabilities. The discussion concludes with the introduction of Evidence Lower Bound (ELBO), a key metric in training generative models like Variational Auto-Encoders (VAEs) and diffusion models.
🔍 Deep Dive into Denoising and Training
The second paragraph delves into the specifics of the encoding and decoding processes within diffusion models, highlighting the progressive addition and removal of noise. It discusses the training process, which involves maximizing the Evidence Lower Bound (ELBO), similar to VAEs. The encoding process is described as a series of transition probabilities, leading to a noise distribution after multiple steps. The paragraph then breaks down the training objective into three components: prior matching, reconstruction, and denoising matching. It explains how the denoising network is trained to predict the 'less noisy' image, using the reparametrization trick to express a noisy image as a function of a clean image and a noise variable. The discussion also covers the use of Bayes' rule and the concept of KL-divergence in training the denoising network, ultimately aiming to minimize the distance between the predicted and ground truth means of the distributions.
🖼️ Enhancing Image Generation with Guidance
Paragraph three explores the concept of conditional generation in diffusion models, focusing on how to guide the model to produce specific content. It introduces the idea of 'classifier guidance' where an additional classifier is used to guide the generation process. The paragraph also discusses the use of 'classifier-free guidance' which eliminates the need for a separate classifier by training a single conditional denoising network. The discussion then shifts to methods for generating high-resolution images, including cascade models, Latent Diffusion Models (LDM), and end-to-end approaches. Each method is briefly explained, highlighting their advantages and the trade-offs involved. The paragraph concludes with various techniques to accelerate the sampling process, making diffusion models more practical for real-world applications.
⏩ Accelerating Diffusion Models
The final paragraph focuses on methods to speed up diffusion models, which are typically slow due to the need for multiple evaluations of the denoising network. It introduces DDIM, a method that constructs non-Markovian diffusion processes to maintain quality with fewer steps. The paragraph also discusses distillation techniques, where a student model is trained to reproduce the output of a teacher model using fewer steps, effectively halving the number of sampling steps. Further acceleration is achieved through progressive distillation and latent diffusion models. The discussion includes the use of LoRA (Low-Rank Adaptation) to reduce the number of parameters in fine-tuning models, making the process more efficient. The paragraph concludes with a mention of consistency models and score distillation, which further enhance the speed and quality of diffusion models for tasks like text-to-image generation.
Keywords
💡Diffusion models
💡Generative model
💡Maximal Likelihood
💡Kullback–Leibler divergence
💡Encoder
💡Evidence Lower Bound (ELBO)
💡Variational Auto-Encoder (VAE)
💡Denoising
💡Reparametrization trick
💡Classifier guidance
💡High-resolution images
Highlights
Diffusion models are used for generating images from text, animating people, creating videos, and 3D models.
The generative model's goal is to transform simple Gaussian noise into high-quality images.
Maximizing sample probability, or Maximal Likelihood, is the training goal for approximating the true data distribution.
The log-likelihood is simplified using a log function, turning products into sums to rewrite as an expectation.
Maximizing likelihood is equivalent to minimizing the Kullback–Leibler divergence between the generative model's distribution and the data distribution.
An encoder is introduced to capture the latent variable probability given an observation.
The Evidence Lower Bound (ELBO) is used as a lower bound of the log-likelihood value for training.
Diffusion models encode images in multiple steps by progressively adding noise, unlike VAEs which encode in one step.
The encoding process in diffusion models is defined as a product of transition probabilities.
The training objective for diffusion models includes prior matching, reconstruction, and denoising matching.
The reparametrization trick is used to rewrite a random variable as a deterministic function of a noise variable.
The denoising network is trained to predict the mean of a less noisy image given a noisy image.
The training objective can be interpreted in three ways: predicting the clean image, the noise, or the score.
Classifier guidance is used for conditional generation by adding an adversarial gradient of a classifier.
Classifier-free guidance is an alternative to classifier guidance, training a single conditional denoising network.
High-resolution image generation can be achieved through cascade models, latent diffusion models, or end-to-end methods.
DDIM allows for non-Markovian diffusion processes, enabling faster and simpler sampling.
Distillation techniques can be used to accelerate diffusion models, reducing the number of sampling steps.
LoRA is a method to reduce the number of parameters in fine-tuning the denoising network for style generation.
Score distillation is a recent method that trains a student denoising model to reproduce the output of a teacher model more quickly.
Transcripts
Diffusion models excel in many applications such as generating
beautiful images from texts, animating a person,
creating videos
and 3D models.
We are going to learn and understand how diffusion models work.
In this video, we'll cover four main topics: Training, Guidance,
Resolution, and Speed.
Let's start with a simple Gaussian distribution for easy sampling.
We want to design a generative model translating these noise samples
into high-quality images.
More specifically, we want to find the parameters of this decoder
that can transform a simple Gaussian distribution
into a complicated natural image distribution.
However, we do not know what the true data distribution is
and can only approximate it by collecting lots of data samples.
Our training goal is to maximize the sample probability,
known as Maximal Likelihood.
Let's see what this maximal likelihood is trying to do.
We first apply a log function to simplify the expression,
turning the product into a sum.
This helps us rewrite it as an expectation.
Here is the definition.
If we subtract a constant
that has nothing to do with the parameter theta,
we find that this is just the Kullback–Leibler
divergence between the distribution from our generative model
and the data distribution.
So maximizing the likelihood means minimizing
the dissimilarity between these two distributions.
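The equivalence described here can be written out as a short derivation. This is a sketch in assumed notation that the transcript does not spell out: p_data is the true data distribution, p_theta is the generative model, and x_1, ..., x_N are the collected samples.

```latex
\begin{align}
\theta^{*}
  &= \arg\max_{\theta} \prod_{i=1}^{N} p_{\theta}(x_i)
   = \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_i) \\
  &\approx \arg\max_{\theta} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\theta}(x)\big] \\
  &= \arg\min_{\theta} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x) - \log p_{\theta}(x)\big]
   = \arg\min_{\theta} \; D_{\mathrm{KL}}\big(p_{\text{data}} \,\|\, p_{\theta}\big)
\end{align}
```

The subtracted constant is the expectation of log p_data(x), which does not depend on theta, so it changes the value of the objective but not its optimum.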
Sounds great, but it's hard to compute this log-likelihood value.
Either it involves integrating out all latent variable z's
or assuming that we know the ground truth latent encoder.
Let's see what we can do.
We first introduce
an encoder capturing a latent variable probability, given an observation.
This term is one,
since we integrate out all latent variables z.
Let's move the log likelihood inside the integral and express this
as an expectation.
Here we can apply the Bayes' rule and multiply a dummy term.
Next, we swap these posterior probability terms and separate them.
We now recognize the second term is a KL-divergence
between our encoder Q of Z given X and a ground truth encoder P of Z
given X.
We don't know this value because we don't have the access
to the ground truth encoder P of Z given X.
But we do know the KL-divergence is non-negative.
This means that the first term is the lower
bound of the log likelihood value.
Since the log-likelihood measures the statistical evidence for our model,
this term is known as the evidence lower bound, or ELBO.
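The steps narrated above can be collected into one derivation. This is a sketch in assumed notation: q_phi(z|x) is the introduced encoder and p_theta(z|x) is the ground truth posterior that we cannot access.

```latex
\begin{align}
\log p_{\theta}(x)
  &= \int q_{\phi}(z \mid x) \, \log p_{\theta}(x) \, dz
   = \mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x)\big] \\
  &= \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log \frac{p_{\theta}(x, z)}{p_{\theta}(z \mid x)}
       \cdot \frac{q_{\phi}(z \mid x)}{q_{\phi}(z \mid x)}\right] \\
  &= \underbrace{\mathbb{E}_{q_{\phi}(z \mid x)}\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right]}_{\text{ELBO}}
   + \underbrace{D_{\mathrm{KL}}\big(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)\big)}_{\geq 0} \\
  &\geq \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right]
\end{align}
```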
One type of latent variable model is the Variational Auto-Encoder (VAE),
which parameterizes
both the encoder and the decoder as Neural Networks.
We can train both the encoder and decoder by maximizing the ELBO.
Diffusion models are also latent variable models.
But instead of encoding the observation
X in one step, it encodes the image in multiple
steps by progressively adding more and more noise.
Similarly, the decoding process progressively removes the noise
to generate a sample.
In a VAE, we have the observed variable x and the latent variable z.
Similarly, in diffusion models, we call the clean image
x_0 and the latents x_1 to x_T.
We can train the diffusion model in the same way as we train
a VAE by maximizing the evidence lower bound.
Okay, let's
first take a look at what the encoding process looks like.
We can write the encoding process as a product of transition probabilities.
We define the transition probability at each time step as a Gaussian distribution
whose mean is the image from the previous time step, x_{t-1},
scaled by a scalar that's less than one, with some variance.
This encoding process ensures that the latent variables
become pure noise after many time steps.
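A minimal sketch of this forward process follows. It assumes the common DDPM parameterization, where the scale is sqrt(1 - beta_t) and the variance is beta_t, and a linear beta schedule with 1000 steps; the transcript only says "a scalar less than one and some variance", so these specific values and the function name `forward_step` are illustrative assumptions.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I) (assumed DDPM form)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x = np.full(10_000, 3.0)                # toy "image": 10,000 pixels, all equal to 3
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

# After many steps the signal is almost gone and x looks like unit Gaussian noise.
print(x.mean(), x.std())
```

Because every step shrinks the previous image and adds fresh noise, the original content decays geometrically while the total variance stays close to one.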
So with some derivation, our objective has three terms:
1) prior matching, 2) reconstruction, and 3) denoising matching.
The first term says that the latent distribution will be similar
to the Gaussian distribution at the end of the diffusion steps.
This is automatically satisfied by our forward diffusion process.
The second term is similar to the reconstruction term
in the Variational Autoencoder and is simple to compute.
I want to focus on how we can maximize this denoising matching term
or minimize this one.
Here we see three probability distributions.
First, what's the probability of a noisy image at time step t,
given a clean image x_0?
All we know is the transition probability of x_t given x_{t-1}.
To do this, we need to know the reparametrization trick.
This trick helps rewrite a random variable x
as a deterministic function of a noise variable epsilon.
Intuitively, we can represent a Gaussian distribution by scaling
the epsilon by the standard deviation and shifting the mean.
With this trick we can express x_1, x_2, and so on.
Plugging x_1 into the second equation leads to this expression.
Now we can simplify this because the sum of two
independent Gaussian variables is also a Gaussian.
Doing this recursively, we can write a noisy image
at time step t as a function of the clean image x_0 and a noise variable epsilon.
This means that we can directly sample from this Gaussian distribution.
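The closed-form sampling step can be sketched as below. The names `alpha_bars` and `q_sample` and the linear schedule are assumptions for illustration (they follow the usual DDPM notation, where alpha_bar_t is the cumulative product of 1 - beta_s); the transcript does not name them.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)     # cumulative product of (1 - beta_s)

def q_sample(x0, t, eps):
    """Jump straight to step t (0-indexed):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(8, 8))     # toy clean image
eps = rng.standard_normal(x0.shape)      # the noise variable epsilon
x_t = q_sample(x0, 499, eps)             # noisy image halfway through the schedule
```

This avoids looping over all intermediate steps during training: one draw of epsilon is enough to produce a noisy image at any time step.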
The second term says the following: Suppose
we know the clean image x_0
and the noisy version of it after t forward diffusion steps,
what's the probability of a "less noisy" image
x_{t-1}?
This tells us how to denoise a noisy image
when knowing the ground truth clean image x_0.
We use this to guide our denoising network that models
the probability of a less noisy image x_{t-1} given a noisy image x_t.
Here is an
actual photo of what's happening when training a diffusion model.
To derive this term, we apply the definition
of conditional probability and Bayes' rule.
Here we know exactly what these three probabilities are.
After some calculations, we find that it is also a Gaussian distribution.
The mean lies on the line between a noisy image and a clean image x_0.
We can also compute the variance in closed form.
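A sketch of this Gaussian's mean and variance is shown below, using the standard DDPM closed form. The schedule and the names `alpha_bars` and `posterior_mean_var` are assumptions for illustration and do not appear in the transcript.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean_var(x0, x_t, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a 0-indexed step t >= 1."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)           # weight on the clean image
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)  # weight on the noisy image
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    return mean, var

x0 = np.ones(4)
t = 500
mean, var = posterior_mean_var(x0, np.sqrt(alpha_bars[t]) * x0, t)
```

Both weights are positive, which matches the picture of the mean lying between the noisy image and the clean image. The variance does not depend on the images, only on the schedule.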
Here, the probability from our denoising network is also a Gaussian.
Since both are Gaussian
distributions with the same variance,
minimizing the KL-divergence term is equivalent to minimizing
the distance between the means of the two distributions.
The process looks like this: We sample a clean image
x_0 from the dataset and a noise image from a Gaussian distribution
with zero mean and unit variance.
We encode the clean image with forward diffusion to get a noisy image x_t.
We then compute the L2 loss between the predicted and ground truth means.
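The training step just described can be sketched as follows. Everything named here is an assumption for illustration: the linear schedule, the helper `posterior_mean`, and in particular `toy_denoiser`, which is a hypothetical stand-in for a neural network and not a real model.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean(x0, x_t, t):
    """Mean of q(x_{t-1} | x_t, x_0), a linear combination of x_0 and x_t."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    return (np.sqrt(ab_prev) * betas[t] * x0
            + np.sqrt(alphas[t]) * (1.0 - ab_prev) * x_t) / (1.0 - ab_t)

def toy_denoiser(x_t, t):
    """Hypothetical placeholder network: guesses x_0 by rescaling x_t."""
    return x_t / np.sqrt(alpha_bars[t])

def mean_matching_loss(x0, rng):
    t = int(rng.integers(1, len(betas)))                  # random time step
    eps = rng.standard_normal(x0.shape)                   # zero mean, unit variance
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    true_mean = posterior_mean(x0, x_t, t)                # uses the real clean image
    pred_mean = posterior_mean(toy_denoiser(x_t, t), x_t, t)  # uses the predicted one
    return float(np.mean((pred_mean - true_mean) ** 2))   # L2 distance between means

rng = np.random.default_rng(0)
loss = mean_matching_loss(rng.uniform(-1, 1, size=(8, 8)), rng)
```

In a real setup the placeholder would be a trainable network and this loss would be minimized with gradient descent over many images and time steps.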
By looking at this training objective, we can have three interpretations.
First, from the ground truth mean, we see a linear combination of noisy
image x_t and clean image x_0.
But why do we ask the denoising network to predict the noisy image
that we already know from the input?
Therefore, we express the estimated mean in the same form as the ground truth one
and only ask the model to predict the clean image x_0.
Second, from the forward diffusion process,
we know the relationship between the noisy image,
the clean image, and the added noise.
We can express the clean image x_0 as a function
of a noisy image x_t and a noise epsilon.
When we plug this in, we arrive at this new form of the ground truth mean.
Similarly, we can match the form of the
estimated mean with the ground truth one,
and only ask the denoising network to predict the noise.
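The switch from the clean-image form to the noise form can be checked numerically. This sketch assumes the standard DDPM expressions and a linear schedule; the function names `x0_from_eps` and `mean_from_eps` are illustrative and not from the transcript.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def x0_from_eps(x_t, eps, t):
    """Invert the forward process: recover x_0 from x_t and the added noise."""
    return (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])

def mean_from_eps(x_t, eps, t):
    """Denoising mean written in terms of the noisy image and the noise."""
    return (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=5)
eps = rng.standard_normal(5)
t = 300
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

recovered = x0_from_eps(x_t, eps, t)     # matches x0 when eps is the true noise
mean_eps_form = mean_from_eps(x_t, eps, t)
```

With the true noise plugged in, `mean_eps_form` agrees with the mean computed from the clean image, so predicting the noise carries the same information as predicting the clean image.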
To discuss
the third interpretation, we need to use Tweedie's formula.
The formula states that if we observe a sample
z from a Gaussian distribution, the posterior expectation of the mean
is the sample plus a correction term involving the gradient
of the log likelihood or the score of the estimate.
Let's apply the formula to our forward diffusion probability.
Replace the mean here and we now have this expression.
When we replace the clean image x_0, we arrive at this equation
involving the score.
Similarly, we can parameterize our mean estimate using the same form.
How can we compute this score?
From a simple derivation, it turns out it's just
the noise multiplied by a negative scalar.
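The relation between the score and the noise can be sketched in a few lines. The notation is assumed: alpha-bar_t is the cumulative signal scale of the forward process, so that x_t = sqrt(alpha-bar_t) x_0 + sqrt(1 - alpha-bar_t) epsilon.

```latex
\begin{align}
\text{Tweedie:}\quad
  \sqrt{\bar\alpha_t}\, \mathbb{E}[x_0 \mid x_t]
  &= x_t + (1 - \bar\alpha_t)\, \nabla_{x_t} \log p(x_t) \\
\text{Forward process:}\quad
  \sqrt{\bar\alpha_t}\, x_0
  &= x_t - \sqrt{1 - \bar\alpha_t}\; \epsilon \\
\Rightarrow\quad
  \nabla_{x_t} \log p(x_t)
  &= -\frac{1}{\sqrt{1 - \bar\alpha_t}}\; \epsilon
\end{align}
```

The negative scalar is -1 / sqrt(1 - alpha-bar_t). In the last line, epsilon should be read as the expected noise given x_t, which is what the denoising network estimates.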
Intuitively, since the noisy image x_t comes from adding a noise
to the clean image x_0, moving the opposite direction of the noise
to denoise an image naturally increases the log-probability.
Here's the formulation of score based models.
Let's visualize the process in a simple 2-D plot.
We take a clean image x_0 from our training dataset.
Select a time step t and scale the clean image.
We then sample a random noise and scale it.
Adding these two up gives the noisy image x_t.
We can train the diffusion model to remove
all the noise and directly predict a clean image.
However, this is challenging, so the denoised image may
still be low quality.
So instead we only take a small step along this line.
Alternatively, we can ask the network to predict
what noise has been added to this image
and take an opposite direction for a small step.
Or we can ask the network to predict the score.
In all three cases, we arrive at exactly the same distribution
for the less noisy image x_{t-1} given a noisy image x_t.
We can sample
from this distribution and repeat the process.
This allows us to create a path from noise
to clean image.
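The repeated "predict, step, sample" loop can be sketched as below. It assumes the noise-prediction form of the mean and the DDPM choice of beta_t as the sampling variance. The network is replaced by `oracle_eps`, a hypothetical predictor that is exact for a toy dataset whose only "image" is the constant TARGET, so the loop can be checked without training anything.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(eps_model, shape, rng):
    """Walk from pure noise to a clean sample, one small denoising step at a time."""
    x = rng.standard_normal(shape)                       # start from Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, t)                        # predicted noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise             # sample the less noisy image
    return x

TARGET = 0.5  # toy dataset: every clean "image" equals this constant

def oracle_eps(x_t, t):
    """Hypothetical perfect noise predictor for the one-point toy dataset."""
    return (x_t - np.sqrt(alpha_bars[t]) * TARGET) / np.sqrt(1.0 - alpha_bars[t])

out = sample(oracle_eps, (4,), np.random.default_rng(0))
```

With a perfect predictor the path ends at the clean data, here the constant TARGET. A trained network only approximates this predictor, which is one reason many small steps are used.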
After training our diffusion model,
we can use the denoising network to generate many samples.
This is called unconditional generation.
But perhaps we want to ask the model to generate specific contents
like cat images.
Or perhaps we want to see more cats wearing sunglasses using a text prompt.
Let's revisit the score estimation interpretation of diffusion models.
At time step t,
we use our denoising network to predict the log-likelihood
gradient, which is the direction to maximize the log probability.
For conditional generation, we want to predict the conditional score.
By applying the Bayes' rule,
the conditional score consists of an unconditional score
and an adversarial gradient of a classifier.
We can scale the adversarial gradient by a positive factor gamma.
This is great because we can reuse the unconditional model
and use an additional classifier to guide the generation.
We call this "classifier guidance".
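Written out, the decomposition looks like this. The notation is assumed: y is the condition (for example a class label) and gamma is the guidance scale mentioned above.

```latex
\begin{align}
\nabla_{x_t} \log p(x_t \mid y)
  &= \nabla_{x_t} \log \frac{p(x_t)\, p(y \mid x_t)}{p(y)} \\
  &= \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{unconditional score}}
   + \underbrace{\nabla_{x_t} \log p(y \mid x_t)}_{\text{classifier gradient}} \\
\text{guided score}
  &= \nabla_{x_t} \log p(x_t) + \gamma\, \nabla_{x_t} \log p(y \mid x_t)
\end{align}
```

The term log p(y) drops out because it does not depend on x_t.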
But we'll have to train another classifier
because an off-the-shelf classifier is usually trained with clean images.
Luckily, we can use
the predicted noise to estimate a clean image.
This estimated clean image is a bit blurry,
but off-the-shelf classifiers are usually fairly robust to this.
This is nice, but do we really need to use an additional classifier?
By applying Bayes' rule to the second term, we see that it
consists of an unconditional and a conditional score.
Plugging it back to our equation,
we get "classifier free guidance".
But training two denoising networks is expensive,
so we train a single conditional denoising network and use
a null condition to represent the unconditional model.
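A minimal sketch of how the two predictions are combined at sampling time follows. It assumes the noise-prediction form and the gamma convention used in this video (other write-ups use a differently shifted scale). `toy_eps_model` is a hypothetical stand-in for the single conditional network, with `None` playing the role of the null condition.

```python
import numpy as np

def cfg_eps(eps_model, x_t, t, cond, gamma):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    eps_uncond = eps_model(x_t, t, None)    # null condition = unconditional model
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + gamma * (eps_cond - eps_uncond)

def toy_eps_model(x_t, t, cond):
    """Hypothetical denoiser: the condition simply shifts the prediction."""
    shift = 0.0 if cond is None else float(cond)
    return x_t - shift

x_t = np.zeros(3)
unguided = cfg_eps(toy_eps_model, x_t, 10, cond=1.0, gamma=0.0)
conditional = cfg_eps(toy_eps_model, x_t, 10, cond=1.0, gamma=1.0)
amplified = cfg_eps(toy_eps_model, x_t, 10, cond=1.0, gamma=2.0)
```

Setting gamma to 0 recovers the unconditional prediction, gamma of 1 gives the plain conditional one, and larger values push further in the direction of the condition. The cost is two network evaluations per step.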
Here is a comparison between unguided samples
and samples with classifier-free guidance.
How do we generate high resolution images?
There are probably three types of methods.
The first one is using cascade.
We first use a diffusion model to generate a low-resolution
image, say 64 by 64.
We then train a separate diffusion
model that upscales the low-resolution image to a higher resolution.
The second type of approach is Latent Diffusion Models (LDM).
We first train a Variational Autoencoder
that encodes a high-resolution image into a low-resolution latent code.
We train both the encoder and decoder using the reconstruction loss,
along with an adversarial loss to ensure sharp results
and the regularization loss on the latent.
Now we can train our diffusion model efficiently in the latent space.
Once we get a clean latent,
we use the decoder to map it back to a high-resolution image.
In both cascade and latent diffusion models,
we need to train several models separately.
End-to-end methods aim to generate high-resolution images
with a single diffusion model.
Several promising ideas have been proposed,
such as adjusting the noise schedules for high-resolution image generation,
multiscale loss, and progressive training.
Diffusion models are very slow
because we need to evaluate the denoising network
several hundreds or even 1000 times to get a good sample.
Here are some methods that accelerate the sampling
to make diffusion models more practical.
Let's first review the training objective of DDPM.
We train our denoising network to predict the noise.
If we look at this training objective and don't care about
what the weighting term is, we just need to ensure that: 1)
the forward diffusion model remains the same;
2) the mean of the ground truth denoising step is a linear
combination of the noisy image x_t and the added noise epsilon; and
3) the estimated mean is of the same form.
These three constraints do NOT assume the transition
probability to be a Markovian process.
In this DDIM paper, they construct a class of non-Markovian
diffusion processes and find a and b that satisfy these constraints.
This gives us these two Gaussian distributions.
Interestingly,
the sigma_t can be set to arbitrary values.
By setting them to zero, we get a deterministic generative process.
The only randomness comes from the initial sample noise.
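A sketch of the deterministic case (sigma_t = 0) is shown below, using the standard DDIM update on a subset of time steps. The schedule, the function name `ddim_sample`, and the step-selection rule are assumptions for illustration. As a check, `oracle_eps` is a hypothetical perfect predictor for a toy dataset whose only "image" is the constant TARGET.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)

def ddim_sample(eps_model, x_T, num_steps):
    """Deterministic DDIM sampling (sigma_t = 0) over a subset of time steps."""
    steps = np.linspace(len(betas) - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(steps):
        ab_t = alpha_bars[t]
        # alpha_bar of the next (less noisy) step; 1.0 means a fully clean image
        ab_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps_hat = eps_model(x, t)
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)    # predicted clean image
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat  # no fresh noise added
    return x

TARGET = 0.5  # toy dataset: every clean "image" equals this constant

def oracle_eps(x_t, t):
    """Hypothetical perfect noise predictor for the one-point toy dataset."""
    return (x_t - np.sqrt(alpha_bars[t]) * TARGET) / np.sqrt(1.0 - alpha_bars[t])

x_T = np.random.default_rng(0).standard_normal(4)
out_a = ddim_sample(oracle_eps, x_T, num_steps=20)
out_b = ddim_sample(oracle_eps, x_T, num_steps=20)
```

Since no noise is injected inside the loop, the same starting noise always produces the same output, and the loop can skip most of the 1000 training time steps.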
Here is the quantitative evaluation.
With 1000 denoising steps, DDPM does perform better.
But when we reduce the number of denoising steps, the quality
of the DDPM quickly degrades, while the results from DDIM remain decent.
The best thing is that we don't need to retrain the model.
We just need to take the model trained with the DDPM objective
and accelerate it with the DDIM sampler.
But even with the DDIM sampler, sampling still requires quite a few steps.
Let's further reduce the number of steps using distillation.
We can use a pre-trained model as a teacher and teach a student
denoising network to use one sampling step to reproduce the output
of a teacher network using two sampling steps.
So after this distillation process, we halve the number of sampling steps.
We can ask the student model to be a new teacher model
and repeat the process until we reach the target sampling steps.
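To make the repeated halving concrete, here is a small sketch that lists the step counts of each teacher and student round. The function name and the example numbers (1024 down to 4) are illustrative assumptions, not values from the video.

```python
def distillation_schedule(teacher_steps, target_steps):
    """List the sampling-step counts after each round of step-halving distillation."""
    if target_steps < 1:
        raise ValueError("target_steps must be at least 1")
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        # Each round trains a student that needs half as many steps as its teacher.
        schedule.append(max(schedule[-1] // 2, 1))
    return schedule

rounds = distillation_schedule(1024, 4)
print(rounds)  # [1024, 512, 256, 128, 64, 32, 16, 8, 4]
```

Going from 1024 to 4 steps takes 8 rounds in this sketch, since the number of rounds grows with the base-2 logarithm of the speed-up.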
Another
idea is to distill classifier-free guided diffusion
that requires evaluating both conditional
and unconditional models into one single model.
We can further apply previous ideas to make it faster,
such as progressive distillation and latent diffusion models.
We can distill a pre-trained
denoising network using consistency models.
The main idea is to train a model so that for any points
on the path, the model predicts the same origin.
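This "same origin" property can be stated compactly. The notation below is assumed and follows the usual consistency-model formulation: f_theta is the model, x_t and x_{t'} are two points on the same sampling path, and epsilon here denotes a small minimum time, not the noise variable.

```latex
\begin{align}
f_{\theta}(x_t, t) &= f_{\theta}(x_{t'}, t')
  \quad \text{for all } t, t' \in [\epsilon, T] \text{ on the same path} \\
f_{\theta}(x_{\epsilon}, \epsilon) &= x_{\epsilon}
  \quad \text{(boundary condition: the origin maps to itself)}
\end{align}
```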
This supports single-step generation
as well as multi-step generation.
Applying consistency distillation in the latent space gives us
high-resolution image generation using only a few steps.
We can also extend the idea of latent consistency model with LoRA.
But what's LoRA?
Given a pretrained model,
sometimes we want to generate content in a particular style,
such as pixel art, LEGO, IKEA instructions, and anime.
We can achieve these
styles by fine-tuning the base model with additional data.
However, this is computationally expensive and requires a lot of storage.
In many cases, it turns out we only need to fine-tune
the cross-attention layers in the denoising network.
We freeze the pre-trained weight W_0 and optimize the residual parameters,
but there are still a lot of parameters to update.
We can reduce the number of parameters
using low rank approximation on the weight matrix.
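A minimal sketch of the low-rank idea follows. The layer size (768 by 768), the rank (4), and the initialization are illustrative assumptions; the frozen weight is called W0 as in the video, and A and B are the usual names for the two small trainable factors.

```python
import numpy as np

d_out, d_in, rank = 768, 768, 4            # assumed layer size and rank
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d_out, d_in))    # frozen pre-trained weight
B = np.zeros((d_out, rank))                # trainable, zero init so the update starts at zero
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable

def lora_forward(x):
    """Apply the adapted weight W0 + B @ A to a batch of row vectors."""
    return x @ (W0 + B @ A).T

full_params = W0.size                      # parameters updated by full fine-tuning
lora_params = A.size + B.size              # parameters updated by the low-rank residual
print(full_params, lora_params)            # 589824 6144
```

In this toy setting the residual has 96 times fewer trainable parameters than the full matrix, and at initialization the adapted layer behaves exactly like the frozen one.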
Therefore, we can accelerate the consistency distillation with LoRA.
Here the acceleration vector is the parameter difference
between the distilled and the base model.
More interestingly, we can accelerate other fine-tuned models
by linearly combining the acceleration and the style vectors.
Here are some examples.
Another recent distillation method trains the student denoising model
using score distillation.
However,
these predictions are usually blurry.
Their main idea is to apply
an adversarial loss by training a discriminator.
Here are some results.
After this distillation, these models can achieve text-to-image
generation at interactive rates.
To sum up, we covered the training objective
for the diffusion models and their three interpretations.
How we can guide the generation with classifier
and classifier-free guidance.
How we can synthesize high-resolution images
using cascade, latent, and end-to-end diffusion models,
and how we can speed up the sampling using the DDIM
sampler and various distillation techniques.
Please comment below if you have any questions.
Thanks for learning with me.
I will see you next time.