How I Understand Diffusion Models

Jia-Bin Huang
8 Jan 2024 · 17:38

Summary

TLDR: This video script delves into diffusion models, pivotal for generating images from text, animating people, creating videos, and producing 3D models. It outlines the training process, emphasizing the Maximal Likelihood goal and the Evidence Lower Bound (ELBO). The script explains the encoding and decoding mechanisms, highlighting the importance of the reparametrization trick. It also covers techniques for conditional generation, high-resolution image synthesis, and acceleration methods like DDIM and distillation. The script concludes with a call for questions, inviting viewer engagement.

Takeaways

  • 🎨 Diffusion models are powerful generative models used for creating images, animations, videos, and 3D models from textual descriptions or by specifying conditions.
  • 🧠 The core of diffusion models lies in training a denoising network that can iteratively refine noise into high-quality images, starting from a simple Gaussian distribution.
  • 📈 Maximal Likelihood is the training goal, aiming to maximize the sample probability under the model, which is equivalent to minimizing the Kullback–Leibler divergence between the model's distribution and the true data distribution.
  • 🔍 The Evidence Lower Bound (ELBO) is a key concept in training, serving as a lower bound on the log-likelihood and is optimized by maximizing it during the training of diffusion models.
  • 🌐 The encoding process in diffusion models is characterized by progressively adding noise over multiple steps, while the decoding process involves removing noise to generate the final output.
  • 🔄 The training of diffusion models involves three main objectives: prior matching, reconstruction, and denoising matching, which ensure the model's output aligns with the Gaussian prior and reconstructs the original image accurately.
  • 🔮 Three interpretations of the training objective are predicting the clean image, the noise itself, or the score (the gradient of the log-likelihood), each providing different insights into the model's operation.
  • 📸 For conditional generation, such as creating images of cats wearing sunglasses from a text prompt, diffusion models can be guided using either a classifier or a classifier-free approach.
  • 🖼️ High-resolution image generation can be achieved through various methods like cascade models, latent diffusion models, or end-to-end models, each with its own approach to scaling up image resolution.
  • ⏱️ To accelerate the sampling process and make diffusion models more practical, techniques such as DDIM, distillation, and consistency models are used to reduce the number of required denoising steps.

Q & A

  • What are diffusion models known for?

    -Diffusion models are known for their ability to generate high-quality images from text descriptions, animate people, create videos, and produce 3D models.

  • What are the four main topics covered in the video regarding diffusion models?

    -The four main topics covered are Training, Guidance, Resolution, and Speed.

  • How does a generative model translate noise samples into high-quality images?

    -A generative model aims to find parameters of a decoder that can transform a simple Gaussian distribution into a complex natural image distribution.

  • What is the training goal in terms of data distribution?

    -The training goal is to maximize the sample probability, known as Maximal Likelihood, which involves approximating the true data distribution by collecting many data samples.

  • How is the log-likelihood value simplified in the context of diffusion models?

    -The log-likelihood value is simplified by applying a log function to turn the product into a sum, which helps rewrite it as an expectation.

  • What is the Evidence Lower Bound (ELBO) and why is it significant?

    -ELBO is a lower bound on the log-likelihood, which measures the statistical evidence for the model. It is used to train models like Variational Auto-Encoders (VAEs) and diffusion models by maximizing this bound.

  • How does the encoding process in diffusion models differ from traditional encoding?

    -In diffusion models, the encoding process adds noise to the image progressively over multiple steps, rather than encoding the observation in one step like in traditional models.

  • What are the three terms in the objective function for training a diffusion model?

    -The three terms are: 1) Prior matching, ensuring the latent distribution is similar to the Gaussian distribution at the end of the diffusion steps. 2) Reconstruction, similar to the VAE's reconstruction term. 3) Denoising matching, which matches the network's predicted denoising distribution to the ground-truth distribution of the less noisy image given the noisy one and the clean image.

  • How does the reparametrization trick help in modeling noisy images?

    -The reparametrization trick allows a random variable to be expressed as a deterministic function of a noise variable, which helps in modeling noisy images by representing them as a function of a clean image and a noise variable.

  • What is the role of the denoising network during the training of a diffusion model?

    -The denoising network models the probability of a less noisy image given a noisy image and guides the generation process by predicting the direction to maximize the log probability, which is crucial for both unconditional and conditional generation.

  • How can high-resolution images be generated using diffusion models?

    -High-resolution images can be generated using methods like cascade models, Latent Diffusion Models (LDM), or end-to-end diffusion models, each involving different approaches to upscaling or encoding images before decoding to the desired resolution.

  • What are some methods to accelerate the sampling process in diffusion models?

    -Acceleration methods include using Denoising Diffusion Implicit Models (DDIM) for a deterministic generative process, distillation to reduce the number of sampling steps, and consistency models to ensure the model predicts the same origin for any point on the path.

Outlines

00:00

🧠 Understanding Diffusion Models

This paragraph introduces diffusion models, emphasizing their versatility in generating high-quality images, animations, videos, and 3D models from textual descriptions. The video aims to demystify how these models function by focusing on four key areas: training, guidance, resolution, and speed. It begins with a foundational concept, the Gaussian distribution, which is used to sample noise and transform it into complex natural image distributions. The training goal is to maximize the sample probability through Maximal Likelihood, which is simplified using logarithmic functions and expectations. The Kullback–Leibler divergence is introduced to show that maximizing the likelihood is equivalent to minimizing the divergence between the generative model's distribution and the true data distribution. The paragraph also discusses the challenges in computing the log-likelihood and introduces the concept of an encoder to capture latent variable probabilities. The discussion concludes with the introduction of the Evidence Lower Bound (ELBO), a key quantity in training generative models like Variational Auto-Encoders (VAEs) and diffusion models.
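
As a reference for the terms used above, here is a compact sketch of the ELBO in the usual latent-variable notation; the notation is standard but not copied verbatim from the video.

```latex
% Introducing an encoder q(z|x) gives a lower bound on the log-likelihood:
\log p_\theta(x)
\;\ge\;
\mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q(z \mid x)}\right]
\;=\;
\underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(q(z \mid x)\,\|\,p(z)\right)}_{\text{prior matching}}
```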

05:02

🔍 Deep Dive into Denoising and Training

The second paragraph delves into the specifics of the encoding and decoding processes within diffusion models, highlighting the progressive addition and removal of noise. It discusses the training process, which involves maximizing the Evidence Lower Bound (ELBO), similar to VAEs. The encoding process is described as a series of transition probabilities, leading to a noise distribution after multiple steps. The paragraph then breaks down the training objective into three components: prior matching, reconstruction, and denoising matching. It explains how the denoising network is trained to predict the 'less noisy' image, using the reparametrization trick to express a noisy image as a function of a clean image and a noise variable. The discussion also covers the use of Bayes' rule and the concept of KL-divergence in training the denoising network, ultimately aiming to minimize the distance between the predicted and ground truth means of the distributions.
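
To make the denoising-matching objective concrete, here is a minimal PyTorch-style sketch of the noise-prediction training step described above. The names denoise_net, alpha_bar, and optimizer are placeholders for a denoising network, the cumulative noise schedule, and an optimizer; none of them come from the video itself.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(denoise_net, x0, alpha_bar, optimizer):
    """One noise-prediction training step (epsilon parameterization).

    denoise_net: network predicting the added noise from (x_t, t)  [placeholder]
    x0:          batch of clean images, shape (B, C, H, W)
    alpha_bar:   1-D tensor of cumulative products of the noise schedule, length T
    """
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)      # random timestep per sample
    eps = torch.randn_like(x0)                            # Gaussian noise to add
    a = alpha_bar[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps            # forward diffusion in closed form
    eps_pred = denoise_net(x_t, t)                        # predict the added noise
    loss = F.mse_loss(eps_pred, eps)                      # L2 between predicted and true noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```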

10:03

🖼️ Enhancing Image Generation with Guidance

Paragraph three explores the concept of conditional generation in diffusion models, focusing on how to guide the model to produce specific content. It introduces the idea of 'classifier guidance' where an additional classifier is used to guide the generation process. The paragraph also discusses the use of 'classifier-free guidance' which eliminates the need for a separate classifier by training a single conditional denoising network. The discussion then shifts to methods for generating high-resolution images, including cascade models, Latent Diffusion Models (LDM), and end-to-end approaches. Each method is briefly explained, highlighting their advantages and the trade-offs involved. The paragraph concludes with various techniques to accelerate the sampling process, making diffusion models more practical for real-world applications.
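
As a rough illustration of classifier-free guidance (not code from the video), the conditional and unconditional noise predictions can be combined as below; denoise_net, cond, and null_cond are hypothetical placeholders for a conditional denoising network, a text/class embedding, and the learned null condition.

```python
import torch

def classifier_free_guidance(denoise_net, x_t, t, cond, null_cond, guidance_scale=7.5):
    """Combine conditional and unconditional noise predictions from one network."""
    eps_cond = denoise_net(x_t, t, cond)         # prediction with the condition
    eps_uncond = denoise_net(x_t, t, null_cond)  # prediction with the null condition
    # Push the result further in the direction suggested by the condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```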

15:03

⏩ Accelerating Diffusion Models

The final paragraph focuses on methods to speed up diffusion models, which are typically slow due to the need for multiple evaluations of the denoising network. It introduces DDIM, a method that constructs non-Markovian diffusion processes to maintain quality with fewer steps. The paragraph also discusses distillation techniques, where a student model is trained to reproduce the output of a teacher model using fewer steps, effectively halving the number of sampling steps. Further acceleration is achieved through progressive distillation and latent diffusion models. The discussion includes the use of LoRA (Low-Rank Adaptation) to reduce the number of parameters in fine-tuning models, making the process more efficient. The paragraph concludes with a mention of consistency models and score distillation, which further enhance the speed and quality of diffusion models for tasks like text-to-image generation.
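
For the DDIM idea mentioned above, a deterministic sampling update (with sigma_t set to zero) might look like the following sketch; the schedule tensor alpha_bar and the network denoise_net are assumed placeholders, not the paper's reference implementation.

```python
import torch

def ddim_step(denoise_net, x_t, t, t_prev, alpha_bar):
    """One deterministic DDIM update (sigma_t = 0), from timestep t to t_prev."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = denoise_net(x_t, t)
    # Estimate the clean image implied by the current noisy sample.
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    # Re-noise that estimate to the earlier timestep without adding fresh randomness.
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```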

Keywords

💡Diffusion models

Diffusion models are a class of generative models used in deep learning for generating data samples, such as images, that resemble real-world data. In the context of the video, they are highlighted for their ability to create high-quality images from text descriptions, animate people, and even produce 3D models. The script explains that diffusion models work by gradually adding noise to an image over multiple steps and then learning to reverse this process to generate new images. This concept is central to understanding the video's theme of how these models are trained and used for various applications.
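
To ground the idea of iteratively refining noise into an image, here is a minimal sketch of DDPM-style ancestral sampling; denoise_net and the schedule tensors alpha, alpha_bar, and sigma are assumptions for illustration rather than any particular library's API.

```python
import torch

@torch.no_grad()
def sample(denoise_net, shape, T, alpha, alpha_bar, sigma):
    """Generate a sample by iteratively denoising pure Gaussian noise."""
    x = torch.randn(shape)                                   # start from the Gaussian prior
    for t in reversed(range(T)):
        eps = denoise_net(x, torch.full((shape[0],), t))     # predict the added noise
        mean = (x - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + sigma[t] * noise                          # sample the less noisy image x_{t-1}
    return x
```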

💡Generative model

A generative model is a type of machine learning model that can generate new data samples similar to the training data. In the video, the generative model's goal is to transform simple noise samples into complex, natural images. The script mentions designing a generative model that translates noise into high-quality images, which is a key aspect of diffusion models' functionality.

💡Maximal Likelihood

Maximal Likelihood is a statistical method used to estimate the parameters of a probability distribution or statistical model. The video script discusses the training goal of maximizing the sample probability, known as Maximal Likelihood, to approximate the true data distribution. This concept is crucial for understanding how diffusion models are trained to generate images that closely match the training data.

💡Kullback–Leibler divergence

The Kullback–Leibler (KL) divergence is a measure used in statistics to determine how one probability distribution diverges from a second, expected probability distribution. In the script, it is mentioned that maximizing the likelihood is equivalent to minimizing the KL-divergence between the generative model's distribution and the data distribution, which helps in understanding the optimization process in diffusion models.
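
For reference, here is the standard definition and the equivalence mentioned above, written in conventional notation rather than quoted from the script.

```latex
D_{\mathrm{KL}}\!\left(q \,\|\, p\right)
\;=\; \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right] \;\ge\; 0

% Up to a constant that does not depend on \theta:
\arg\max_\theta \, \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]
\;=\;
\arg\min_\theta \, D_{\mathrm{KL}}\!\left(p_{\text{data}}\,\|\,p_\theta\right)
```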

💡Encoder

In the context of the video, an encoder is a component of a generative model that captures the latent variable probability given an observation. The script explains that an encoder is introduced to help with the transformation of noise samples into images, and it plays a significant role in the training process of diffusion models by capturing the distribution of latent variables.

💡Evidence Lower Bound (ELBO)

ELBO, or Evidence Lower Bound, is a lower bound on the log likelihood of the data in probabilistic models. The script discusses how the first term in the objective function of a diffusion model is the ELBO, which is used to train the model by maximizing this lower bound. This concept is essential for understanding the training dynamics of diffusion models.
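
For the diffusion case specifically, the ELBO decomposes into the three terms the video describes (reconstruction, prior matching, denoising matching); this is the standard form, written here for reference rather than quoted from the script.

```latex
\log p_\theta(x_0) \;\ge\;
\underbrace{\mathbb{E}_{q(x_1 \mid x_0)}\!\left[\log p_\theta(x_0 \mid x_1)\right]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right)}_{\text{prior matching}}
\;-\;
\sum_{t=2}^{T}
\underbrace{\mathbb{E}_{q(x_t \mid x_0)}\!\left[
D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)\right]}_{\text{denoising matching}}
```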

💡Variational Auto-Encoder (VAE)

A Variational Auto-Encoder is a type of generative model that learns to encode and decode data by maximizing the evidence lower bound. The video script compares diffusion models to VAEs, noting that they both use latent variable models but differ in their approach to encoding and decoding. VAEs are mentioned as a way to parameterize both the encoder and decoder as neural networks, which helps in training the model.

💡Denoising

Denoising in the context of the video refers to the process of reducing noise in an image to generate a cleaner version. The script explains that diffusion models encode images by adding noise and decode by progressively removing it. Denoising is a key process in the decoding phase of diffusion models, where the model learns to predict a 'less noisy' image given a noisy one.

💡Reparametrization trick

The reparametrization trick is a technique used in variational inference to express a random variable as a deterministic function of a noise variable. The script mentions this trick in the context of expressing a noisy image as a function of a clean image and a noise variable, which is crucial for the training of diffusion models as it allows for the manipulation of random variables in a deterministic manner.
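
A minimal statement of the trick, and how it is applied recursively to the forward diffusion process, in standard notation assumed rather than quoted from the video:

```latex
x \sim \mathcal{N}(\mu, \sigma^2)
\;\Longleftrightarrow\;
x = \mu + \sigma\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)

% Applied recursively to the forward diffusion process:
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})
```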

💡Classifier guidance

Classifier guidance is a technique used in conditional generation where an additional classifier is used to guide the generation process. The video script discusses how this method can be used to generate specific content, such as cat images or images with text prompts, by modifying the generative process to include a classifier that guides the model towards the desired output.
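
In score form, the guidance decomposition described here can be written as follows; the guidance scale gamma and the classifier p_phi are placeholders in the usual notation, not symbols taken from the video.

```latex
% Bayes' rule applied to the score of the conditional distribution:
\nabla_{x_t} \log p(x_t \mid y)
\;=\;
\nabla_{x_t} \log p(x_t) \;+\; \nabla_{x_t} \log p(y \mid x_t)

% Classifier guidance scales the classifier gradient by \gamma and folds it into the noise prediction:
\tilde{\epsilon}_\theta(x_t, y)
\;=\;
\epsilon_\theta(x_t) \;-\; \gamma\,\sqrt{1-\bar{\alpha}_t}\;\nabla_{x_t} \log p_\phi(y \mid x_t)
```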

💡High-resolution images

High-resolution images are detailed images with a high pixel count, which are more complex to generate due to the amount of detail they contain. The script covers methods for generating high-resolution images using diffusion models, such as cascade models, latent diffusion models, and end-to-end methods. These techniques are important for applications where the quality and detail of the generated images are crucial.
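
As a rough sketch of the latent diffusion idea (placeholder names vae and latent_sampler, not an actual library interface): all denoising steps run in a compressed latent space, and a decoder maps the result back to pixels once at the end.

```python
import torch

@torch.no_grad()
def ldm_generate(vae, latent_sampler, latent_shape=(1, 4, 64, 64)):
    """Latent Diffusion Model sketch: denoise in a low-resolution latent space,
    then decode once to a high-resolution image. `vae` and `latent_sampler`
    stand in for a pretrained autoencoder and any diffusion sampler."""
    z = latent_sampler(torch.randn(latent_shape))  # all denoising happens on cheap latents
    return vae.decode(z)                           # a single decode produces the final image
```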

Highlights

Diffusion models are used for generating images from text, animating people, creating videos, and 3D models.

The generative model's goal is to transform simple Gaussian noise into high-quality images.

Maximizing sample probability, or Maximal Likelihood, is the training goal for approximating the true data distribution.

The likelihood is simplified by applying a log function, turning the product into a sum so it can be rewritten as an expectation.

Maximizing likelihood is equivalent to minimizing the Kullback–Leibler divergence between the generative model's distribution and the data distribution.

An encoder is introduced to capture the latent variable probability given an observation.

The Evidence Lower Bound (ELBO) is used as a lower bound of the log-likelihood value for training.

Diffusion models encode images in multiple steps by progressively adding noise, unlike VAEs which encode in one step.

The encoding process in diffusion models is defined as a product of transition probabilities.

The training objective for diffusion models includes prior matching, reconstruction, and denoising matching.

The reparametrization trick is used to rewrite a random variable as a deterministic function of a noise variable.

The denoising network is trained to predict the mean of a less noisy image given a noisy image.

The training objective can be interpreted in three ways: predicting the clean image, the noise, or the score.

Classifier guidance is used for conditional generation by adding an adversarial gradient of a classifier.

Classifier-free guidance is an alternative to classifier guidance, training a single conditional denoising network.

High-resolution image generation can be achieved through cascade models, latent diffusion models, or end-to-end methods.

DDIM constructs non-Markovian diffusion processes, enabling faster, deterministic sampling with fewer steps.

Distillation techniques can be used to accelerate diffusion models, reducing the number of sampling steps.

LoRA is a method to reduce the number of parameters in fine-tuning the denoising network for style generation.

Score distillation is a recent method that trains a student denoising model to reproduce the output of a teacher model more quickly.

Transcripts

play00:00

Diffusion models excel in many applications such as generating

play00:03

beautiful images from texts, animating a person,

play00:07

creating videos

play00:10

and 3D models.

play00:11

We are going to learn and understand how diffusion models work.

play00:14

In this video, we'll cover four main topics: Training, Guidance,

play00:18

Resolution, and Speed.

play00:25

Let's start with a simple Gaussian distribution for easy sampling.

play00:28

We want to design a generative model translating these noise samples

play00:31

into high-quality images.

play00:33

More specifically, we want to find the parameters of this decoder

play00:37

that can transform a simple Gaussian distribution

play00:39

into a complicated natural image distribution.

play00:42

However, we do not know what the true data distribution is

play00:45

and can only approximate it by collecting lots of data samples.

play00:49

Our training goal is maximize the sample probability

play00:53

known as Maximal Likelihood.

play00:55

Let's see what this maximal likelihood is trying to do.

play00:58

We first apply a log function to simplify the expression,

play01:01

turning the product into a sum.

play01:03

This helps us rewrite it as an expectation.

play01:06

Here is the definition.

play01:08

If we subtract a constant

play01:09

that has nothing to do with the parameter theta,

play01:11

we find that this is just the Kullback–Leibler

play01:13

divergence between the distribution from our generative model

play01:16

and the data distribution.

play01:18

So maximizing the likelihood means minimizing

play01:21

the divergence between these two distributions.

play01:24

Sounds great, but it's hard to compute this log-likelihood value.

play01:27

Either it involves integrating out all latent variable z's

play01:31

or assuming that we know the ground truth latent encoder.

play01:34

Let's see what we can do.

play01:35

We first introduce

play01:36

an encoder capturing the latent variable probability given an observation.

play01:41

This term is one.

play01:42

since we integrate over all latent variables z.

play01:45

Let's move the log likelihood inside the integral and express this

play01:49

as an expectation.

play01:50

Here we can apply the Bayes' rule and multiply a dummy term.

play01:54

Next, we swap these posterior probability terms and separate them.

play01:59

We now recognize the second term is a KL-divergence

play02:02

between our encoder Q of Z given X and a ground truth encoder P of Z.

play02:06

given X.

play02:07

We don't know this value because we don't have the access

play02:10

to the ground truth encoder P of Z given X.

play02:13

But we do know the KL-divergence is non-negative.

play02:17

This means that the first term is the lower

play02:18

bound of the log likelihood value.

play02:21

Since the log-likelihood measures the statistical evidence for our model,

play02:25

This term is known as evidence lower bound or ELBO.

play02:29

One popular type of latent variable models is the Variational Auto-Encoder (VAE),

play02:33

where they parameterize

play02:34

both the encoder and the decoder as Neural Networks.

play02:37

We can train both the encoder and decoder by maximizing the ELBO.

play02:42

Diffusion models are also latent variable models.

play02:44

But instead of encoding the observation

play02:47

X in one step, it encodes the image in multiple

play02:50

steps by progressively adding more and more noise.

play02:53

Similarly, the decoding process progressively remove the noise

play02:57

to generate a sample.

play02:59

In VAE, we have observed variable X and latent variable Z.

play03:03

Similarly, in diffusion models, we call the clean image

play03:05

X0 and the latent X1 to XT.

play03:09

We can train the diffusion model in the same way as we train

play03:12

a VAE by maximizing the evidence lower bound.

play03:16

Okay, let's

play03:16

first take a look at what the encoding process looks like.

play03:20

We can write the encoding process as a product of transition probabilities.

play03:24

We define a transition probability at each time step as a distribution

play03:28

where the mean is the image from the previous time step XT minus one

play03:32

scaled by a scalar that's less than one, and some variance.

play03:36

This encoding process ensures that the latent variables

play03:39

become pure noise after many time steps.

play03:42

So with some derivation, our objective has three terms:

play03:45

1) Prior matching, 2) reconstruction and 3) denoising matching.

play03:50

The first term says that the latent distribution will be similar

play03:53

to the Gaussian distribution at the end of the diffusion steps.

play03:56

This is automatically satisfied by our forward diffusion process.

play04:00

The second term is similar to the reconstruction term

play04:02

in the Variational Autoencoder and is simple to compute.

play04:06

I want to focus on how we can maximize this denoising matching term

play04:09

or minimize this one.

play04:11

Here we see three probability distributions.

play04:13

First, what's a probability of a noisy image at timestamp t?

play04:16

Given a clean image x0?

play04:18

All we know is the transition probability of x_t

play04:22

given x_{t-1}.

play04:23

To do this, we need to know the reparametrization trick.

play04:26

This trick helps rewrite a random variable x

play04:29

as a deterministic function of a noise variable epsilon.

play04:32

Intuitively, we can represent a Gaussian distribution by scaling

play04:36

the epsilon by the standard deviation and shifting the mean.

play04:39

With this trick we can express x1, x2 and so on.

play04:43

Plugging x1 into the second equation leads to this expression.

play04:47

Now we can simplify this because the sum of two

play04:50

independent Gaussian variables is also a Gaussian.

play04:53

Doing this recursively, we can write a noisy image

play04:56

at timestamp t as a function of clean image x_0 and noise variable epsilon.

play05:01

This means that we can directly sample from this Gaussian distribution.

play05:07

The second term says the following: Suppose

play05:09

we know the clean image x_0

play05:11

and the noisy version of it after t forward diffusion steps,

play05:15

what's the probability of a "less noisy" image?

play05:17

x_{t-1}?

play05:19

This tells us how to denoise a noisy image

play05:22

when knowing the ground truth clean image x_0.

play05:25

We use this to guide our denoising network that models

play05:28

the probability of a less noisy image x_{t-1} given a noisy image x_t.

play05:34

Here is an

play05:35

actual photo of what's happening when training a diffusion model.

play05:38

To derive this term, we apply the definition

play05:41

of conditional probability and Bayes' rule.

play05:43

Here we know exactly what these three probabilities are.

play05:47

After some calculations, we find that it is also a Gaussian distribution.

play05:51

The mean lies on the line between a noisy image and a clean image x_0.

play05:57

We can also compute the variance in closed form.

play06:00

Here, the probability from our denoising network is also a Gaussian.

play06:03

Since both are Gaussian

play06:05

distributions with the same variance,

play06:07

minimizing the KL-divergence term is equivalent to minimizing

play06:11

the distance between the means of the two distributions.

play06:15

The process looks like this: We sample a clean image

play06:18

x_0 from the dataset and a noise image from a Gaussian distribution

play06:22

with zero mean and unit variance.

play06:24

We encode the clean image with forward diffusion to get a noisy image x_t .

play06:29

We then compute the L2 loss between the predicted and ground truth means.

play06:33

By looking at this training objective, we can have three interpretations.

play06:38

First, from the ground truth mean, we see a linear combination of noisy

play06:42

image x_t and clean image x_0.

play06:45

But why do we ask the denoising network to predict a noisy image

play06:49

that we already know from the input?

play06:51

Therefore, we express the estimated mean in the same form as the ground truth one

play06:56

and only ask the model to predict a clean image x_0.

play07:00

Second, from the forward diffusion process,

play07:03

we know the relationship between the noisy image,

play07:05

the clean image, and the added noise.

play07:08

We can express the clean image x_0 as a function

play07:11

of a noisy image x_t and a noise epsilon.

play07:14

When we plug this in, we arrive at this new form of the ground truth mean.

play07:18

Similarly, we can match the form of the

play07:20

estimated mean with the ground truth one,

play07:23

and only ask the denoising network to predict the noise.

play07:27

To discuss

play07:28

the third interpretation, we need to use Tweedie's formula.

play07:31

The formula states that if we observe a sample

play07:34

z from a Gaussian distribution, the posterior expectation of the mean

play07:38

is the sample plus a correction term involving the gradient

play07:42

of the log likelihood or the score of the estimate.

play07:45

Let's apply the formula to our forward diffusion probability.

play07:49

Replace the mean here and we now have this expression.

play07:52

When we replace the clean image x_0, we arrive at this equation

play07:56

involving the score.

play07:58

Similarly, we can parameterize our mean estimate using the same form.

play08:03

How can we compute this score?

play08:04

From a simple derivation, it turns out it's just

play08:07

the noise multiplied by a negative scalar.

play08:10

Intuitively, since the noisy image x_t comes from adding a noise

play08:14

to the clean image x_0, moving the opposite direction of the noise

play08:18

to denoise an image naturally increases the log-probability.

play08:22

Here's the formulation of score based models.

play08:26

Let's visualize the process in a simple 2-D plot.

play08:29

We take a clean image x_0 from our training dataset.

play08:32

Select a timestamp T and scale the clean image.

play08:36

We then sample a random noise and scale it.

play08:39

Adding these two up gives the noisy image x_t.

play08:42

We can train the diffusion model to remove

play08:44

all the noise and directly predict a clean image.

play08:47

However, this is challenging, so the predicted image may

play08:50

still be low quality.

play08:52

So instead we only take a small step along this line.

play08:55

Alternatively, we can ask the network to predict

play08:58

what noise has been added to this image

play09:01

and take an opposite direction for a small step.

play09:04

Or we can ask the network to predict the score.

play09:07

In all three cases, we arrive at exactly the same distribution

play09:11

for the less noisy image x_{t-1} given a noisy image x_t.

play09:16

We can sample

play09:16

from this distribution and repeat the process.

play09:20

This allows us to create a path from noise

play09:22

to clean image.

play09:28

After training our diffusion model,

play09:30

we can use the denoising network to generate many samples.

play09:33

This is called unconditional generation.

play09:36

But perhaps we want to ask the model to generate specific contents

play09:39

like cat images.

play09:41

Or if we want to see more cats wearing sunglasses using a text prompt.

play09:45

Let's revisit the score estimation interpretation of diffusion models.

play09:49

At time step t,

play09:50

we use our denoising network to predict the log-likelihood

play09:53

gradient, which is the direction to maximize the log probability.

play09:57

For conditional generation, we want to predict the conditional score.

play10:01

By applying the Bayes' rule,

play10:03

the conditional score consists of an unconditional score

play10:06

and an adversarial gradient of a classifier.

play10:09

We can scale the adversarial gradient by a positive factor gamma.

play10:14

This is great because we can reuse the unconditional model

play10:17

and use an additional classifier to guide the generation.

play10:21

We call this "classifier guidance".

play10:23

But we'll have to train another classifier

play10:25

because an off-the-shelf classifier is usually trained with clean images.

play10:30

Luckily, we can use

play10:31

the predicted noise to estimate a clean image.

play10:35

This estimated clean image is a bit blurry,

play10:38

but off-the-shelf classifiers are usually fairly robust to this.

play10:42

This is nice, but do we really need to use an additional classifier?

play10:46

By applying the Bayes' rule to the second term, we see that it

play10:50

consists of an unconditional and a conditional score.

play10:54

Plugging it back to our equation,

play10:56

we get "classifier free guidance".

play10:59

But training two denoising networks is expensive,

play11:02

so we train a single conditional denoising network and use

play11:06

null condition to represent the unconditional model.

play11:10

Here is the comparison between unguided

play11:12

and guided samples with classifier-free guidance.

play11:21

How do we generate high resolution images?

play11:23

There are probably three types of methods.

play11:26

The first one is using cascade.

play11:28

We first use a diffusion model to generate a low-resolution

play11:31

image, say 64 by 64.

play11:35

We then train a separate diffusion

play11:36

model that upscales the low-resolution image to a higher resolution.

play11:41

The second type of approach is Latent Diffusion Models (LDM).

play11:45

We first train a Variational Autoencoder

play11:47

that encodes a high-resolution image into a low-resolution latent code.

play11:51

We train both the encoder and decoder using the reconstruction loss,

play11:55

sometimes with an adversarial loss to ensure sharp results

play11:58

and the regularization loss on the latent.

play12:00

Now we can train our diffusion model efficiently in the latent space.

play12:05

Once we get a clean latent,

play12:06

we use the decoder to map it back to a high-resolution image.

play12:10

In both cascade and latent diffusion models,

play12:13

we need to train several models separately.

play12:17

End-to-end methods aim to generate high-resolution images

play12:20

with a single diffusion model.

play12:22

Several promising ideas have been proposed,

play12:24

such as adjusting the noise schedules for high-resolution image generation,

play12:29

multiscale loss, and progressive training.

play12:38

Diffusion models are very slow

play12:40

because we need to evaluate the denoising network

play12:42

several hundred or even a thousand times to get a good sample.

play12:46

Here are some methods that accelerate the sampling

play12:48

to make diffusion models more practical.

play12:51

Let's first review the training objective of DDPM.

play12:54

We train our denoising network to predict the noise.

play12:57

If we look at this training objective and don't care about

play13:00

what the weighting term is, we just need to ensure that: 1)

play13:03

the forward diffusion model remains the same.

play13:06

2) The mean of the ground truth denoising step is a linear

play13:09

combination of the noisy image x_t and the added noise epsilon.

play13:14

3) The mean of the estimated denoising step is of the same form.

play13:18

These three constraints do NOT assume the transition

play13:20

probability to be a Markovian process.

play13:23

In this DDIM paper, they construct a class of non-Markovian

play13:27

diffusion processes and find a and b that satisfy these constraints.

play13:33

This gives us these two Gaussian distributions.

play13:36

Interestingly,

play13:38

the sigma_t can be set to arbitrary values.

play13:41

By setting them to zero, we get a deterministic generative process.

play13:45

The only randomness comes from the initial sample noise.

play13:49

Here is the quantitative evaluation.

play13:51

With 1000 denoising steps, the DDPM performs better.

play13:56

But when we reduce the number of denoising steps, the quality

play13:59

of the DDPM quickly degrades, while the results from DDIM remain decent.

play14:04

The best thing is that we don't need to retrain the model.

play14:06

We just need to take the model trained with the DDPM objective

play14:10

and accelerate it with DDIM sampler.

play14:14

But even with the DDIM sampler, it still requires quite a few steps.

play14:19

Let's further reduce the number of steps using distillation.

play14:22

We can use a pre-trained model as a teacher and teach a student

play14:26

denoising network to use one sampling step to reproduce the output

play14:30

of a teacher network using two sampling steps.

play14:33

So after this distillation process, we halve the number of sampling steps.

play14:37

We can ask the student model to be a new teacher model

play14:40

and repeat the process until we reach the target sampling steps.

play14:45

Another

play14:45

idea is to distill classifier-free guided diffusion

play14:48

that requires evaluating both conditional

play14:51

and unconditional models into one single model.

play14:55

We can further apply previous ideas to make it faster,

play14:58

such as progressive distillation and latent diffusion models.

play15:03

We can distill a pre-trained

play15:04

denoising network using consistency models.

play15:08

The main idea is to train a model so that for any points

play15:11

on the path, the model predicts the same origin.

play15:15

This supports single-step generation

play15:17

as well as multi-step generation.

play15:20

Applying consistency distillation in the latent space gives us

play15:23

high-resolution image generation using only a few steps.

play15:27

We can also extend the idea of latent consistency model with LoRA.

play15:31

But what's LoRA?

play15:32

Given a pretrained model,

play15:33

sometimes we want to generate contents of particular style,

play15:37

such as pixel art, LEGO, IKEA instructions, and anime.

play15:42

We can achieve these

play15:43

styles by fine-tuning the base model with additional data.

play15:46

However, this is computationally expensive and requires high storage.

play15:51

In many cases, it turns out we only need to fine-tune

play15:53

the cross-attention layers in the denoising network.

play15:57

We freeze the pre-trained weight W_0 and optimize the residual parameters,

play16:02

but there are still a lot of parameters to update.

play16:04

We can reduce the number of parameters

play16:06

using low rank approximation on the weight matrix.

play16:10

Therefore, we can accelerate the consistency distillation with LoRA.

play16:14

Here the acceleration vector is the parameter difference

play16:17

between the distilled and the base model.

play16:20

More interestingly, we can accelerate other fine-tune models

play16:23

by linearly combining the acceleration and the style vectors.

play16:28

Here are some examples.

play16:30

Another recent distillation method trains the student denoising model

play16:33

using score distillation.

play16:36

However,

play16:36

these predictions are usually blurry.

play16:39

Their main idea is to apply

play16:41

an adversarial loss by training a discriminator.

play16:45

Here are some results.

play16:48

After this distillation, these models can achieve text-to-image

play16:52

generation at interactive rate.

play17:01

To sum up, we covered the training objective

play17:03

for the diffusion models and their three interpretations.

play17:07

How we can guide the generation with classifier

play17:09

and classifier-free guidance.

play17:11

How we can synthesize high-resolution images

play17:14

using cascade, latent, and end-to-end diffusion models,

play17:18

and how we can speed up the sampling using DDIM

play17:20

sampler and various distillation techniques.

play17:24

Please comment below if you have any questions.

play17:27

Thanks for learning with me.

play17:28

I will see you next time.


Related Tags
Diffusion Models · Image Generation · AI Animation · 3D Modeling · Machine Learning · Deep Learning · Generative Models · AI Training · Conditional Generation · High-Resolution Images