Denoising Diffusion Probabilistic Models Code | DDPM Pytorch Implementation

ExplainingAI
2 Dec 2023 · 25:52

Summary

TLDR: This video tutorial delves into the implementation of diffusion models, specifically DDPM, with a focus on training and sampling. It covers the mathematical foundations of diffusion processes, the architecture of the latest diffusion models, and the creation of a noise scheduler. The tutorial also details the construction of the model, including the use of sinusoidal position embeddings and the encoder-decoder architecture with residual blocks and self-attention. Practical aspects such as dataset preparation, the training loop, and the sampling method are discussed, with examples from training on MNIST and texture image datasets.

Takeaways

  • 📊 The video covers the implementation of diffusion models, starting with DDPM and moving towards Stable Diffusion with text prompts.
  • 🔍 The architecture used in the latest diffusion models is implemented, rather than the original one used in DDPM.
  • 🧩 The video dives into the different blocks of the model architecture before coding, focusing on the training and sampling parts of DDPM.
  • 🖼️ The diffusion model is trained on grayscale and RGB images, with the specific math of diffusion models covered as a refresher.
  • ⏱️ The diffusion process involves a forward process that adds Gaussian noise to an image step by step, eventually making it equivalent to a sample of noise from a normal distribution.
  • 🔄 The reverse diffusion process requires the model to learn to predict the mean and variance of the reverse denoising distribution, aiming to minimize the KL Divergence between the ground-truth denoising distribution and the model's prediction.
  • 🎯 The training method involves sampling an image at a time step T, a noise sample, and feeding the model the noisy version of the image to learn the reverse process.
  • 🛠️ The noise scheduler is implemented to handle the forward process of adding noise and the reverse process of sampling from a learned distribution.
  • 🏗️ The model architecture for diffusion models is detailed, with a focus on the requirements for the input and output shape and the necessity of incorporating time step information.
  • 🔧 The video concludes with an overview of the training and sampling code, showcasing the results of training the diffusion model on MNIST and texture images.

Q & A

  • What is the main focus of the video?

    -The video focuses on the implementation of diffusion models, specifically creating a Denoising Diffusion Probabilistic Model (DDPM) and discussing its training and sampling process.

  • What are the key components of a diffusion model covered in the video?

    -The video covers the architecture used in the latest diffusion models, the specific math required for implementation, the forward and reverse processes, and the training method involving sampling and loss computation.

  • How is the noise schedule implemented in the video?

    -The noise schedule is implemented as a linear noise schedule where beta scales linearly from 1e-4 to 0.02 over a thousand time steps, and the alphas (1 - beta) and cumulative product terms are pre-computed for efficiency (a minimal sketch follows below).
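
Below is a minimal sketch of that pre-computation in PyTorch; the class and argument names (`LinearNoiseScheduler`, `num_timesteps`, `beta_start`, `beta_end`) are illustrative, not taken from the video's repository.

```python
import torch

class LinearNoiseScheduler:
    """Pre-computes the terms used by the forward and reverse diffusion equations."""

    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        # beta_t rises linearly from beta_start to beta_end over all time steps
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        # cumulative product terms (alpha-bar) and their square roots
        self.alpha_cum_prod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alpha_cum_prod = torch.sqrt(self.alpha_cum_prod)
        self.sqrt_one_minus_alpha_cum_prod = torch.sqrt(1.0 - self.alpha_cum_prod)
```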

  • What is the role of the reverse process in diffusion models as explained in the video?

    -The reverse process in diffusion models is about learning to predict the original noise from a noisy image by minimizing the KL Divergence between the ground truth distribution and the model's predicted distribution.

  • How is the time step information incorporated into the model in the video?

    -Time step information is incorporated by using a time embedding block that converts integer time steps into a vector representation, which is then fused into the model via a linear layer after activation.
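
For illustration, a rough sketch of that fusion in PyTorch follows; the module name `TimestepFusion` and its arguments are assumptions made for this example, not names from the video's code.

```python
import torch
import torch.nn as nn

class TimestepFusion(nn.Module):
    """Projects a time embedding and adds it onto a convolutional feature map."""

    def __init__(self, t_emb_dim: int, out_channels: int):
        super().__init__()
        # activation followed by a linear layer, as described in the video
        self.proj = nn.Sequential(nn.SiLU(), nn.Linear(t_emb_dim, out_channels))

    def forward(self, feat: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) output of the first conv layer; t_emb: (B, t_emb_dim)
        # the projected embedding is broadcast across the spatial dimensions
        return feat + self.proj(t_emb)[:, :, None, None]
```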

  • What architecture is used for the model in the video?

    -The model uses a U-Net architecture with downsampling blocks, mid blocks, and upsampling blocks, each containing ResNet blocks, self-attention blocks, and time step projection layers.

  • What is the significance of the sinusoidal position embedding in the video?

    -The sinusoidal position embedding is used to convert integer time steps into a fixed embedding space, which aids the model in predicting the original noise based on the current time step.
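
A possible sketch of such an embedding function (the function name and exact formulation are assumptions; the video's repository may differ in details):

```python
import torch

def sinusoidal_time_embedding(time_steps: torch.Tensor, t_emb_dim: int) -> torch.Tensor:
    """Maps a (B,) tensor of integer time steps to a (B, t_emb_dim) embedding."""
    assert t_emb_dim % 2 == 0, "embedding dimension should be even"
    half_dim = t_emb_dim // 2
    # frequencies the time step is divided by inside the sine and cosine functions
    factor = 10000 ** (torch.arange(half_dim, dtype=torch.float32) / half_dim)
    args = time_steps.float()[:, None] / factor[None, :]          # (B, half_dim)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, t_emb_dim)
```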

  • How is the training process of the DDPM described in the video?

    -The training process involves sampling an image at a random time step, adding noise based on the noise schedule, and then training the model to predict the original noise, using the mean squared error as the loss function.
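
As a hedged sketch, one training step could look like the following; `scheduler.add_noise` and the model's `(image, t)` call signature are assumptions mirroring the description above, not the repository's exact API.

```python
import torch
import torch.nn.functional as F

def train_step(model, scheduler, optimizer, images, num_timesteps=1000):
    """One DDPM training step: noise the batch, predict the noise, regress with MSE."""
    optimizer.zero_grad()
    noise = torch.randn_like(images)                      # the noise we will try to recover
    t = torch.randint(0, num_timesteps, (images.shape[0],), device=images.device)
    noisy_images = scheduler.add_noise(images, noise, t)  # forward process at random time steps
    noise_pred = model(noisy_images, t)                   # model predicts the added noise
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    return loss.item()
```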

  • What datasets are used for training the model as mentioned in the video?

    -The model is trained on the MNIST dataset for grayscale images and a dataset of texture images for RGB images.

  • How does the video demonstrate the sampling process from the learned model?

    -The video demonstrates the sampling process by starting with a noise sample and iteratively applying the reverse process using the model's noise predictions to gradually refine the image towards the original.
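
Sketched below under the same assumptions (a scheduler exposing a reverse-step method, here called `sample_prev_timestep`, that returns x_{t-1} and an estimate of x_0):

```python
import torch

@torch.no_grad()
def sample(model, scheduler, img_shape, num_timesteps=1000, device="cpu"):
    """Start from pure noise and iteratively denoise it with the trained model."""
    xt = torch.randn(img_shape, device=device)                  # x_T ~ N(0, I)
    for t in reversed(range(num_timesteps)):
        t_batch = torch.full((img_shape[0],), t, device=device, dtype=torch.long)
        noise_pred = model(xt, t_batch)                         # predicted noise at step t
        # one reverse step: x_{t-1} plus an x_0 estimate that can be saved for visualisation
        xt, x0_pred = scheduler.sample_prev_timestep(xt, noise_pred, t)
    return xt
```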

  • What are the computational requirements mentioned for training on larger images in the video?

    -The video mentions that training on larger images requires more patience and computational resources, suggesting the need for increased channels and longer training epochs for better results.

Outlines

00:00

🎥 Introduction to Implementing Diffusion Models

The speaker introduces the video's focus on implementing diffusion models, specifically starting with a DDPM (Denoising Diffusion Probabilistic Models) and later moving to stable diffusion with text prompts. The video aims to cover the training and sampling aspects of DDPM. The architecture implemented will be based on the latest diffusion models rather than the original DDPM. The speaker plans to delve into the different blocks of the architecture before coding and showcasing the results on grayscale and RGB images. A brief mention of the specific math required for implementation is made, suggesting viewers watch a linked diffusion math video for a deeper understanding. The diffusion process is explained as a forward process that adds Gaussian noise to an image step by step, eventually making it indistinguishable from pure noise. The reverse process is what the model learns, with the goal of predicting the mean and variance of the noise. The training method involves sampling an image at a time step T, adding noise, and feeding the noisy image to the model. The loss function is based on the mean squared difference between the predicted noise and the original noise sample.
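
To make the forward equation concrete, here is a sketch of an `add_noise` method that could sit on the scheduler sketched earlier (names are assumptions); it implements x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon.

```python
import torch

def add_noise(self, original, noise, t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    # gather pre-computed terms for the sampled time steps, reshaped to (B, 1, 1, 1)
    sqrt_a_bar = self.sqrt_alpha_cum_prod.to(original.device)[t].reshape(-1, 1, 1, 1)
    sqrt_one_minus_a_bar = self.sqrt_one_minus_alpha_cum_prod.to(original.device)[t].reshape(-1, 1, 1, 1)
    return sqrt_a_bar * original + sqrt_one_minus_a_bar * noise
```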

05:10

🔍 Deep Dive into the Noise Scheduler and Model Architecture

The speaker discusses the creation of a noise scheduler, which is crucial for the forward and reverse processes in diffusion models. The scheduler computes the noisy version of an image given an original image, noise sample, and time step. It also computes the mean and variance for the reverse process, using the reparameterization trick for sampling. The implementation of a linear noise scheduler is described, where beta values are linearly scaled. The model architecture for diffusion models is explored, emphasizing the need for an architecture that can incorporate time step information. The speaker chooses to use a UNet-like architecture, similar to what is used in stable diffusion models, to allow for code reuse in future videos. The time embedding block, which represents time steps as vectors for the model, is explained. The video then delves into the specifics of the UNet model, including downsampling blocks, mid blocks, and upsampling blocks, each with their respective components like normalization, activation, and convolutional layers. The importance of fusing time step information within the model is highlighted.
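
A sketch of the reverse step described here, written as a method on the scheduler from the earlier sketch; t is an integer time step, and the clamp on the x_0 estimate is an assumption for numerical stability rather than something stated in the video.

```python
import torch

def sample_prev_timestep(self, xt, noise_pred, t):
    """Reverse process: sample x_{t-1} given x_t and the model's noise prediction (t is an int).

    Assumes the pre-computed scheduler tensors live on the same device as xt.
    """
    # estimate of the original image, obtained by rearranging the forward equation
    x0 = (xt - self.sqrt_one_minus_alpha_cum_prod[t] * noise_pred) / self.sqrt_alpha_cum_prod[t]
    x0 = torch.clamp(x0, -1.0, 1.0)
    # mean of the learned reverse distribution
    mean = (xt - (self.betas[t] / self.sqrt_one_minus_alpha_cum_prod[t]) * noise_pred) / torch.sqrt(self.alphas[t])
    if t == 0:
        return mean, x0                             # no noise is added at the last step
    # variance matches the ground-truth denoising distribution conditioned on x_0
    variance = self.betas[t] * (1.0 - self.alpha_cum_prod[t - 1]) / (1.0 - self.alpha_cum_prod[t])
    z = torch.randn_like(xt)
    return mean + torch.sqrt(variance) * z, x0      # reparameterization trick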

10:11

🛠️ Coding the Diffusion Model's Components

The speaker begins coding the sinusoidal position embedding, which converts integer time steps into vector representations. This is followed by the implementation of the down block, which includes ResNet blocks, self-attention layers, and down-sampling layers. Each component's role in the model is explained, with a focus on how they contribute to the model's ability to process images at different resolutions. The code for the mid block and up block is also discussed, with the speaker noting that these blocks follow a similar structure to the down block but with up-sampling instead of down-sampling. The speaker emphasizes the configurability of the code, allowing for multiple layers and different configurations based on the model's requirements.
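
The description above can be condensed into a single-layer down block sketch. The block structure follows the video (ResNet block with time-embedding fusion, spatial self-attention, stride-2 downsampling), but the names, group counts, and kernel sizes are assumptions; channel counts are assumed divisible by the group-norm groups and the attention heads.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One ResNet + self-attention layer followed by a stride-2 downsampling conv."""

    def __init__(self, in_channels, out_channels, t_emb_dim, num_heads=4, down_sample=True):
        super().__init__()
        self.resnet_conv_first = nn.Sequential(
            nn.GroupNorm(8, in_channels), nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        self.t_emb_proj = nn.Sequential(nn.SiLU(), nn.Linear(t_emb_dim, out_channels))
        self.resnet_conv_second = nn.Sequential(
            nn.GroupNorm(8, out_channels), nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1))
        # 1x1 conv so the block input can be added to the second conv layer's output
        self.residual_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.attn_norm = nn.GroupNorm(8, out_channels)
        self.attn = nn.MultiheadAttention(out_channels, num_heads, batch_first=True)
        self.down = (nn.Conv2d(out_channels, out_channels, kernel_size=4, stride=2, padding=1)
                     if down_sample else nn.Identity())

    def forward(self, x, t_emb):
        # ResNet block: two conv blocks, time embedding added after the first
        out = self.resnet_conv_first(x)
        out = out + self.t_emb_proj(t_emb)[:, :, None, None]
        out = self.resnet_conv_second(out)
        out = out + self.residual_conv(x)
        # self-attention across the H*W spatial cells
        b, c, h, w = out.shape
        attn_in = self.attn_norm(out).reshape(b, c, h * w).transpose(1, 2)   # (B, HW, C)
        attn_out, _ = self.attn(attn_in, attn_in, attn_in)
        out = out + attn_out.transpose(1, 2).reshape(b, c, h, w)
        return self.down(out)
```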

15:12

🖼️ Training and Sampling the Diffusion Model

The speaker outlines the training process for the diffusion model, starting with the setup of the dataset, model, and training configurations. The training loop is described, where batches of images are sampled, noise is added according to the noise scheduler, and the model's predictions are used to compute the loss function. Backpropagation is then performed to update the model's weights. The sampling process is also explained, where a random noise sample is generated, and the model is used to iteratively predict the original image by reversing the diffusion process. The speaker mentions training on the MNIST dataset and a texture dataset, providing results for both and discussing the training time and computational requirements.

20:18

🔚 Conclusion and Future Outlook

In conclusion, the speaker summarizes the video's content, which included the implementation of a diffusion model, the creation of a noise scheduler, and the development of a UNet-based model architecture. The speaker reflects on the training and sampling processes, providing insights into the model's performance on different datasets. The video ends with a teaser for future content, which will cover stable diffusion models. The speaker thanks the viewers for watching and encourages subscription for more content.

Keywords

💡Diffusion Models

Diffusion models are a class of generative models used in machine learning, particularly for generating images. They simulate the process of gradually adding noise to an image over time and then learning to reverse this process to generate new images. In the video, the creator discusses implementing a diffusion model called DDPM (Denoising Diffusion Probabilistic Models) and later moving to Stable Diffusion with text prompts, indicating the broad application of diffusion models in image generation tasks.

💡DDPM

DDPM stands for Denoising Diffusion Probabilistic Models, a type of diffusion model that learns to generate images by gradually removing noise from a corrupted version. The video focuses on the implementation of DDPM, including its architecture and training process. The script mentions creating a DDPM model and delving into its different blocks, which are essential for understanding the model's functionality.

💡Stable Diffusion

Stable Diffusion is an advanced form of diffusion model that is capable of generating high-quality images. While the video primarily focuses on DDPM, the mention of moving to Stable Diffusion with text prompts in later videos suggests an evolution in the discussion towards more sophisticated models that can incorporate textual descriptions to generate images, showcasing the progression in the field.

💡Noise Schedule

A noise schedule in diffusion models refers to the predefined pattern in which noise is added to the data over time. The script describes the use of a linear noise schedule where beta values scale linearly from 1e-4 to 0.02 over a thousand time steps. This schedule is crucial for the forward process of the diffusion model, as it determines how the original image is gradually transformed into noise.

💡Time Step

In the context of diffusion models, a time step represents a point in the diffusion process where the image is at a certain level of noise. The video discusses the importance of time step information, which is used by the model to predict the original noise at each step. The script mentions the creation of a time embedding block that converts time steps into a vector representation, which is then used in the model.

💡Residual Connection

Residual connections are a feature of neural network architectures that ease training by letting a block learn a residual correction on top of an identity (or 1x1 convolution) shortcut, so the block's output is its input plus that learned correction. In the video script, residual connections are used in the ResNet blocks of the diffusion model to add the block's input to the output of the second convolutional layer, facilitating training and improving the model's ability to learn complex functions.
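
As a small illustrative sketch (names are hypothetical), the shortcut uses a 1x1 convolution when the channel count changes and an identity otherwise:

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Two conv layers whose output is added to a (possibly projected) copy of the input."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1))
        # 1x1 conv matches channel counts so the residual addition is valid
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + self.shortcut(x)
```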

💡Self-Attention

Self-attention is a mechanism in neural networks that allows the model to weigh the importance of different parts of the input data. The video script describes the use of self-attention layers in the diffusion model, which are used after the ResNet blocks. These layers help the model focus on different parts of the image at different time steps, which is crucial for the model's ability to generate images.
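
A minimal sketch of spatial self-attention over the H*W cells of a feature map, using nn.MultiheadAttention; the layer names and group count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention where each of the H*W spatial cells attends to every other cell."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # flatten the spatial grid so channels become the per-cell feature dimension
        seq = self.norm(x).reshape(b, c, h * w).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(seq, seq, seq)
        # restore the (B, C, H, W) layout and keep a residual connection
        return x + out.transpose(1, 2).reshape(b, c, h, w)
```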

💡Reparameterization Trick

The reparameterization trick is a technique used in variational autoencoders and diffusion models to enable gradient descent by separating the random noise from the deterministic part of the model. In the script, the reparameterization trick is used in the reverse process of the diffusion model to sample from the predicted distribution, allowing the model to generate images by reversing the noise addition process.
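
In code, the trick amounts to sampling a standard normal and scaling/shifting it by the predicted statistics; a minimal sketch (function name is illustrative):

```python
import torch

def reparameterized_sample(mean: torch.Tensor, variance: torch.Tensor) -> torch.Tensor:
    """Draws from N(mean, variance) while keeping gradients w.r.t. mean and variance."""
    z = torch.randn_like(mean)          # noise is sampled outside the computation graph
    return mean + torch.sqrt(variance) * z
```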

💡MNIST Dataset

The MNIST dataset is a large database of handwritten digits widely used for training and testing in the field of machine learning. The video script mentions training the diffusion model on the MNIST dataset, which is a common practice for testing generative models. The use of MNIST in the video demonstrates the model's ability to learn and generate images of handwritten digits, a fundamental task in image generation.

💡RGB Images

RGB stands for Red, Green, and Blue, the three primary colors used in digital imaging to represent color images. The video script discusses training the diffusion model on grayscale and RGB images, indicating the model's versatility in handling different types of image data. The mention of RGB images in the script suggests that the model is capable of generating more complex and colorful images, which is an important aspect of its applicability.

Highlights

Introduction to implementing diffusion models with a focus on DDPM and future coverage of stable diffusion.

Explanation of the architecture used in the latest diffusion models, differing from the original DDPM.

Deep dive into the different blocks of the model architecture before coding.

Overview of the diffusion process involving a forward process to create noisier image versions.

Discussion on the scheduled noise and its role in the transition function for image transformation.

Insight into the reverse diffusion process and its functional form, aiming to learn the mean and variance of the distribution.

Derivation of the training method involving sampling an image at a time step and a noise sample.

Introduction to the noise scheduler and its role in the forward and reverse processes.

Implementation of a linear noise scheduler with pre-computed alphas and cumulative products for efficiency.

Explanation of the model requirements for diffusion models and the flexibility in architecture choice.

Details on fusing time step information into the model to aid in predicting original noise.

Introduction to the time embedding block for representing time steps in the model.

Description of the encoder-decoder architecture used in the model, including downsampling, mid, and upsampling blocks.

Discussion on the specifics of the down block, including ResNet and self-attention blocks.

Explanation of the up block, which is similar to the down block but includes upsampling instead of downsampling.

Overview of the training loop, including the process of adding noise to images and backpropagation based on the loss.

Description of the sampling process, which involves reversing the time steps and using the model's noise prediction.

Presentation of the results from training on MNIST and texture image datasets, showcasing the model's performance.

Conclusion summarizing the implementation of DDPM and the understanding gained about diffusion models.

Transcripts

play00:00

in this video I'll cover the

play00:01

implementation of diffusion models we'll

play00:04

create ddpm for now and in later videos

play00:06

move to stable diffusion with text prompts

play00:09

in this one we'll be implementing the

play00:10

training and sampling part for ddpm for

play00:13

our model we'll actually implement the

play00:15

architecture that is used in latest

play00:16

diffusion models rather than the one

play00:18

originally used in ddpm we'll dive deep

play00:21

into the different blocks in it before

play00:23

finally putting everything in code and

play00:25

see results of training this diffusion

play00:26

model on grayscale and RGB

play00:28

images

play00:30

I'll cover the specific math of

play00:32

diffusion models that we need for

play00:33

implementation very quickly in the next

play00:35

few minutes but this should only act as

play00:37

a refresher so if you're not aware of it

play00:39

and are interested in knowing it I would

play00:41

suggest to first see my diffusion math

play00:43

video that's linked

play00:45

above the entire diffusion process

play00:48

involves a forward process where we take

play00:49

an image and create noisier versions of

play00:51

it step by step by adding Gaussian noise

play00:55

after a large number of steps it becomes

play00:56

equivalent to a sample of noise from a

play00:58

normal distribution we do this by

play01:00

applying this transition function at

play01:02

every time step T and beta is a

play01:04

scheduled noise which we add to the

play01:06

image at T minus one to get the image at

play01:09

T we saw that having Alpha as 1 minus

play01:12

beta and Computing cumulative products

play01:15

of these Alphas at time T allows us to

play01:17

jump from original image to noisy image

play01:19

at any time step T in the forward

play01:23

process we then have a model learn the

play01:25

reverse process distribution and because

play01:27

the reverse diffusion process has the

play01:29

same functional form as the forward

play01:31

process which here is a Gaussian we

play01:33

essentially want the model to learn to

play01:35

predict its mean and

play01:37

variance after going through a lot of

play01:40

derivation from the initial goal of

play01:41

optimizing the log likelihood of The

play01:43

observed data we ended with the

play01:45

requirement to minimize the KL Divergence

play01:47

between the ground truth denoising

play01:49

distribution conditioned on x0 which we

play01:52

computed as having this mean and this

play01:53

variance and the distribution predicted

play01:55

by our model we fix the variance to be

play01:58

exactly same as the target distribution

play01:59

and rewrite the mean in the same

play02:02

form after this minimizing KL Divergence

play02:05

ends up being minimizing square of

play02:06

difference between the noise predicted

play02:08

and the original noise

play02:11

sample our Training Method then involves

play02:14

sampling an image time step T and A

play02:16

noise sample and feeding the model the

play02:18

noisy version of this image at sample

play02:20

time step T using this equation the

play02:23

cumulative product terms needs to be

play02:25

coming from the noise scheduler which

play02:27

decides the schedule of noise added as

play02:29

we move along time steps and loss

play02:31

becomes the MSE between the original

play02:33

noise and whatever the model

play02:36

predicts for generating images we just

play02:39

sample from a learned reverse

play02:40

distribution starting from a noise

play02:42

sample XT from a normal distribution and

play02:45

then Computing the mean using the same

play02:47

formulation just in terms of XT and

play02:49

noise prediction and variance is same as

play02:51

the ground truth denoising distribution

play02:53

conditioned on x0 then we get a sample

play02:56

from this reverse distribution using the

play02:57

reparameterization trick and repeating

play03:00

this gets us to x0 and for x0 we don't

play03:03

add any noise and simply return the

play03:06

mean this was a very quick overview and

play03:08

I had to skim through a lot for a

play03:10

detailed version of this I would

play03:12

encourage you to look at the previous

play03:13

diffusion

play03:15

video so for implementation we saw that

play03:17

we need to do some computation for the

play03:19

forward and the reverse process so we'll

play03:22

create a noise scheduler which will do

play03:24

these two things for us for the forward

play03:26

process given an image and a noise

play03:28

sample and time step t it will return

play03:30

the noisy version of this image using

play03:32

the forward equation and in order to do

play03:34

this efficiently it will store the

play03:36

alphas which is just 1 minus beta and

play03:39

the cumulative product terms of alpha

play03:41

for all

play03:42

T the authors use a linear noise scheduler

play03:45

where they linearly scale beta from

play03:47

1e-4 to 0.02 with a thousand time steps

play03:50

between them and we'll also do the

play03:53

same the second responsibility that this

play03:56

scheduler will do is given an XT and noise

play03:58

prediction from model it'll give us XT

play04:00

minus one by sampling from the reverse

play04:04

distribution for this it'll compute the

play04:07

mean and variance according to their

play04:08

respective equations and return a sample

play04:11

from this distribution using the

play04:12

reparameterization

play04:14

trick to do this we also store 1 minus

play04:16

Alpha T 1 minus the cumulative product

play04:19

terms and its square root obviously we

play04:22

can compute all of this at runtime as

play04:24

well but pre-computing them simplifies

play04:26

the code for the equation a lot so let's

play04:28

implement the noise scheduler

play04:35

first as I mentioned we'll be creating a

play04:37

linear noise

play04:42

schedule after initializing all the

play04:44

parameters from the arguments of this

play04:45

class we'll create betas to linearly

play04:48

increase from start to end such that we

play04:50

have beta T from zero till the last time

play04:53

step we'll then initialize all the

play04:55

variables that we need for forward and

play04:57

reverse process

play04:58

equations

play05:10

the add_noise method is our forward

play05:12

process so it will take in an image

play05:14

original noise sample and time step T

play05:17

the images and noise will be of B Cross

play05:19

C cross H cross W and time step will be

play05:21

a 1D tensor of size

play05:25

B for the forward process we need the

play05:28

square root of cumulative product terms

play05:29

for the given time steps and 1 minus

play05:32

that and then we reshape them so that

play05:34

they are B cross 1 cross 1 cross

play05:37

1 lastly we apply the forward process

play05:47

equation the second function will be the

play05:49

guide that takes the image XT and gives

play05:51

us a sample from our learned reverse

play05:54

distribution for that we'll have it

play05:56

receive XT and noise prediction from the

play05:57

model and time step t as the argument

play06:00

we'll be saving the original image

play06:02

prediction x0 for visualizations and get

play06:04

that using this equation this can be

play06:07

obtained using the same equation for

play06:08

forward process that takes from x0 to XT

play06:11

by just rearranging the terms and using

play06:13

noise prediction instead of the actual

play06:15

noise then for sampling we'll compute

play06:18

the mean which is simply this

play06:27

equation and as mentioned T equals 0 we

play06:30

simply return the mean and noise is only

play06:32

added for other time steps the variance

play06:35

of that is same as the variance of

play06:37

ground truth denoising distribution

play06:39

condition on X zero which was

play06:52

this and lastly we'll sample from a

play06:54

Gaussian distribution with this mean and

play06:55

variance using the reparameterization

play06:58

trick this completes the entire noise

play07:00

scheduler which handles the forward

play07:01

process of adding noise and the reverse

play07:03

process of sampling first let's now get

play07:06

into the

play07:08

model for diffusion models we are

play07:10

actually free to use whatever

play07:11

architecture we want as long as we meet

play07:14

two

play07:15

requirements the first being that the

play07:17

shape of the input and output must be

play07:18

same and the other is some mechanism to

play07:21

fuse in time step information let's talk

play07:23

about why for a bit the information of

play07:26

what time step we are at is always

play07:28

available to us whether we are at

play07:29

training or sampling and in fact knowing

play07:32

what time step we are at would Aid the

play07:34

model in predicting original noise

play07:36

because we are providing the information

play07:38

that how much of that input image

play07:39

actually is noise so instead of just

play07:42

giving the model an image we also give

play07:43

the time step that we are

play07:45

at for the model I'll use a U-Net which is

play07:48

also what the authors use but for the

play07:50

exact specification of the blocks

play07:52

activations normalizations and

play07:54

everything else I'll mimic the stable

play07:56

diffusion U-Net used by hugging face in

play07:57

the diffusers pipeline that's because I

play08:00

plan to soon create a video on stable

play08:01

diffusion so that will allow me to reuse

play08:03

a lot of code that I'll create now

play08:05

actually even before going into the U-Net

play08:07

model let's first see how the time step

play08:09

information is

play08:11

represented let's call this the time

play08:13

embedding block which will take in a 1D

play08:15

tensor of time steps of size B which is

play08:17

batch size and give us a t_emb_dim

play08:21

sized representation for each of those

play08:22

time steps in the

play08:25

batch the time embedding block would

play08:27

first convert the integer time steps

play08:29

into some vector representation using

play08:30

an embedding

play08:32

space that will then be fed to two

play08:34

linear layers separated by activation to

play08:36

give us our final time step

play08:38

representation for the embedding space

play08:41

the authors use the sinusoidal position

play08:42

embedding used in

play08:44

Transformers for activations everywhere

play08:46

I have used sigmoid linear units but you

play08:49

can choose a different one as

play08:50

well okay now let's get into the model

play08:54

as I mentioned I'll be using a U-Net just

play08:55

like the authors which is essentially

play08:57

this encoder decoder architecture

play08:59

where encoder is a series of

play09:01

downsampling blocks where each block

play09:03

reduces the size of the input typically

play09:04

by half and increases the number of

play09:07

channels the output of final down

play09:09

sampling block is passed to layers of

play09:11

mid block which all work at the same

play09:13

spatial resolution and after that we

play09:15

have a series of upsampling

play09:18

blocks these one by one increase the

play09:20

spatial size and reduce the number of

play09:22

channels to ultimately match the input

play09:24

size of the model the upsampling blocks

play09:26

also fuse in the output coming from the

play09:28

corresponding down sampling block at the

play09:30

same resolution via residual skip

play09:33

connections most of the diffusion models

play09:35

usually follow this U-Net architecture

play09:37

but differ based on specifications

play09:39

happening inside the blocks and as I

play09:41

mentioned for this video I've tried to

play09:43

mimic to some extent what's happening

play09:45

inside the stable diffusion U-Net from

play09:46

hugging

play09:48

face let's look closely into the down

play09:50

block and once we understand that the

play09:52

rest are pretty easy to

play09:54

follow down blocks of almost all the

play09:56

variations would be a ResNet block

play09:58

followed by a self attention block and

play10:00

then a down sample layer for our ResNet

play10:03

plus self attention block we'll have

play10:05

group Norm followed by activation

play10:07

followed by a convolutional layer the

play10:09

output of this will again be passed to a

play10:11

normalization activation and

play10:13

convolutional layer we add a residual

play10:15

connection from the input of first

play10:17

normalization layer to the output of

play10:19

second convolutional

play10:20

layer this entire thing is what will be

play10:23

called a ResNet block which you can

play10:25

think of as two convolutional blocks

play10:26

plus residual connection this is Then

play10:29

followed by A normalization and A Self

play10:31

attention layer and again residual

play10:34

connection we have multiple such ResNet

play10:36

plus self attention layers but for

play10:38

Simplicity our current implementation

play10:40

will only have one layer the code on the

play10:42

repo however will be configurable to

play10:44

make as many layers as

play10:46

desired we also need to fuse the time

play10:48

information and the way it's done is

play10:51

that each ResNet block has an

play10:52

activation followed by a linear

play10:54

layer and we pass the time embedding

play10:57

representations through them first

play10:59

before adding to the output of the first

play11:00

convolutional layer so essentially this

play11:03

linear layer is projecting the

play11:05

t_emb_dim time step representation to a

play11:08

tensor of same size as the channels in

play11:10

the convolutional layers output that way

play11:13

these two can be added by replicating

play11:14

this time step representation across the

play11:16

spatial

play11:18

Dimension now that we have seen the

play11:20

details inside the block to simplify

play11:22

let's replace everything within this

play11:24

part as a ResNet block and within this

play11:26

as a self attention block

play11:30

the other two blocks are using the same

play11:32

components and just slightly different

play11:34

let's go back to our previous

play11:35

illustration of all three

play11:38

blocks we saw that down block is just

play11:41

multiple layers of ResNet followed by

play11:42

self attention and lastly we have a down

play11:45

sampling layer up block is exactly the

play11:48

same except that it first upsamples the

play11:50

input to twice the spatial size and then

play11:53

concatenates the down block output of

play11:55

the same spatial resolution across the

play11:57

channel dimension post that it's the

play11:59

same layers of ResNet and self

play12:01

attention blocks the layers of mid block

play12:04

always maintain the input to the same

play12:05

spatial resolution the hugging face

play12:08

version has first one ResNet block and

play12:10

Then followed by layers of self

play12:12

attention and ResNet so I also went

play12:14

ahead and made the same

play12:16

implementation and let's not forget the

play12:18

time step information for each of these

play12:21

ResNet blocks we have a time step

play12:22

projection layer this was what we just

play12:25

saw an activation followed by a linear

play12:27

layer the existing time step

play12:29

representation goes through these blocks

play12:31

before being added to the output of

play12:33

first convolution layer of the ResNet

play12:35

block let's see how all of this looks in

play12:40

code the first thing we'll do is

play12:42

implement the sinusoidal position

play12:44

embedding code this function receives B

play12:46

sized 1D tensor time steps where B is

play12:49

the batch size and is expected to return B

play12:51

cross t_emb_dim tensor we first

play12:55

implement the factor part which is

play12:57

everything that the position which here

play12:59

is the time step integer value will be

play13:01

divided with inside the sine and cosine

play13:04

functions this will get us all values

play13:06

from 0 to half of the time embedding

play13:08

Dimension size half because we'll

play13:10

concatenate sine and

play13:13

cosine after replicating the time step

play13:15

values we get our desired shape tensor

play13:17

and divided by the factor that we

play13:19

computed this is now exactly the

play13:21

arguments for which we have to call the

play13:23

sine and cosine function again all this

play13:25

method does is convert the integer time

play13:27

step representation embeddings using a

play13:29

fixed embedding

play13:31

space now we'll be implementing the down

play13:33

block but before that let's quickly take

play13:35

a peek at what layers we need to

play13:37

implement so we need layers of ResNet

play13:40

plus self attention blocks ResNet will be

play13:42

two Norm activation convolutional layers

play13:44

with residual and self attention will be

play13:46

Norm followed by self attention we also

play13:49

need the time projection layers which

play13:51

will project the time embedding onto the

play13:53

same Dimension as the number of channels

play13:55

in the output of first convolution

play13:56

feature map I'll only implement the

play13:58

block to have one layer for now and

play14:00

we'll only need single instances of

play14:02

these and after ResNet and self

play14:04

attention we have a down

play14:05

sampling okay back to coding

play14:08

it for each down block we'll have these

play14:10

arguments in_channels is the number

play14:13

of channels expected in input out

play14:15

underscore channels is the channels we

play14:16

want in the output of this down block

play14:19

then we have the embedding Dimension I

play14:21

also add down sample argument just so

play14:24

that we have the flexibility to ignore

play14:25

the down sampling part in the

play14:27

code lastly num underscore heads is the

play14:30

number of heads that our attention block

play14:31

will

play14:32

have this is our first convolution block

play14:35

of resnet we make the channel conversion

play14:37

from input to Output channels via the

play14:38

first conv layer itself so after this

play14:41

everything will have out_channels

play14:42

as the number of

play14:44

channels then these are the time

play14:46

projection layers for this resonet block

play14:48

remember each resonet block will have

play14:50

one of these and we had seen that this

play14:52

was just activation followed by linear

play14:54

layer the output of this linear layer

play14:56

should have out_channels so that

play14:58

we can do the

play14:59

addition this is the second conv block

play15:01

which will be exactly same except

play15:03

everything operating on out underscore

play15:04

channels as the channel

play15:09

Dimension and then we add the attention

play15:12

part the normalization and multi-head

play15:13

attention the feature dimension for

play15:15

multi-head attention will be same as the

play15:17

number of

play15:18

channels this residual connection is 1

play15:21

cross 1 conv layer and this ensures

play15:23

that the input to the entire ResNet block

play15:25

can be added to the output of the last

play15:26

conv layers and since the input was in

play15:29

underscore channels you have to first

play15:31

transform it to out underscore channels

play15:32

so this just does

play15:34

that and finally we have the down sample

play15:36

layer which can also be average pooling

play15:38

but I've used convolution with stride two

play15:40

and if the arguments convey to not down

play15:42

sample then this is just

play15:46

identity the forward method will be very

play15:48

simple we first pass the input to the

play15:49

first conv

play15:53

block and then add the time

play15:56

information and then after going

play15:58

through the second conv block we add the

play16:00

residual but only after passing through

play16:02

the 1 cross 1 conv

play16:04

layer attention will happen between all

play16:06

the spatial H * W cells with out

play16:09

underscore channels being the feature

play16:11

dimensionality of each of those

play16:13

cells so the transpose just ensures that

play16:16

the channel features are the last

play16:18

Dimension and after the channel

play16:20

Dimension has been enriched with self

play16:21

attention representation we do the

play16:23

transpose back and again have the

play16:25

residual

play16:26

connection if we would be having multi

play16:28

layers then we would Loop over this

play16:30

entire thing but since we are only

play16:32

implementing one layer for now we'll

play16:33

just call the down sampling convolution

play16:35

after

play16:38

this next up is mid block and again

play16:41

let's revisit the illustration for

play16:44

this for Mid block we'll have a ResNet

play16:46

block and then layers of self attention

play16:48

followed by

play16:49

ResNet same as down block we'll only

play16:52

Implement one layer for

play16:57

now

play17:01

the code for midblock will have same

play17:02

kind of layers but we need two instances

play17:05

of every layer that belongs to the ResNet

play17:06

block so let's just put all of that

play17:27

in

play17:35

the forward method will have just one

play17:37

difference that is we call the first

play17:39

ResNet block and then self attention

play17:41

and second ResNet

play17:55

block had we implemented multiple layers

play17:57

the self attention and the following

play17:59

ResNet block would have a

play18:01

loop now let's do up block which will be

play18:05

exactly same as down block except that

play18:07

instead of down sampling we'll have a

play18:08

upsampling

play18:14

layer we'll use conv transpose to do the

play18:16

upsampling for

play18:24

us in the forward method let's first

play18:26

copy everything that we did for down

play18:28

block

play18:29

then we need to make three changes add

play18:31

the same spatial resolutions down block

play18:34

output as

play18:36

argument then before resonet plus self

play18:38

attention blocks we'll upsample the

play18:39

input and concat the corresponding down

play18:42

block output another way to implement

play18:44

this could be to First concat followed

play18:46

by ResNet and self attention and then

play18:48

upsample but I went with this

play18:53

one finally we'll build our U-Net class

play18:56

it will receive the channels and input

play18:57

image as argument we'll hard code the

play19:00

down channels and mid channels for

play19:03

now the way the code is implemented is

play19:05

that these four values of down channels

play19:07

will essentially be converted into three

play19:09

down blocks each taking input of Channel

play19:11

I dimensions and converting it to Output

play19:14

of Channel i+ 1

play19:15

dimensions and same for the mid

play19:19

blocks this is just the down sample

play19:21

arguments that we are going to pass to

play19:22

the

play19:23

blocks remember our time embedding block

play19:26

had position embedding followed by

play19:27

linear layers with activation in between

play19:29

these are those two linear

play19:31

layers this is different from the time

play19:33

step layers which we had for each

play19:35

ResNet block this will only be called

play19:37

once in an entire forward pass right at

play19:40

the start to get initial time step

play19:43

representation we'll also first have to

play19:45

convert the input to have the same

play19:46

channel Dimensions as the input of first

play19:48

down block and this convolution will

play19:50

just do that for us we then create the

play19:53

down blocks mid blocks and up blocks

play19:55

based on the number of channels

play19:57

provided

play20:08

for the last up block I simply hardcode

play20:10

the output Channel as

play20:17

16 the output of last up block under

play20:20

goes a normalization and convolution to

play20:22

get us to the same number of channels as

play20:23

the input

play20:25

image we'll be training on mnist data

play20:27

set so the number of channels in the

play20:29

input image would be one in the forward

play20:31

method we first call the conv_in

play20:33

layer and then get the time step

play20:35

representation by calling the sinusoidal

play20:37

position embedding followed by our

play20:38

linear

play20:42

layers then we just call the down blocks

play20:45

and we keep saving the output of down

play20:46

blocks because we need it as input for

play20:48

the up

play20:51

block during up block calls we simply

play20:54

take down outputs from that list one by

play20:56

one and pass that together with the

play20:58

current

play20:59

output and then we call our

play21:01

normalization activation and output

play21:06

convolution once we pass a 4 cross 1

play21:09

cross 28 cross 28 input tensor to this

play21:11

we get the following output

play21:13

shapes so you can see because we had

play21:15

down sampled only twice our smallest

play21:17

size input to any convolution layer is 7

play21:20

cross

play21:21

7 the code on the repo is much more

play21:24

configurable and creates these blocks

play21:25

based on whatever configuration is

play21:27

passed and can create multiple layers as

play21:29

well we'll look at a sample config file

play21:31

later but first let's take a brief look

play21:33

at the data set training and sampling

play21:37

code the data set class is very simple

play21:40

it just takes in the path where the

play21:41

images are and then stores the file name

play21:43

of all those images in

play21:46

there right now we are building

play21:47

unconditional diffusion model so we

play21:49

don't really use the

play21:52

labels then we simply load the images

play21:54

and convert it to tensor and we also

play21:56

scale it from minus1 to 1 just like the

play21:58

authors so that our model consistently

play22:00

sees similarly scaled images as compared

play22:02

to the random

play22:04

noise moving to the train ddpm file where

play22:07

the train function loads up the config

play22:09

and gets the model data set diffusion

play22:11

and training configurations from it we

play22:14

then instantiate the noise scheduler data

play22:16

set and our

play22:19

model after setting up the optimizer and

play22:22

the loss functions we run our training

play22:26

Loop

play22:29

here we take our image batch sample

play22:31

random noise of shape B cross 1 cross H

play22:33

cross W and sample random time steps the

play22:37

scheduler adds noise to these batch

play22:38

images based on the sample time steps

play22:41

and we then back propagate based on the

play22:42

loss between noise prediction by a model

play22:45

and the actual noise that we

play22:48

added for sampling similar to training

play22:51

we load the config and necessary

play22:52

parameters our model and noise

play22:56

scheduler the sample method then creates a

play22:59

random noise sample based on number of

play23:01

images requested and then we go through

play23:03

the time steps in Reverse for each time

play23:06

step we get our models noise prediction

play23:08

and call the reverse process of

play23:09

scheduler that we had created with this

play23:12

XT and noise prediction and then it

play23:14

returns the mean of XT minus one and

play23:16

estimate of the original image we can

play23:19

choose to either save one of these to

play23:20

see the progress of

play23:23

sampling now let's also take a look at

play23:25

our config file this just has the data

play23:28

set parameters which stores our image

play23:31

path model params which stores

play23:33

parameters necessary to create model

play23:35

like the number of channels down

play23:36

channels and so on like I had mentioned

play23:38

we can put in the number of layers

play23:40

required in each of our down mid and up

play23:42

blocks and finally we specify the

play23:44

training

play23:45

parameters the U-Net class in the repo

play23:48

has blocks which actually read this

play23:49

config and create model based on

play23:51

whatever configuration is provided it

play23:54

does everything similar to what we just

play23:55

implemented except that it Loops over

play23:57

the number of layers as

play24:03

well and I've also added shapes of the

play24:05

output that we would get at each of

play24:07

those block calls so that it helps a bit

play24:09

in understanding

play24:10

everything for training as I mentioned I

play24:13

train on mnist but in order to see if

play24:15

everything works for RGB images I also

play24:17

train on this data set of texture images

play24:19

because I already have it downloaded

play24:21

since my video on implementing di there

play24:23

is a sample of images from this data set

play24:26

these are not generated these are images

play24:27

from the data set

play24:29

itself though the data set has 256 cross

play24:31

256 images I resized the images to be 28

play24:35

cross 28 primarily because I lack two

play24:37

important things for training on larger

play24:39

sized images patience and compute rather

play24:42

cheap

play24:42

compute for mnist I train it for about

play24:45

20 epochs taking 40 minutes on a V100 GPU and

play24:49

for this texture data set I train for

play24:51

about 60 epochs taking roughly about 3

play24:53

hours and that gives me these

play24:56

results

play24:58

here I'm saving the original image

play25:00

prediction at each time step and you can

play25:02

see that because MNIST images are all

play25:04

similar looking the model pretty quickly

play25:06

gets a decent original image prediction

play25:09

whereas for the texture data set it

play25:11

doesn't till about the last 200-300 time

play25:14

steps but by the end of all the steps we

play25:17

get decent results for both the data

play25:18

sets you can obviously train it on a

play25:20

larger size data set though probably you

play25:22

would have to maybe increase the

play25:23

channels and maybe train for longer

play25:25

epochs to get nice results

play25:28

so that's all that I wanted to cover for

play25:30

implementing ddpm we went through

play25:32

scheduler implementation U-Net

play25:34

implementation and saw how everything

play25:36

comes together in the training and

play25:37

sampling code hopefully it gives you a

play25:40

better understanding of diffusion models

play25:42

and thank you so much for watching this

play25:44

video and if you're liking the content

play25:45

and getting benefit from it do subscribe

play25:47

the channel see you in the next

play25:50

video


Related Tags
Diffusion Models · Image Generation · DDPM · Stable Diffusion · Machine Learning · Deep Learning · AI Implementation · Coding Tutorial · MNIST Dataset · RGB Images