Denoising Diffusion Probabilistic Models Code | DDPM Pytorch Implementation
Summary
TLDRThis video tutorial delves into the implementation of diffusion models, specifically DDPM, with a focus on training and sampling. It covers the mathematical foundations of diffusion processes, the architecture of the latest diffusion models, and the creation of a noise scheduler. The tutorial also details the construction of the model, including the use of sinusoidal position embeddings and the encoder-decoder architecture with residual blocks and self-attention. Practical aspects such as dataset preparation, training loop, and sampling method are discussed, with examples from training on MNIST and texture image datasets.
Takeaways
- 📊 The video covers the implementation of diffusion models, starting with DDPM and moving towards Stable Diffusion with text prompts.
- 🔍 The architecture used in the latest diffusion models is implemented, rather than the original one used in DDPM.
- 🧩 The video dives into the different blocks of the model architecture before coding, focusing on the training and sampling parts of DDPM.
- 🖼️ The diffusion model is trained on grayscale and RGB images, with the specific math of diffusion models covered as a refresher.
- ⏱️ The diffusion process involves a forward process that adds Gaussian noise to an image step by step, eventually making it equivalent to a sample of noise from a normal distribution.
- 🔄 The reverse diffusion process requires the model to learn the mean and variance of the denoising distribution, aiming to minimize the KL divergence between the ground truth and the model's prediction.
- 🎯 The training method involves sampling an image at a time step T, a noise sample, and feeding the model the noisy version of the image to learn the reverse process.
- 🛠️ The noise scheduler is implemented to handle the forward process of adding noise and the reverse process of sampling from a learned distribution.
- 🏗️ The model architecture for diffusion models is detailed, with a focus on the requirements for the input and output shape and the necessity of incorporating time step information.
- 🔧 The video concludes with an overview of the training and sampling code, showcasing the results of training the diffusion model on MNIST and texture images.
Q & A
What is the main focus of the video?
-The video focuses on the implementation of diffusion models, specifically creating a Denoising Diffusion Probabilistic Model (DDPM) and discussing its training and sampling process.
What are the key components of a diffusion model covered in the video?
-The video covers the architecture used in the latest diffusion models, the specific math required for implementation, the forward and reverse processes, and the training method involving sampling and loss computation.
How is the noise schedule implemented in the video?
-The noise schedule is implemented as a linear schedule where beta linearly scales from 1e-4 to 0.02 over a thousand time steps, and the alphas (1 - beta) and their cumulative product terms are pre-computed for efficiency.
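A minimal PyTorch sketch of the linear schedule and forward process described above (class and variable names are my own, not taken from the video's code):

```python
import torch

class LinearNoiseScheduler:
    """Linear beta schedule with pre-computed cumulative product terms."""
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_cum_prod = torch.cumprod(self.alphas, dim=0)
        # pre-compute square roots used in the forward equation
        self.sqrt_alpha_cum_prod = torch.sqrt(self.alpha_cum_prod)
        self.sqrt_one_minus_alpha_cum_prod = torch.sqrt(1.0 - self.alpha_cum_prod)

    def add_noise(self, original, noise, t):
        # x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
        s1 = self.sqrt_alpha_cum_prod[t].reshape(-1, 1, 1, 1)
        s2 = self.sqrt_one_minus_alpha_cum_prod[t].reshape(-1, 1, 1, 1)
        return s1 * original + s2 * noise
```

The reshape to B x 1 x 1 x 1 lets the per-sample schedule terms broadcast over a B x C x H x W image batch.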
What is the role of the reverse process in diffusion models as explained in the video?
-The reverse process in diffusion models is about learning to predict the original noise from a noisy image by minimizing the KL Divergence between the ground truth distribution and the model's predicted distribution.
How is the time step information incorporated into the model in the video?
-Time step information is incorporated by using a time embedding block that converts integer time steps into a vector representation, which is then fused into the model via a linear layer after activation.
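A small illustration of that fusion step: the time embedding is passed through an activation and a linear layer, then broadcast-added to a convolutional feature map. The dimensions here are illustrative assumptions, not values from the video:

```python
import torch
import torch.nn as nn

t_emb_dim, out_channels = 128, 64  # assumed sizes for illustration
t_emb_layers = nn.Sequential(nn.SiLU(), nn.Linear(t_emb_dim, out_channels))

feat = torch.randn(4, out_channels, 28, 28)  # output of the first conv layer
t_emb = torch.randn(4, t_emb_dim)            # per-sample time embedding
# project to the channel count, then broadcast across spatial dimensions
feat = feat + t_emb_layers(t_emb)[:, :, None, None]
```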
What architecture is used for the model in the video?
-The model uses a U-Net architecture with downsampling blocks, mid blocks, and upsampling blocks, each containing ResNet blocks, self-attention blocks, and time step projection layers.
What is the significance of the sinusoidal position embedding in the video?
-The sinusoidal position embedding is used to convert integer time steps into a fixed embedding space, which aids the model in predicting the original noise based on the current time step.
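A sketch of the sinusoidal embedding as described (the function name is mine; the formula follows the Transformer-style embedding the video references):

```python
import torch

def get_time_embedding(time_steps, t_emb_dim):
    """Map a (B,) tensor of integer time steps to a (B, t_emb_dim) embedding."""
    half_dim = t_emb_dim // 2
    # the divisor inside the sin/cos arguments
    factor = 10000 ** (torch.arange(half_dim, dtype=torch.float32) / half_dim)
    args = time_steps[:, None].float() / factor[None, :]
    # concatenate sine and cosine halves
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```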
How is the training process of the DDPM described in the video?
-The training process involves sampling an image at a random time step, adding noise based on the noise schedule, and then training the model to predict the original noise, using the mean squared error as the loss function.
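A condensed sketch of one training step under those assumptions; `model` and `scheduler` stand in for the UNet and noise scheduler discussed in the video:

```python
import torch
import torch.nn as nn

def train_step(model, scheduler, optimizer, images, num_timesteps=1000):
    """One DDPM training step: noise a batch, predict the noise, regress MSE."""
    noise = torch.randn_like(images)
    t = torch.randint(0, num_timesteps, (images.shape[0],))
    noisy_images = scheduler.add_noise(images, noise, t)
    noise_pred = model(noisy_images, t)
    loss = nn.functional.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```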
What datasets are used for training the model as mentioned in the video?
-The model is trained on the MNIST dataset for grayscale images and a dataset of texture images for RGB images.
How does the video demonstrate the sampling process from the learned model?
-The video demonstrates the sampling process by starting with a noise sample and iteratively applying the reverse process using the model's noise predictions to gradually refine the image towards the original.
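A hedged sketch of that sampling loop; `scheduler.sample_prev_timestep` is assumed to return the previous-step sample and the current estimate of the original image, matching the scheduler's described interface:

```python
import torch

@torch.no_grad()
def sample(model, scheduler, num_timesteps=1000, img_shape=(1, 1, 28, 28)):
    """Iterate the reverse process from pure noise down to t = 0."""
    xt = torch.randn(img_shape)  # start from a normal-distribution sample
    for t in reversed(range(num_timesteps)):
        t_batch = torch.full((img_shape[0],), t, dtype=torch.long)
        noise_pred = model(xt, t_batch)
        xt, x0_pred = scheduler.sample_prev_timestep(xt, noise_pred, t_batch)
    return xt
```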
What are the computational requirements mentioned for training on larger images in the video?
-The video mentions that training on larger images requires more patience and computational resources, suggesting the need for increased channels and longer training epochs for better results.
Outlines
🎥 Introduction to Implementing Diffusion Models
The speaker introduces the video's focus on implementing diffusion models, specifically starting with a DDPM (Denoising Diffusion Probabilistic Models) and later moving to stable diffusion with text prompts. The video aims to cover the training and sampling aspects of DDPM. The architecture implemented will be based on the latest diffusion models rather than the original DDPM. The speaker plans to delve into the different blocks of the architecture before coding and showcasing the results on grayscale and RGB images. A brief mention of the specific math required for implementation is made, suggesting viewers watch a linked diffusion math video for a deeper understanding. The diffusion process is explained as a forward process that adds Gaussian noise to an image step by step, eventually making it indistinguishable from pure noise. The reverse process is what the model learns, with the goal of predicting the mean and variance of the noise. The training method involves sampling an image at a time step T, adding noise, and feeding the noisy image to the model. The loss function is based on the mean squared difference between the predicted noise and the original noise sample.
🔍 Deep Dive into the Noise Scheduler and Model Architecture
The speaker discusses the creation of a noise scheduler, which is crucial for the forward and reverse processes in diffusion models. The scheduler computes the noisy version of an image given an original image, noise sample, and time step. It also computes the mean and variance for the reverse process, using the reparameterization trick for sampling. The implementation of a linear noise scheduler is described, where beta values are linearly scaled. The model architecture for diffusion models is explored, emphasizing the need for an architecture that can incorporate time step information. The speaker chooses to use a UNet-like architecture, similar to what is used in stable diffusion models, to allow for code reuse in future videos. The time embedding block, which represents time steps as vectors for the model, is explained. The video then delves into the specifics of the UNet model, including downsampling blocks, mid blocks, and upsampling blocks, each with their respective components like normalization, activation, and convolutional layers. The importance of fusing time step information within the model is highlighted.
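The scheduler's reverse step described above can be sketched as follows. The formulas follow the standard DDPM posterior (mean from the noise prediction, fixed variance, reparameterization trick); variable names are my own:

```python
import torch

def sample_prev_timestep(xt, noise_pred, t, betas, alpha_cum_prod):
    """Given x_t and a noise prediction, sample x_{t-1} and estimate x_0."""
    alpha_t = 1.0 - betas[t]
    a_bar_t = alpha_cum_prod[t]
    # estimate of the original image from x_t and the predicted noise
    x0 = (xt - torch.sqrt(1 - a_bar_t) * noise_pred) / torch.sqrt(a_bar_t)
    # mean of the reverse distribution
    mean = (xt - (betas[t] / torch.sqrt(1 - a_bar_t)) * noise_pred) / torch.sqrt(alpha_t)
    if t == 0:
        return mean, x0  # no noise is added at the final step
    a_bar_prev = alpha_cum_prod[t - 1]
    variance = betas[t] * (1 - a_bar_prev) / (1 - a_bar_t)
    # reparameterization trick: sample = mean + sigma * z
    return mean + torch.sqrt(variance) * torch.randn_like(xt), x0
```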
🛠️ Coding the Diffusion Model's Components
The speaker begins coding the sinusoidal position embedding, which converts integer time steps into vector representations. This is followed by the implementation of the down block, which includes ResNet blocks, self-attention layers, and down-sampling layers. Each component's role in the model is explained, with a focus on how they contribute to the model's ability to process images at different resolutions. The code for the mid block and up block is also discussed, with the speaker noting that these blocks follow a similar structure to the down block but with up-sampling instead of down-sampling. The speaker emphasizes the configurability of the code, allowing for multiple layers and different configurations based on the model's requirements.
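The single-layer down block described above (ResNet block, self-attention, downsample, with a time projection) can be condensed into a sketch like this. Channel counts, group sizes, and the stride-2 downsampling kernel are illustrative choices, not the video's exact specification:

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One-layer down block: ResNet + self-attention + optional downsample."""
    def __init__(self, in_ch, out_ch, t_emb_dim, down_sample=True, num_heads=4):
        super().__init__()
        self.conv1 = nn.Sequential(nn.GroupNorm(8, in_ch), nn.SiLU(),
                                   nn.Conv2d(in_ch, out_ch, 3, padding=1))
        # time projection: activation followed by linear, sized to out_ch
        self.t_proj = nn.Sequential(nn.SiLU(), nn.Linear(t_emb_dim, out_ch))
        self.conv2 = nn.Sequential(nn.GroupNorm(8, out_ch), nn.SiLU(),
                                   nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.res_in = nn.Conv2d(in_ch, out_ch, 1)  # 1x1 conv for the residual
        self.attn_norm = nn.GroupNorm(8, out_ch)
        self.attn = nn.MultiheadAttention(out_ch, num_heads, batch_first=True)
        self.down = nn.Conv2d(out_ch, out_ch, 4, 2, 1) if down_sample else nn.Identity()

    def forward(self, x, t_emb):
        out = self.conv1(x)
        out = out + self.t_proj(t_emb)[:, :, None, None]
        out = self.conv2(out) + self.res_in(x)  # residual over the ResNet block
        # self-attention over the H*W spatial cells, channels as features
        b, c, h, w = out.shape
        attn_in = self.attn_norm(out).reshape(b, c, h * w).transpose(1, 2)
        attn_out, _ = self.attn(attn_in, attn_in, attn_in)
        out = out + attn_out.transpose(1, 2).reshape(b, c, h, w)
        return self.down(out)
```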
🖼️ Training and Sampling the Diffusion Model
The speaker outlines the training process for the diffusion model, starting with the setup of the dataset, model, and training configurations. The training loop is described, where batches of images are sampled, noise is added according to the noise scheduler, and the model's predictions are used to compute the loss function. Backpropagation is then performed to update the model's weights. The sampling process is also explained, where a random noise sample is generated, and the model is used to iteratively predict the original image by reversing the diffusion process. The speaker mentions training on the MNIST dataset and a texture dataset, providing results for both and discussing the training time and computational requirements.
🔚 Conclusion and Future Outlook
In conclusion, the speaker summarizes the video's content, which included the implementation of a diffusion model, the creation of a noise scheduler, and the development of a UNet-based model architecture. The speaker reflects on the training and sampling processes, providing insights into the model's performance on different datasets. The video ends with a teaser for future content, which will cover stable diffusion models. The speaker thanks the viewers for watching and encourages subscription for more content.
Keywords
💡Diffusion Models
💡DDPM
💡Stable Diffusion
💡Noise Schedule
💡Time Step
💡Residual Connection
💡Self-Attention
💡Reparameterization Trick
💡MNIST Dataset
💡RGB Images
Highlights
Introduction to implementing diffusion models with a focus on DDPM and future coverage of stable diffusion.
Explanation of the architecture used in the latest diffusion models, differing from the original DDPM.
Deep dive into the different blocks of the model architecture before coding.
Overview of the diffusion process involving a forward process to create noisier image versions.
Discussion on the scheduled noise and its role in the transition function for image transformation.
Insight into the reverse diffusion process and its functional form, aiming to learn the mean and variance of the distribution.
Derivation of the training method involving sampling an image at a time step and a noise sample.
Introduction to the noise scheduler and its role in the forward and reverse processes.
Implementation of a linear noise scheduler with pre-computed alphas and cumulative products for efficiency.
Explanation of the model requirements for diffusion models and the flexibility in architecture choice.
Details on fusing time step information into the model to aid in predicting original noise.
Introduction to the time embedding block for representing time steps in the model.
Description of the encoder-decoder architecture used in the model, including downsampling, mid, and upsampling blocks.
Discussion on the specifics of the down block, including ResNet and self-attention blocks.
Explanation of the up block, which is similar to the down block but includes upsampling instead of downsampling.
Overview of the training loop, including the process of adding noise to images and backpropagation based on the loss.
Description of the sampling process, which involves reversing the time steps and using the model's noise prediction.
Presentation of the results from training on MNIST and texture image datasets, showcasing the model's performance.
Conclusion summarizing the implementation of DDPM and the understanding gained about diffusion models.
Transcripts
in this video I'll cover the
implementation of diffusion models we'll
create ddpm for now and in later videos move to stable diffusion with text prompts
in this one we'll be implementing the
training and sampling part for ddpm for
our model we'll actually implement the
architecture that is used in latest
diffusion models rather than the one
originally used in ddpm we'll dive deep
into the different blocks in it before
finally putting everything in code and
see results of training this diffusion
model on grayscale and RGB
images
I'll cover the specific math of
diffusion models that we need for
implementation very quickly in the next
few minutes but this should only act as
a refresher so if you're not aware of it
and are interested in knowing it I would
suggest to first see my diffusion math
video that's linked
above the entire diffusion process
involves a forward process where we take
an image and create noisier versions of
it step by step by adding Gaussian noise
after a large number of steps it becomes
equivalent to a sample of noise from a
normal distribution we do this by
applying this transition function at
every time step T and beta is a
scheduled noise which we add to the
image at T minus one to get the image at
T we saw that having Alpha as 1 minus
beta and Computing cumulative products
of these Alphas at time T allows us to
jump from original image to noisy image
at any time step T in the forward
process we then have a model learn the
reverse process distribution and because
the reverse diffusion process has the
same functional form as the forward
process which here is a Gaussian we
essentially want the model to learn to
predict its mean and
variance after going through a lot of
derivation from the initial goal of
optimizing the log likelihood of The
observed data we ended with the
requirement to minimize the KL divergence between the ground truth denoising distribution conditioned on x0 which we computed as having this mean and this variance and the distribution predicted by our model we fix the variance to be exactly the same as the target distribution and rewrite the mean in the same form after this minimizing the KL divergence ends up being minimizing the squared difference between the predicted noise and the original noise
sample our training method then involves sampling an image a time step T and a noise sample and feeding the model the noisy version of this image at sampled time step T using this equation the cumulative product terms need to come from the noise scheduler which decides the schedule of noise added as we move along time steps and the loss becomes the MSE between the original noise and whatever the model
predicts for generating images we just
sample from a learned reverse
distribution starting from a noise
sample XT from a normal distribution and
then Computing the mean using the same
formulation just in terms of XT and
noise prediction and variance is same as
the ground truth denoising distribution
conditioned on x0 then we get a sample
from this reverse distribution using the
reparameterization trick and repeating
this gets us to x0 and for x0 we don't
add any noise and simply return the
mean this was a very quick overview and
I had to skim through a lot for a
detailed version of this I would
encourage you to look at the previous
diffusion
video so for implementation we saw that
we need to do some computation for the
forward and the reverse process so we'll
create a noise scheduler which will do
these two things for us for the forward
process given an image and a noise
sample and time step t it will return
the noisy version of this image using
the forward equation and in order to do
this efficiently it will store the
alphas which is just 1 minus beta and
the cumulative product terms of alpha
for all
T the authors use a linear noise scheduler where they linearly scale beta from 1e-4 to 0.02 with a thousand time steps between them and we'll also do the
same the second responsibility that this
scheduler will do is given an XT and noise
prediction from model it'll give us XT
minus one by sampling from the reverse
distribution for this it'll compute the
mean and variance according to their
respective equations and return a sample
from this distribution using the
reparameterization
trick to do this we also store 1 minus
Alpha T 1 minus the cumulative product
terms and its square root obviously we
can compute all of this at runtime as
well but pre-computing them simplifies
the code for the equation a lot so let's
implement the noise scheduler
first as I mentioned we'll be creating a
linear noise
schedule after initializing all the
parameters from the arguments of this
class we'll create betas to linearly
increase from start to end such that we
have beta T from zero till the last time
step we'll then initialize all the
variables that we need for forward and
reverse process
equations
the add_noise method is our forward
process so it will take in an image
original noise sample and time step T
the images and noise will be of B Cross
C cross H cross W and time step will be
a 1D tensor of size
B for the forward process we need the
square root of cumulative product terms
for the given time steps and 1 minus
that and then we reshape them so that
they are B cross 1 cross 1 cross 1 lastly we apply the forward process
equation the second function will be the
guide that takes the image XT and gives
us a sample from our learned reverse
distribution for that we'll have it
receive XT and noise prediction from the
model and time step t as the argument
we'll be saving the original image
prediction x0 for visualizations and get
that using this equation this can be
obtained using the same equation for
forward process that takes from x0 to XT
by just rearranging the terms and using
noise prediction instead of the actual
noise then for sampling we'll compute
the mean which is simply this
equation and as mentioned T equals 0 we
simply return the mean and noise is only
added for other time steps the variance
of that is same as the variance of
ground truth re noising distribution
condition on X zero which was
this and lastly we'll sample from a Gaussian distribution with this mean and variance using the reparameterization trick this completes the entire noise scheduler which handles the forward process of adding noise and the reverse process of sampling first let's now get
into the
model for diffusion models we are
actually free to use whatever
architecture we want as long as we meet
two
requirements the first being that the
shape of the input and output must be
same and the other is some mechanism to
fuse in time step information let's talk
about why for a bit the information of
what time step we are at is always
available to us whether we are at
training or sampling and in fact knowing
what time step we are at would Aid the
model in predicting original noise
because we are providing the information
that how much of that input image
actually is noise so instead of just
giving the model an image we also give
the time step that we are
at for the model I'll use a UNet which is
also what the authors use but for the
exact specification of the blocks
activations normalizations and
everything else I'll mimic the stable
diffusion unit used by hugging face in
the diffusers pipeline that's because I
plan to soon create a video on stable
diffusion so that will allow me to reuse
a lot of code that I'll create now
actually even before going into the UNet model let's first see how the time step
information is
represented let's call this the time
embedding block which will take in a 1D
tensor of time steps of size B which is
batch size and give us a t_emb_dim sized representation for each of those
time steps in the
batch the time embedding block would
first convert the integer time steps into some vector representation using
an embedding
space that will then be fed to two
linear layers separated by activation to
give us our final time step
representation for the embedding space
the authors use the sinusoidal position
embedding used in
Transformers for activations everywhere
I have used sigmoid linear units but you
can choose a different one as
well okay now let's get into the model
as I mentioned I'll be using a UNet just like the authors which is essentially
this encoder decoder architecture
where encoder is a series of
downsampling blocks where each block
reduces the size of the input typically
by half and increases the number of
channels the output of final down
sampling block is passed to layers of
mid block which all work at the same
spatial resolution and after that we
have a series of upsampling
blocks these one by one increase the
spatial size and reduce the number of
channels to ultimately match the input
size of the model the upsampling blocks also fuse in the output coming from the
corresponding down sampling block at the
same resolution via residual skip
connections most of the diffusion models
usually follow this UNet architecture
but differ based on specifications
happening inside the blocks and as I
mentioned for this video I've tried to
mimic to some extent what's happening
inside the Stable Diffusion UNet from
hugging
face let's look closely into the down
block and once we understand that the
rest are pretty easy to
follow down blocks of almost all the variations would be a ResNet block followed by a self attention block and then a down sample layer for our ResNet plus self attention block we'll have group norm followed by activation followed by a convolutional layer the
output of this will again be passed to a
normalization activation and
convolutional layer we add a residual
connection from the input of first
normalization layer to the output of
second convolutional
layer this entire thing is what will be called a ResNet block which you can think of as two convolutional blocks plus a residual connection this is then followed by a normalization and a self attention layer and again a residual connection we have multiple such ResNet plus self attention layers but for
Simplicity our current implementation
will only have one layer the code on the
repo however will be configurable to
make as many layers as
desired we also need to fuse the time information and the way it's done is that each ResNet block has an activation followed by a linear layer and we pass the time embedding representations through them first before adding to the output of the first convolutional layer so essentially this linear layer is projecting the t_emb_dim time step representation to a tensor of the same size as the channels in the convolutional layer's output that way
these two can be added by replicating
this time step representation across the
spatial
Dimension now that we have seen the
details inside the block to simplify
let's replace everything within this
part as a ResNet block and within this as a self attention block
the other two blocks are using the same
components and just slightly different
let's go back to our previous
illustration of all three
blocks we saw that down block is just multiple layers of ResNet followed by self attention and lastly we have a down sampling layer up block is exactly the same except that it first upsamples the input to twice the spatial size and then concatenates the down block output of the same spatial resolution across the channel dimension post that it's the same layers of ResNet and self attention blocks the layers of mid block
always maintain the input to the same
spatial resolution the hugging face
version has first one ResNet block and then followed by layers of self attention and ResNet so I also went ahead and made the same implementation and let's not forget the time step information for each of these ResNet blocks we have a time step projection layer this was what we just saw an activation followed by a linear layer the existing time step representation goes through these blocks before being added to the output of the first convolution layer of the ResNet block let's see how all of this looks in
code the first thing we'll do is
implement the sinusoidal position
embedding code this function receives B
sized 1D tensor of time steps where B is the batch size and is expected to return a B cross t_emb_dim tensor we first implement the factor part which is everything that the position which here is the time step integer value will be divided with inside the sine and cosine functions this will get us all values from 0 to half of the time embedding dimension size half because we'll concatenate sine and cosine after replicating the time step values we get our desired shape tensor and divide by the factor that we computed this is now exactly the arguments for which we have to call the sine and cosine functions again all this method does is convert the integer time step representation to embeddings using a
fixed embedding
space now we'll be implementing the down
block but before that let's quickly take
a peek at what layers we need to
implement so we need layers of ResNet plus self attention blocks ResNet will be two norm activation convolutional layers with a residual and self attention will be norm followed by self attention we also need the time projection layers which will project the time embedding onto the same dimension as the number of channels in the output of the first convolution feature map I'll only implement the block to have one layer for now and we'll only need single instances of these and after ResNet and self attention we have a down sampling okay back to coding
it for each down block we'll have these
arguments in_channels is the number of channels expected in the input out_channels is the channels we want in the output of this down block then we have the embedding dimension I also add a down_sample argument just so that we have the flexibility to ignore the down sampling part in the code lastly num_heads is the number of heads that our attention block will
have this is our first convolution block
of the ResNet we make the channel conversion from input to output channels via the first conv layer itself so after this everything will have out_channels as the number of channels then these are the time projection layers for this ResNet block remember each ResNet block will have one of these and we had seen that this was just activation followed by a linear layer the output of this linear layer should have out_channels so that we can do the addition this is the second conv block which will be exactly the same except everything operating on out_channels as the channel
dimension and then we add the attention part the normalization and multi-head attention the feature dimension for multi-head attention will be the same as the number of channels this residual connection is a 1 cross 1 conv layer and this ensures that the input to the entire ResNet block can be added to the output of the last conv layer and since the input was in_channels we have to first transform it to out_channels so this just does
that and finally we have the down sample layer which can also be average pooling but I've used convolution with stride two and if the arguments convey to not down sample then this is just an identity the forward method will be very simple we first pass the input to the first conv block and then add the time information and then after going through the second conv block we add the residual but only after passing through the 1 cross 1 conv layer attention will happen between all
the spatial H cross W cells with out_channels being the feature
dimensionality of each of those
cells so the transpose just ensures that
the channel features are the last
Dimension and after the channel
Dimension has been enriched with self
attention representation we do the
transpose back and again have the
residual
connection if we had multiple layers then we would loop over this entire thing but since we are only implementing one layer for now we'll
just call the down sampling convolution
after
this next up is mid block and again
let's revisit the illustration for
this for mid block we'll have a ResNet block and then layers of self attention followed by ResNet same as the down block we'll only implement one layer for now
the code for mid block will have the same kind of layers but we need two instances of every layer that belongs to the ResNet block so let's just put all of that in the forward method will have just one difference that is we call the first ResNet block and then self attention and the second ResNet block had we implemented multiple layers the self attention and the following ResNet block would have a loop now let's do up block which will be
exactly same as down block except that
instead of down sampling we'll have a
upsampling
layer we'll use conv transpose to do the upsampling for us in the forward method let's first copy everything that we did for down block then we need to make three changes add the same spatial resolution down block output as an argument then before the ResNet plus self attention blocks we'll upsample the input and concat the corresponding down block output another way to implement this could be to first concat followed by ResNet and self attention and then upsample but I went with this
one finally we'll build our UNet class it will receive the channels in the input image as an argument we'll hardcode the down channels and mid channels for
now the way the code is implemented is
that these four values of down channels
will essentially be converted into three
down blocks each taking input of Channel
I dimensions and converting it to Output
of Channel i+ 1
dimensions and same for the mid
blocks this is just the down sample
arguments that we are going to pass to
the
blocks remember our time embedding block
had position embedding followed by
linear layers with activation in between
these are those two linear
layers this is different from the time step layers which we had for each ResNet block this will only be called once in an entire forward pass right at the start to get the initial time step
representation we'll also first have to
convert the input to have the same
channel Dimensions as the input of first
down block and this convolution will
just do that for us we then create the
down blocks mid blocks and up blocks
based on the number of channels
provided
for the last up block I simply hardcode
the output Channel as
16 the output of the last up block undergoes a normalization and convolution to get us to the same number of channels as the input
image we'll be training on the MNIST dataset so the number of channels in the input image would be one in the forward
method we first call the conv_in layer and then get the time step
representation by calling the sinusoidal
position embedding followed by our
linear
layers then we just call the down blocks
and we keep saving the output of down
blocks because we need it as input for
the up
block during up block calls we simply take down outputs from that list one by one and pass them together with the current
output and then we call our
normalization activation and output
convolution once we pass a 4 cross 1
cross 28 cross 28 input tensor to this
we get the following output
shapes so you can see because we had
down sampled only twice our smallest
size input to any convolution layer is 7
cross
7 the code on the repo is much more
configurable and creates these blocks
based on whatever configuration is
passed and can create multiple layers as
well we'll look at a sample config file
later but first let's take a brief look
at the data set training and sampling
code the data set class is very simple
it just takes in the path where the
images are and then stores the file name
of all those images in
there right now we are building
unconditional diffusion model so we
don't really use the
labels then we simply load the images and convert them to tensors and we also scale them from -1 to 1 just like the authors so that our model consistently sees similarly scaled images as compared to the random
noise moving to the train_ddpm file where the train function loads up the config and gets the model dataset diffusion and training configurations from it we then instantiate the noise scheduler dataset and our
model after setting up the optimizer and
the loss functions we run our training
Loop
here we take our image batch sample
random noise of shape B cross 1 cross H cross W and sample random time steps the
scheduler adds noise to these batch
images based on the sample time steps
and we then back propagate based on the
loss between noise prediction by a model
and the actual noise that we
added for sampling similar to training
we load the config and necessary
parameters our model and noise scheduler the sample method then creates a random noise sample based on the number of images requested and then we go through the time steps in reverse for each time step we get our model's noise prediction and call the reverse process of the scheduler that we had created with this XT and noise prediction and then it returns the mean of XT minus one and an estimate of the original image we can
choose to either save one of these to
see the progress of
sampling now let's also take a look at
our config file this just has the data
set parameters which stores our image
path model params which stores
parameters necessary to create model
like the number of channels down
channels and so on like I had mentioned
we can put in the number of layers
required in each of our down mid and up
blocks and finally we specify the
training
parameters the UNet class in the repo has blocks which actually read this config and create the model based on whatever configuration is provided it does everything similar to what we just implemented except that it loops over the number of layers as
well and I've also added shapes of the
output that we would get at each of
those block calls so that it helps a bit
in understanding
everything for training as I mentioned I train on MNIST but in order to see if everything works for RGB images I also train on this dataset of texture images
because I already have it downloaded
since my video on implementing di there
is a sample of images from this data set
these are not generated these are images
from the data set
itself though the dataset has 256 cross 256 images I resized the images to be 28 cross 28 primarily because I lack two important things for training on larger sized images patience and compute rather cheap compute for MNIST I train it for about 20 epochs taking 40 minutes on a V100 GPU and for this texture dataset I train for about 60 epochs taking roughly about 3 hours and that gives me these
results
here I'm saving the original image
prediction at each time step and you can
see that because MNIST images are all similar looking the model pretty quickly gets a decent original image prediction whereas for the texture dataset it doesn't until about the last 200 to 300 time
steps but by the end of all the steps we
get decent results for both the data
sets you can obviously train it on a
larger size data set though probably you
would have to maybe increase the
channels and maybe train for longer
epochs to get nice results
so that's all that I wanted to cover for
implementing ddpm we went through the scheduler implementation the UNet implementation and saw how everything comes together in the training and sampling code hopefully it gives you a better understanding of diffusion models and thank you so much for watching this video and if you're liking the content and getting benefit from it do subscribe to the channel see you in the next
video