How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile
Summary
TLDR: The video explains how images are generated with diffusion models. Compared with traditional generative adversarial networks (GANs), diffusion models generate images by iteratively adding and removing noise, which makes training more stable and easier to control. By combining text embeddings with classifier-free guidance, a diffusion model can generate images that match a text prompt, although the output may still need refinement. Training these models is expensive, but free resources such as Stable Diffusion are available, and users can experiment with them on platforms like Google Colab.
Takeaways
- 🖼️ Diffusion models are a newer way of generating images; unlike generative adversarial networks (GANs), they produce an image by progressively removing noise.
- 🔍 The core idea is to gradually turn a clean image into noise, then train a neural network to reverse that process and generate new images.
- 🎨 During training, images are perturbed with varying amounts of noise; the strategy for how much noise to add at each step is called the noise schedule.
- 🤖 The network learns to predict the noise that was added to an image so it can be removed; running this denoising process to generate images is called inference.
- 📈 Training diffusion models requires large amounts of image data and compute, but free tools and platforms such as Google Colab let individual users experiment.
- 📚 By feeding text embeddings into the network during training, a diffusion model can generate images matching a given text description; this is known as conditioning.
- 🌐 Diffusion models can create high-quality images, but generation takes many iterations, each step moving closer to a clean image.
- 🔧 The network weights are shared across steps: the same network is used at every iteration of the generation loop, which keeps things efficient.
- 💡 A key technique is classifier-free guidance, which compares the noise predictions made with and without the text embedding and amplifies the difference to strengthen the match to the text.
- 🚀 Although diffusion models are technically involved, entry-level code is simple: a single Python function call can generate an image.
- 💸 Training these models is costly, but free resources mean individual users can still experiment and learn.
Q & A
What is a diffusion model?
-A diffusion model is a generative model that gradually turns a data point (such as an image) into random noise by adding noise step by step, then learns to reverse the process, generating a clean data point from random noise. This approach has proved very effective at generating high-quality images.
How does a generative adversarial network (GAN) work?
-A GAN has two parts: a generator and a discriminator. The generator's job is to produce images that look like real data, while the discriminator's job is to judge whether an image is real or was produced by the generator. The two networks compete during training: the generator tries to produce ever more convincing images, while the discriminator gets better at telling real from fake.
What advantages do diffusion models have over GANs?
-Diffusion models are easier to train, more stable, and can generate high-resolution images. GANs can suffer from problems such as mode collapse during training; diffusion models avoid these by generating images through gradual noise removal.
How is the amount of added noise decided in a diffusion model?
-The amount of noise added usually follows a predefined noise schedule. The schedule defines how much noise is added at each step; it can be linear, or non-linear, for example adding little noise at first and ramping up later.
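Because Gaussian noise composes predictably across steps, the noisy image at any step can be written in closed form. In the standard DDPM formulation (the video describes the idea but not the formula), with $\bar\alpha_t$ determined by the schedule:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,I)$$

This is why training can jump straight to any time step without simulating the steps in between.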
How does a diffusion model recover the original image from a noisy one?
-The model is trained to predict the noise in an image and remove it. During training it learns, from images with varying amounts of noise, to estimate the noise that was added. At generation time it repeatedly predicts and removes noise, each iteration yielding a slightly less noisy image, until a clean image emerges.
How do you generate an image with specific content?
-By conditioning the diffusion model. For example, a text description can be fed in as a conditioning input, and the model will generate images that match the description.
What is classifier-free guidance?
-Classifier-free guidance is a technique for making generated images match the text description more closely. The image is passed through the model twice, once with the text embedding and once without; the difference between the two noise predictions is amplified and used to steer generation towards the text description.
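In the usual formulation (again, the idea as described, not a formula given in the video), the two noise predictions are combined as

$$\hat\epsilon = \epsilon_{\text{uncond}} + s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$

where $s > 1$ is the guidance scale; $s = 1$ recovers the ordinary conditioned prediction.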
How expensive is it to train a diffusion model?
-Training is expensive, requiring huge image datasets and a lot of compute. However, free platforms such as Google Colab make pretrained models available to use.
Is the code for diffusion models complicated?
-It can be quite simple: some versions generate an image with a single Python function call. For deeper understanding and customisation, more detailed code with the full sampling loop and the injection of conditioning inputs is useful.
How are weights shared in a diffusion model?
-The same network weights are used at every step of the generation loop, rather than a separate network per time step. This keeps the model efficient.
What are the practical uses of diffusion models?
-Diffusion models can generate high-quality images such as artwork and synthetic photographs. They can also serve as image-processing tools, for example for denoising or image enhancement.
Outlines
🎨 Generating images with diffusion models
This section introduces generating images with diffusion models. Compared with traditional GANs, diffusion models work by progressively adding noise to images and training a network to reverse the process. This avoids problems such as mode collapse and breaks image generation into small iterative steps.
🔍 Noise schedules and training
This section discusses strategies for adding noise, including linear and ramped schedules. It explains how the network is trained to predict and remove noise, with the time step telling the network how much noise to expect, and why predicting the noise, rather than generating the clean image directly, makes training easier.
📝 Conditional generation and text embeddings
This section describes how conditioning steers the diffusion model towards specific content, for example by feeding in text embeddings, and how classifier-free guidance strengthens the network's response to the text so the output matches the description more closely.
💻 Accessibility and cost
The final section discusses accessibility: although training these models is expensive, free resources such as Stable Diffusion exist and can be run on platforms like Google Colab. The presenter describes his own experiments on Colab, and how sharing network weights across steps keeps compute requirements manageable.
Keywords
💡Generative Adversarial Network (GAN)
💡Diffusion Models
💡Noise
💡Iteration
💡Conditional Generation
💡Text Embedding
💡Classifier-Free Guidance
💡Google Colab
💡Mode Collapse
💡Time Step
💡Embedding
Highlights
Introduces generating images with diffusion models.
Reviews generative adversarial networks (GANs) as the previous standard approach to image generation.
Explains how the generator network in a GAN produces images.
Describes the role of the discriminator network in a GAN.
Discusses the mode-collapse problem in GAN training.
Introduces how diffusion models break image generation into small steps.
Explains noise-addition schedules in diffusion models.
Discusses training a network to reverse the noising process.
Describes the iterative process of gradually removing noise.
Explains how conditioning the network generates specific content.
Introduces how text embeddings influence the generated image.
Discusses how classifier-free guidance improves the output.
Mentions free access to diffusion models via Google Colab.
Highlights the weight-sharing mechanism in diffusion models.
Notes the ease of use: a single Python function call can generate an image.
Transcripts
Generating images using diffusion: what is that? Right, so I should probably find out. It's things like DALL-E and DALL-E 2, yeah, Imagen from Google, Stable Diffusion now as well. I've spent quite a long time messing about with Stable Diffusion; I'm having quite a lot of fun with it. So what I thought I'd do is download the code, read the paper, work out what's going on, and then we can talk about it.
I delved into this code and realised there's actually quite a lot to these things. It's not so much that they're complicated; it's just that there are a lot of moving parts. So let's have a quick reminder of generative adversarial networks, which were, I suppose, until now the standard way of generating images, and then we can talk about how it's different and why we're doing it using diffusion.
Having a network, some deep network, trained to just produce the same image over and over again is not very interesting, so we have some kind of random noise that we use to make it different each time. We have some kind of very large generator network, which I'm going to treat as a black box: a big neural network that turns out an image which hopefully looks like the thing we're trying to produce: faces, landscapes, people. Is that how those anonymous people on This Person Does Not Exist are made? Yeah, that's exactly how they work; I think that site uses StyleGAN, and it's that exact idea: it's trained on a large corpus of faces and it just generates faces at random, or at least mostly at random.
The way we train this is that we have millions and millions of pictures of whatever we're trying to produce. We give the generator noise, it produces an image, and we have to tell it whether that's good or bad; we need to give the network some instruction on whether the image actually looks like a face, otherwise it isn't going to train. So we have another network, which is sort of the opposite: it says, is this a real image or a fake one? Half the time we give it fake images and half the time we give it real faces. This discriminator trains and gets better at telling apart the fake images produced by the generator from the real images in the training set, and in doing so the generator has to get better at faking them, and so on and so forth. The hope is that they both just get better and better.
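As a rough sketch of that adversarial loop (illustrative PyTorch-style code, not anything from the video; `make_generator`, `make_discriminator` and `dataloader` are hypothetical stand-ins):

```python
import torch
import torch.nn.functional as F

# make_generator / make_discriminator / dataloader are hypothetical
# stand-ins: any generator mapping noise to an image, and any
# discriminator mapping an image to a real/fake logit, will do.
G = make_generator()      # noise (B, 100) -> image
D = make_discriminator()  # image -> logit (B, 1)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for real in dataloader:      # batches of real training images
    b = real.size(0)
    z = torch.randn(b, 100)  # random noise makes each output different
    fake = G(z)

    # Discriminator: label real images 1 and generated images 0.
    d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(b, 1))
              + F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(b, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator call its fakes real.
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```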
All right. Now, that kind of works. The problem is that GANs are very hard to train. You have a lot of problems with things like mode collapse, where the generator just produces the same face over and over: if it produces a face that fools the discriminator every time, there's not a lot of incentive for the generator to do anything more interesting, because it has solved the problem; it's beaten the discriminator, let's move on. If you're not careful with your training process, these kinds of things can happen. And intuitively, I suppose, it's quite difficult to go from a bit of noise to a really beautiful-looking, high-resolution image in one step without there being some oddities, some things that go a bit wrong. So what we're going to do in diffusion models is try to simplify this process into a kind of iterative, small-step situation, where the work the network has to do each step is slightly smaller, and you just run it a few times to try to make the result better.
We'll start again on the paper so we can clean things up a bit. We've got an image; let's say it's an image of a rabbit. We add some noise, so we've got the same rabbit with some noise on it (it's not really speckly noise, but I can't draw Gaussian noise). Then we add another bit of noise, and it's the same shape rabbit with a bit more noise, and we keep going until we end up with just noise; it looks like nonsense. So the question is: how do we craft a training algorithm, and the inference, so that we can actually deploy a network that undoes this process?
The first question is how much noise to add. Why don't we just add loads of noise at once? Just delete all the intermediate images, add loads of noise, and say 'give me the original back': then you've got a pair of training examples you could use. The answer is that it would kind of work, but it's a very difficult job, and you're in the same boat as with the GAN: you're trying to do everything in one go. The intuition is that it's maybe slightly easier to go from one step back to the previous one, removing just a little bit of noise at a time. Now, in traditional image processing there are noise-removal techniques, and it isn't difficult to do that, is it? Well, it is difficult in the sense that you don't know what the original image was. What we're trying to do is train a network to undo this process; that's the idea. If we can do that, then we can start with pure random noise, a bit like with a GAN, iterate the process, and produce an image. There are a lot of missing parts here, though, so let's start building up the complexity a little bit.
Okay, so the first thing is: let's go back to our question of how much noise to add. We could add a small amount of noise, then the same amount again, and the same amount again, and keep adding it until we have what is essentially random noise. That would be what we'd call a linear schedule: the same amount of noise at each step. It's not very interesting, but it works. The other thing you could do is add very little noise at the beginning and then ramp up the amount you add later; there are different strategies, depending on which paper you read, for the best approach to adding noise. This is called the schedule. The idea is you have a schedule that says: this is the image at time t equals nought, this is t equals one, and so on up to some capital T, which is the final number of steps. That last image is essentially all noise, the intermediate ones have some amount of noise, and you can change how much each step adds. The nice thing is that, because Gaussians add together very nicely, you can say 'I want t equals seven' and jump straight to t=7 with exactly the right amount of noise, without producing all the intermediate images. So when you train, you can give the network random images from your training set with random amounts of noise added according to the schedule, varying randomly between 1 and T, and say: okay, here's a really noisy image, undo it; here's a slightly less noisy image, undo it.
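A minimal sketch of that jump-straight-to-step-t trick, using the standard closed-form expression from the DDPM paper (the schedule values here are illustrative, not the video's):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # per-step noise amounts: the schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction at each step

def add_noise(x0, t):
    """Jump straight to step t: return the noisy image and the noise used."""
    eps = torch.randn_like(x0)                 # Gaussian noise, same shape as the image
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps, eps
```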
So what you do is take your noisy image (I'm going to keep going with this rabbit; it's taller than it was before) at some time, let's say t equals five. You have a giant U-Net-shaped network; we've talked about encoder-decoder networks before, and there's nothing particularly surprising about this one. You also put in the time, because if you're running a schedule where different times have different amounts of noise, you need to tell the network where it is, so it knows whether it has to remove a lot of noise this time or just a little bit.
What do we produce as output? We could go the whole hog and just produce the original rabbit image, but then you've got a situation where you have to go from pure noise all the way back to the rabbit, and that's a little bit difficult. Mathematically it works out a little easier if we just try to predict the noise: we want to know what noise was added to this image, which you could use to get back to the original. So that's all the noise from t equals one, two, three, four and five, and the hope is you just get noise out, with no rabbit. Theoretically you could then take that away from the noisy image and get the rabbit back. If you did that from the noisiest end you'd find it's a little bit iffy, because predicting all the noise all the way back to the rabbit is quite difficult; from a less noisy image it's maybe not so difficult.
We want to predict the noise. One option would be to predict the noise at, say, t equals five that takes us back to t equals four, then t equals three, then t equals two. The problem with that is you're stuck doing the exact time steps of the schedule you trained with: if you used a thousand time steps in training, you've now got to use a thousand time steps at inference; you can't speed it up. So instead we say: whatever time step you're at, you've got some amount of noise; predict all of it, give me back the total noise I can take away to get back to the original image. And that's what we do. During training we pick a random source image, we pick a random time step, and we add the amount of noise our schedule prescribes. We have a noisy image and a time step t; we put those into the network and ask: what was the noise we just added to that image? We haven't given it the original image; that's what's difficult about this. We have the original image, without any noise on it, that we're not showing it; we added some noise, and we want that noise back. We can do that very easily: we've got millions, or billions, of images in our dataset, we can add random bits of noise and ask 'what was that noise?', and over time the network starts to build up a picture of what that noise is.
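One training step, as just described, might look like this (a sketch that reuses the `add_noise` helper above; `unet` is a placeholder for the encoder-decoder network):

```python
import torch
import torch.nn.functional as F

# `unet` is a placeholder for the U-Net; it takes (noisy image, time step)
# and returns a noise estimate of the same shape as the image.
def training_step(unet, x0):
    t = torch.randint(0, T, (1,)).item()  # pick a random time step
    noisy, eps = add_noise(x0, t)         # noise the image up to step t
    eps_pred = unet(noisy, t)             # ask: what noise did we just add?
    return F.mse_loss(eps_pred, eps)      # penalise the difference
```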
So it sounds like a really good noise-removal plug-in for Photoshop or something. How does that turn into creating new images? In some sense that's the clever bit: how we use this network, which predicts noise, to undo the noise. We've got a network which, given an image with some noise added and a time step that represents roughly how much noise that is, or where we are in the noising process, produces an estimate of all the noise in that image. Theoretically, if we take that noise away, we get back the original image. Now, that is not a perfect process: the network isn't going to be perfect, so if you give it an incredibly noisy image and take away what it predicts, you'll get maybe a vague shape. So what we want to do is take it a little bit more slowly. We take the predicted noise and subtract it from our image to get an estimate of the original image, at t equals nought. It won't look very good the first time. But then we add a bunch of the noise back again and get to a t slightly less than the one we started at: if this was t equals ten, maybe we add nine tenths of the noise back and get to roughly t equals nine. Now we have a slightly less noisy image, and we can repeat the process: put the slightly less noisy image in, predict how to get back to t nought, add back most but not all of the noise, and repeat. Each time we loop, we get a little bit closer to the original image. It was very difficult to predict the noise at t equals ten; it's slightly easier at t equals nine, and very easy at t equals one, because by then it's mostly image with a little bit of noise on it. If we sort of feel our way towards it, taking off little bits of noise at a time, we can actually produce an image.
So you start with a noisy image, you predict all the noise and remove it, and then add back most of it. At each step you have an estimate of what the original image was, and a next image that's just a little bit less noisy than the one before, and you loop this a number of times. That's basically how the image generation process works: you take your noisy image, you loop, and you gradually remove noise until you end up at what the network thinks was the original image. And you're doing it by predicting the noise and taking it away, rather than spitting out an image with less noise directly, which mathematically works out a lot easier to train and is a lot more stable than a GAN.
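Put together, the generation loop might be sketched like this. Note this is a simplification in the spirit of the description above: it draws completely fresh noise when stepping back to t-1, whereas real samplers (DDPM, DDIM) keep part of the predicted noise:

```python
def sample(unet, shape):
    x = torch.randn(shape)                    # start from pure Gaussian noise
    for t in reversed(range(1, T)):
        eps_pred = unet(x, t)                 # estimate all the noise in x
        a = alpha_bar[t].sqrt()
        b = (1.0 - alpha_bar[t]).sqrt()
        x0_est = ((x - b * eps_pred) / a).clamp(-1, 1)  # estimate of the clean image
        x, _ = add_noise(x0_est, t - 1)       # add most of the noise back
    return x0_est
```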
There's an elephant in the room here. There is? You're kind of talking about how to make random images, effectively; how do we direct this? So that's where the complexity starts ramping up. We've got a structure where we can train a network to produce random images, but it's not guided: there's no way of saying 'I want a frog-rabbit hybrid' (which I've done, and it's very weird). So how do we do that? The answer is that we condition this network; that's the word we'd use. We basically give it access to the text as well.
All right, so let's actually run inference on an image, on my piece of paper (bearing in mind the output is going to be hand-drawn by me, so it's going to be terrible). You start off with a random noise image; this is just an image you've generated by taking random Gaussian noise. Mathematically this is centred around zero, so you have negative and positive numbers; you don't go from 0 to 255, because it's just easier for the network to train this way.
You put in your time step. Let's say you're going to do 50 iterations; so we put in a time step right at the end of our schedule, time step 50, our most-noised image, pass it through the network, and say 'estimate the noise'. And we also take our string: 'frogs on stilts'. (I'll have to try that later. We could spend, say, another 20 or 30 minutes producing frogs on stilts.) We embed this using our GPT-style transformer embedding, we stick that in as well, and the network produces an estimate of how much noise it thinks is in that image.
Now, that estimate at t equals 50 is going to be a bit average: it's not going to produce you a frog-on-stilts picture, it's going to produce something like a grey image or a brown image, because that is a very, very difficult problem to solve. However, if you subtract this noise from this image, you get your first estimate of what the image is, and when you add back a bunch of noise you get to t equals 49. So now we've got slightly less noise, and maybe there's the vaguest outline of a frog on a stilt. This is t equals 49; you take your embedding and you put it in as well, and you get another, maybe slightly better, estimate of the noise in the image. And then we loop (it's a for loop; we've done those before): you take the output, you subtract it, you add noise back, and you repeat the process, and you keep feeding in this text embedding.
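Schematically, the only change from the unconditional loop sketched earlier is one extra input (`text_encoder` here stands in for whatever GPT-style transformer the model actually uses):

```python
text_emb = text_encoder("frogs on stilts")  # placeholder for the transformer text encoder
eps_pred = unet(x, t, text_emb)             # noise estimate, now conditioned on the prompt
```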
Now there's one final trick they use to make things a little bit better. If you just do this, you'll get a picture that maybe looks slightly frog-like, and maybe there's a stilt in it, but it won't look anything like the images you see on the internet produced by these tools, because they do another trick to tie the output even more closely to the text. What you do is something called classifier-free guidance: you actually put the image in twice, once including the embeddings of the text, and once without. The network is maybe slightly better at estimating the noise when it has the text, so you put in two images: this one with the embedding, and this one with no embedding. The no-embedding prediction is maybe slightly more random, and the other is slightly more frog-like, or at least slightly moving towards the right thing. We can calculate the difference between these two noise predictions and amplify that signal, and then feed that back. Essentially we say: if this network wasn't given any information about what was in the image, and then this version was, what's the difference between those two predictions, and can we amplify it as we loop, to really target this kind of output? The idea is that you're really forcing this network, or this loop, to point in the direction of the scene we want. That's called classifier-free guidance, and it is somewhat of a hack bolted on at the end, but it does work.
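The combination step has a standard form (a sketch; the video doesn't give numbers, but a guidance scale around 7 to 8 is a common default in public Stable Diffusion code):

```python
eps_uncond = unet(x, t, empty_emb)  # empty_emb: embedding of an empty prompt (placeholder)
eps_cond = unet(x, t, text_emb)     # prediction with the prompt
guidance_scale = 7.5                # how hard to push towards the text
eps_pred = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```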
If you turn it off, which I've done, it produces vague sorts of structures that kind of look right; it's not terrible. I think I did 'a Muppet cooking in the kitchen' and it just produced me a picture of a generic kitchen with no Muppet in it. But if you do this, then you suddenly are really targeting what you want.
Standard question, got to ask it: is this something people can play with, without just going to one of these websites and typing some words? Well, yeah. The thing is, it costs hundreds of thousands of dollars to train one of these networks, because of how many images they use and how much processing power they use. The good news is that there are ones like Stable Diffusion that are available to use for free, and you can use them through things like Google Colab. I did this through Google Colab, and it works really, really well; maybe we'll talk about that in another video, where we delve into the code and see all of these bits happening within it. I blew through my free Google allowance very, very quickly, and I had to pay my eight pounds for premium Google access. So, you know, eight pounds! Never let it be said I spare any expense: I spare no expense on Computerphile, getting access to proper compute hardware. Could Beast do something like that? It could, yeah; almost all of our servers could. I'm just a bit lazy and haven't set them up to do so. But actually the code is quite easy to run: with the entry-level version of the code you can literally just call one Python function and it will produce you an image. I'm using code which is perhaps a little bit more detailed; it's got the full loop in it, and I can go in and inject things and change things so I can understand it better. We'll talk through that next time, perhaps.
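For reference, the entry-level, one-function route he mentions looks roughly like this using the freely available Stable Diffusion weights through Hugging Face's diffusers library (one common setup; the model name and parameter values are illustrative):

```python
from diffusers import StableDiffusionPipeline

# Downloads the freely available Stable Diffusion weights (several GB).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")  # needs a GPU, e.g. the free tier on Google Colab

# One call runs the whole loop: text embedding, iterative denoising
# with classifier-free guidance, and decoding to an image.
image = pipe("frogs on stilts", num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("frogs_on_stilts.png")
```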
The only other interesting thing about these networks is that the weights here, here and here are shared: it's the same network, applied at every step of the loop, rather than a different network for each time step.