# How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile

### Summary

This video walks through how diffusion models generate images. Compared with traditional generative adversarial networks (GANs), diffusion models produce images by iteratively adding and then removing noise, which makes training more stable and easier to control. By combining text embeddings with classifier-free guidance, a diffusion model can generate images that match a text prompt, although the results may still need refinement. Training these models is expensive, but free resources such as Stable Diffusion are available, and users can experiment with them on platforms like Google Colab.

### Takeaways

- 🖼️ Diffusion models are a newer approach to image generation; unlike GANs, they build an image by progressively removing noise.
- 🔍 The core idea is to gradually turn a clean image into noise, then train a neural network to reverse that process and so generate new images.
- 🎨 During training, images are perturbed with varying amounts of noise; the strategy governing how much noise is added at each step is called the noise schedule.
- 🤖 The network learns to predict the noise that was added to an image so it can be removed; running the trained network to generate images is called inference.
- 📈 Training diffusion models requires large image datasets and substantial compute, but free tools and platforms such as Google Colab let individual users experiment with them.
- 📚 By injecting text embeddings during training, a diffusion model can generate images that match a given text description; this is known as conditional generation.
- 🌐 Diffusion models can create high-quality images, but generation takes many iterations that gradually approach a clean image.
- 🔧 The network weights are shared: the same network is used at every step of the generation process, which improves efficiency.
- 💡 A key technique is classifier-free guidance, which compares the noise predicted with and without the text embedding and amplifies the difference to strengthen the match to the text.
- 🚀 Although diffusion models are technically involved, basic usage is simple: an image can be generated by calling a single Python function.
- 💸 Training these models is costly, but free resources mean individual users can still experiment and learn.

### Q & A

### What is a diffusion model?

A diffusion model is a generative model that gradually adds noise to a data point (such as an image) until it becomes random noise, then learns to reverse the process, generating a clean data point from random noise. This approach has proved effective at producing high-quality images.

### How does a generative adversarial network (GAN) work?

A GAN has two parts: a generator and a discriminator. The generator produces images intended to look like real data, while the discriminator judges whether an image is real or generated. The two networks compete during training: the generator tries to produce ever more convincing images, and the discriminator gets better at telling real from fake.

### What advantages do diffusion models have over GANs?

Diffusion models are easier and more stable to train than GANs, and they can generate high-resolution images. GAN training can suffer from problems such as mode collapse; by generating an image through many small denoising steps, diffusion models avoid these issues.

### How is the amount of added noise decided in a diffusion model?

The amount of noise added at each step follows a predefined noise schedule. The schedule can be linear, adding the same amount at every step, or non-linear, adding little noise at first and ramping up later.
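
As a concrete illustration (a sketch of my own, not code from the video), here are the two kinds of schedule side by side, expressed as the per-step noise variance `beta_t`; the specific endpoint values are assumptions:

```python
import numpy as np

T = 1000  # number of diffusion steps (an assumed value)

# Linear schedule: the same-sized increment of noise variance each step.
betas_linear = np.linspace(1e-4, 0.02, T)

# A ramped alternative: small steps early, larger ones later
# (a simple quadratic ramp, purely illustrative).
betas_ramped = np.linspace(np.sqrt(1e-4), np.sqrt(0.02), T) ** 2

# The cumulative product alpha_bar_t = prod_s (1 - beta_s) measures how
# much of the original image survives after t steps: it is near 1 early
# on and near 0 by the end, for either schedule.
alpha_bar = np.cumprod(1.0 - betas_linear)
print(round(alpha_bar[0], 4), alpha_bar[-1] < 1e-3)  # → 0.9999 True
```

Whatever shape the schedule takes, what matters downstream is `alpha_bar`, which tells the network how corrupted the image at step `t` is.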

### How does a diffusion model recover the original image from a noisy one?

It trains a network to predict and remove the noise. During training, the network learns to predict the noise in images corrupted to varying degrees. At generation time, the model repeatedly predicts and removes noise, producing a slightly cleaner image at each iteration until a clear image emerges.

### How can a diffusion model generate images with specific content?

By conditioning the model. For example, a text description can be supplied as an extra input, and the model then generates images that match that description.

### What is classifier-free guidance?

Classifier-free guidance is a technique for strengthening the match between the generated image and the text prompt. The image is passed through the model twice, once with the text embedding and once without; the difference between the two noise predictions is amplified, steering generation towards the text.
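
The combination step fits in one line. This is a generic sketch (the variable names and the guidance scale are mine, not from the video): given noise predictions `eps_uncond` (no text) and `eps_cond` (with text), the guided prediction pushes further in the direction the text suggested:

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    # scale > 1 amplifies the difference the text embedding made;
    # scale == 1 just recovers the conditioned prediction.
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Tiny worked example with made-up numbers.
eps_u = np.array([0.0, 0.0])
eps_c = np.array([1.0, -1.0])
print(guided_noise(eps_u, eps_c, 7.5))  # → [ 7.5 -7.5]
```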

### How expensive is it to train a diffusion model?

Training is relatively expensive, requiring huge image datasets and substantial compute. However, free platforms such as Google Colab provide access to pretrained models.

### Is the code for diffusion models complicated?

It can be quite simple: some versions generate an image with a single Python function call. For deeper understanding and customisation, more detailed code exposes the full iteration loop and the injection of conditioning inputs.

### How are weights shared in a diffusion model?

For efficiency, the network weights used during generation are the same ones used to predict the noise: a single shared network is applied at every time step. This avoids redundant computation and speeds up generation.

### What are some practical uses of diffusion models?

Diffusion models can generate high-quality images such as artwork and synthetic photographs. They can also serve as image-processing tools, for example for denoising or image enhancement.

### Outlines

### 🎨 Generating Images with Diffusion Models

This section introduces image generation with diffusion models. Compared with traditional GANs, diffusion models work by gradually adding noise to images and training a network to reverse the process. This avoids problems such as mode collapse and simplifies generation into small iterative steps.

### 🔍 Noise Schedules and Training

This section discusses strategies for adding noise, including linear and ramped schedules. It explains how the network is trained to predict and remove noise, how the time step and noise amount guide training, and why predicting the noise rather than generating the image directly simplifies training.

### 📝 Conditional Generation and Text Embeddings

This section describes how conditioning steers the model towards specific content, for example by feeding in text embeddings so the network generates images on a given theme. It also introduces classifier-free guidance, which amplifies the network's response to the text so the output matches the description more closely.

### 💻 Accessibility and Cost

The final section discusses accessibility: although these models are expensive to train, free resources such as Stable Diffusion exist and can be run on platforms like Google Colab. The presenter describes his own experiments on Colab and notes how weight sharing keeps the compute requirements down.

### Keywords

### 💡Generative Adversarial Network (GAN)

### 💡Diffusion Models

### 💡Noise

### 💡Iteration

### 💡Conditional Generation

### 💡Text Embedding

### 💡Classifier-Free Guidance

### 💡Google Colab

### 💡Mode Collapse

### 💡Time Step

### 💡Embedding

### Highlights

Introduces the technique of generating images with diffusion models.

Reviews generative adversarial networks (GANs) as the previous standard approach to image generation.

Explains how the generator network in a GAN produces images.

Describes the role of the discriminator network in a GAN.

Discusses the mode-collapse problem in GAN training.

Shows how diffusion models simplify the image-generation process.

Explains strategies for adding noise in diffusion models.

Discusses training a network to reverse the noising process.

Describes the iterative denoising process that gradually removes noise.

Explains how conditioning the network generates images with specific content.

Shows how text embeddings influence the generated image.

Discusses how classifier-free guidance improves the output.

Mentions free access to diffusion models through Google Colab.

Highlights the weight-sharing mechanism in diffusion models.

Notes the ease of use: a single Python function call can generate an image.

### Transcripts

Generating images using diffusion. What is that? Right, so I should probably find out. It's things like DALL-E and DALL-E 2, Imagen from Google, and now Stable Diffusion as well. I've spent quite a long time messing about with Stable Diffusion, and I'm having quite a lot of fun with it. So what I thought I'd do is download the code, read the paper, work out what's going on, and then we can talk about it.

I delved into this code and realised there's actually quite a lot to these things. It's not so much that they're complicated; it's just that there are a lot of moving parts. So let's have a quick reminder of generative adversarial networks, which were, I suppose until now, the standard way of generating images, and then we can talk about how diffusion is different and why we're doing it this way.

Having a deep network trained to just produce the same image over and over again is not very interesting, so we have some random noise that we use to make it different each time. We have a very large generator network, which I'll just draw as a black box (a big neural network), that turns out an image which hopefully looks like the thing we're trying to produce: faces, landscapes, people.

Is that how those anonymous people on This Person Does Not Exist work?

Yeah, that's exactly how they work. I think that one uses StyleGAN, and it's that exact idea: it's trained on a large corpus of faces and it just generates faces at random, or at least mostly at random. The way we train this is we have millions and millions of pictures of whatever we're trying to produce. We give the generator noise, it produces an image, and we have to tell it whether that's good or bad; we need to give the network some signal on whether the image actually looks like a face, otherwise it isn't going to train. So we have another network, which is sort of the opposite, and it says: is this a real or a fake image? Half the time we give it fake images and half the time real faces. This discriminator trains and gets better at telling the fake images apart from the real images in the training set, and in doing so forces the generator to get better at faking them, and so on and so forth. The hope is that they both just get better and better.

Now, that kind of works. The problem is that GANs are very hard to train. You get problems like mode collapse, where the generator just produces the same face: if it produces one face that fools the discriminator every time, there's not much incentive for it to do anything more interesting, because it has solved its problem; it's beaten the discriminator, let's move on. If you're not careful with your training process, these kinds of things can happen. And intuitively, it's quite difficult to go from a bit of noise to a really beautiful, high-resolution image in one step without some oddities creeping in.

So in diffusion models, what we're going to do is simplify this into a kind of iterative, small-step process, where the work the network has to do each time is slightly smaller, and you just run it a number of times. Let's start again on a fresh sheet of paper so we can clean things up a bit.

We've got an image; let's say it's a picture of a rabbit. We add some noise, so we've got the same rabbit with a bit of noise on it (not speckly noise, but I can't draw Gaussian noise). Then we add another bit of noise: same rabbit, a bit more noise. We keep going, and eventually we end up with something that is just noise; it looks like nonsense. So the question is: how do we craft a training algorithm, and what we call inference, so that we can actually deploy a network that undoes this process?

The first question is how much noise to add. Why not just add loads of noise in one go, forget the intermediate images, and say "give me the original back"? Then you'd have a pair of training examples you could use. The answer is that it would kind of work, but it's a very difficult job, and you're in the same situation as with the GAN: trying to do everything in one go. The intuition is that it's easier to go from one step to the previous one, removing just a little bit of noise, and then from that one back another step, a little bit more.

In traditional image processing there are noise-removal techniques; it's not difficult to do that, is it?

No. Well, it's difficult in the sense that you don't know what the original image was. What we're trying to do is train a network to undo this process. That's the idea, and if we can do that, then we can start with random noise, a bit like a GAN does, iterate the process, and produce an image. Now, there are a lot of missing parts here, so we'll start building up the complexity a little bit.

OK, so the first thing is to go back to our question of how much noise we add. We could add a small amount of noise, then the same amount again, and again, and keep going until we have what is essentially random noise at the end. That would be what we'd call a linear schedule: the same amount of noise at each step. It's not interesting, but it works. Alternatively, you could add very little noise at the beginning and ramp up the amount later. There are different strategies, depending on which paper you read, for the best way to add noise, but the mapping from step to noise amount is called the schedule.

The idea is you have a schedule that says: this is the image at time T = 0, this is T = 1, and so on up to some final capital T, the total number of steps; this end represents essentially all noise, this end only some amount of noise, and you can change how much each step adds. The nice thing is that, because Gaussians add together very conveniently, you can jump straight to any step: if you want T = 7 you don't have to produce all the intermediate images, you just add exactly the right total amount of noise in one go and hand that to the network. So when you train, you can take random images from the training set, add random amounts of noise according to the schedule (with the time step varying randomly between 1 and T), and say: here's a really noisy image, undo it; here's a slightly less noisy image, undo it.
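
Because the Gaussians compose, the "jump straight to step t" trick can be sketched in a few lines of NumPy. This is my own toy version, with an assumed linear schedule and a random array standing in for the rabbit:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)     # fraction of signal surviving after t steps

def noise_image(x0, t):
    """Noise a clean image straight to step t; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = rng.standard_normal((8, 8))        # stand-in for the rabbit image
xt, eps = noise_image(x0, t=500)

# Knowing the exact noise, the clean image is exactly recoverable, which
# is what makes the triple (x_t, t, eps) a ready-made training example.
x0_back = (xt - np.sqrt(1.0 - alpha_bar[500]) * eps) / np.sqrt(alpha_bar[500])
```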

So what you do is take your noisy image (I'm going to keep drawing this rabbit; it's taller than it was before) at some time, let's say T = 5. You have a giant U-Net-shaped network; we've talked about encoder-decoder networks before, and there's nothing particularly surprising about this one. You also feed in the time step, because if you're running a funny schedule where different times carry different amounts of noise, the network needs to know where it is, so it knows whether it has to remove a lot of noise this time or just a little.

What should it produce? We could go the whole hog and have it produce the original rabbit image directly, but then it has to get from heavy noise all the way back to the rabbit in one jump, which is a little bit difficult. Mathematically, it works out easier if we instead predict the noise: we want to know what noise was added to this image, such that subtracting it would get back to the original. That's all the noise from steps 1, 2, 3, 4 and 5, so the output should be pure noise, with no rabbit in it; that's the hope. Theoretically, you could then subtract that from the noisy image and get the rabbit back. If you did that from the very noisy end, the result would be a little iffy, because predicting all the noise back to the rabbit is quite difficult; from a less noisy image, it's maybe not so difficult.

We want to predict the noise. One option would be to predict, at time T = 5, just the noise that takes us back to T = 4, then T = 3, then T = 2. The problem with that is you're stuck with the exact time steps of the training schedule: if you trained with a thousand time steps, now you've got to run a thousand time steps at inference, and you can't speed it up. So instead we say: whatever time step you're at, predict all the noise in the image, the full amount that could be subtracted to get back to the original.

And that's what we do. During training, we pick a random source image and a random time step, and we add the scheduled amount of noise for that step. We feed the noisy image and the time step T into the network and ask: what was the noise we just added? Now, we haven't given it the original image; that's what's difficult about this. We hold the clean image back, add some noise, and ask for that noise. We can do this very easily, because we've got millions or billions of images in the data set: add random noise, ask what it was, and over time the network builds up a picture of what that noise looks like.
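
That training recipe (random image, random step, add noise, ask for the noise back) fits in a short loop. This is a sketch under heavy assumptions: a plain linear map stands in for the U-Net, and the "dataset" is random arrays, purely so the loop runs end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

data = rng.standard_normal((100, 64))   # toy "dataset" of flat 8x8 images
W = np.zeros((64, 64))                  # stand-in "network": one linear map
lr = 1e-2

for step in range(500):
    x0 = data[rng.integers(len(data))]          # random source image
    t = rng.integers(1000)                      # random time step
    eps = rng.standard_normal(64)               # the noise we add...
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    pred = W @ xt                               # ...and try to predict back
    # Mean-squared error between predicted and true noise drives learning.
    grad = 2.0 / 64 * np.outer(pred - eps, xt)  # gradient of MSE w.r.t. W
    W -= lr * grad
```

In the real thing, `W @ xt` is replaced by a U-Net that also receives `t`, and the update is done by an optimiser over minibatches; the structure of the loop is the same.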

So it sounds like a really good plug-in for Photoshop or something: a noise-removal plug-in. How does that turn into creating new images?

Yeah, in some sense that's the clever bit: how we use this noise-predicting network to undo the noise. We've got a network which, given an image with some noise added and a time step representing roughly how much noise that is (where we are in the noising process), produces an estimate of the total noise in the image. Theoretically, if we subtract that estimate, we get back the original image. But that is not a perfect process: the network isn't perfect, so if you give it an incredibly noisy image and take away what it predicts, you'll get, at best, a vague shape. So what we want to do is take it a little bit more slowly.

We take the predicted noise and subtract it from our image to get an estimate of what the original, T = 0, image was. It's not going to look very good the first time. But then we add a bunch of the noise back again and land at a time slightly below the one before: if this was T = 10, maybe we add back nine tenths of the noise and arrive at roughly T = 9. Now we have a slightly less noisy image and we can repeat the process: put the slightly less noisy image in, predict the way back to T = 0, add back most but not all of the noise, and go round again.

Each time round the loop we get a little bit closer to the original image. It was very difficult to predict the noise at T = 10; it's slightly easier at T = 9, and very easy at T = 1, because the image is mostly there with only a little bit of noise on it. So if we just feel our way towards it, taking off little bits of noise at a time, we can actually produce an image. You start off with a noisy image, you predict all the noise, remove it, and add back most of it; at each step you have an estimate of what the original image was and a next image that is just a little bit less noisy than the one before, and you loop this a number of times. That's basically how the image-generation process works: you take your noisy image, you loop, and you gradually remove noise until you end up at what the network thinks was the original image. And because you're doing this by predicting the noise and taking it away, rather than spitting out an image with less noise directly, it mathematically works out a lot easier to train, and it's a lot more stable than a GAN.
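
The loop just described can be sketched as follows. The noise predictor here is a deliberately dumb stub (it treats the whole input as noise), standing in for the trained U-Net, so the structure of the loop runs but the "image" it produces is meaningless:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def predict_noise(xt, t):
    # Stub for the trained network: pretend everything in x_t is noise.
    return xt / np.sqrt(1.0 - alpha_bar[t])

x = rng.standard_normal((8, 8))          # start from pure noise
for t in range(T - 1, 0, -1):
    eps = predict_noise(x, t)
    # Estimate of the clean image implied by the noise prediction.
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    # Step down to t-1: keep the estimate, add back slightly less noise.
    x = (np.sqrt(alpha_bar[t - 1]) * x0_hat
         + np.sqrt(1.0 - alpha_bar[t - 1]) * rng.standard_normal(x.shape))
# With a real trained network in place of the stub, x would now be a
# generated image.
```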

There's an elephant in the room here.

There is.

You're talking about how to make random images, effectively. How do we direct this?

So that's where the complexity starts ramping up. We've got a structure where we can train a network to produce random images, but it's not guided: there's no way of saying "I want a frog-rabbit hybrid" (which I've done, and it's very weird). So how do we do that? The answer is that we condition this network, that's the word we would use: we basically give it access to the text as well.

All right, so let's actually run inference on an image on my piece of paper, bearing in mind the output is going to be hand-drawn by me, so it's going to be terrible. You start off with a random noise image: just an image you've generated by taking random Gaussian noise. Mathematically this is centred around zero, so you have negative and positive numbers; you don't go from 0 to 255, because it's just easier for the network to train. You put in your time step: let's say you're going to do 50 iterations, so we put in a time step right at the end of our schedule, time step 50, our most-noised image, and you pass it through the network and say: estimate the noise. We also take our string, which is "frogs on stilts".

I'll have to try that later.

We could spend, let's say, another 20 or 30 minutes producing frogs on stilts. We embed this string using our GPT-style transformer embedding, stick that in as well, and the network produces an estimate of how much noise it thinks is in that image. That estimate at T = 50 is going to be a bit average: it's not going to produce you a frog-on-stilts picture, it's going to produce something like a grey image or a brown image, because that is a very, very difficult problem to solve. However, if you subtract this noise from this image, you get your first estimate of the final image, and when you add back a bunch of noise you get to T = 49. So now we've got slightly less noise, and maybe the vaguest outline of a frog on a stilt. This is T = 49; you take your embedding, put it in as well, and you get another, maybe slightly better, estimate of the noise in the image. And then we loop; it's a for loop, we've done those before. You take the output, subtract it, add noise back, and repeat the process, and you keep feeding in the text embedding.

Now, there's one final trick they use to make things a little bit better. If you do this, you will get a picture that maybe looks slightly frog-like, maybe there's a stilt in it, but it won't look anything like the images you see on the internet produced by these tools, because they do another trick to make the output even more tied to the text. What you do is something called classifier-free guidance: you actually put the image through twice, once including the embeddings of the text and once without. The network is maybe slightly better at estimating the noise when it has the text, so of the two predictions, the one with no embedding is maybe slightly more random noise, and the one with the embedding is slightly more frog-like, or at least slightly closer to the right thing. We can calculate the difference between these two noise predictions, amplify that signal, and feed it back. Essentially we say: if this network wasn't given any information on what was in the image, and then this version was, what's the difference between those two predictions, and can we amplify that difference as we loop, to really target this kind of output? The idea is that you're really forcing this loop to point in the direction of the scene we want. That's called classifier-free guidance. It is somewhat of a hack bolted on to the end of the network, but it does work. If you turn it off, which I've done, it produces vague sorts of structures that kind of look right; it's not terrible. I think I asked for a Muppet cooking in the kitchen and it just produced me a picture of a generic kitchen with no Muppet in it. But if you do this, you suddenly really are targeting what you want.

Standard question, got to ask it: is this something people can play with, without just going to one of these websites and typing in some words?

Well, yeah. The thing is, it costs hundreds of thousands of dollars to train one of these networks, because of how many images they use and how much processing power they use. The good news is that there are models like Stable Diffusion that are available to use for free, and you can run them through things like Google Colab. I did this through Google Colab and it works really, really well, and maybe we'll talk about that in another video, where we delve into the code and see all of these bits happening within it. I blew through my free Google allowance very, very quickly; I had to pay my eight pounds for premium Google access. So, you know, never let it be said that I spare any expense; I spare no expense on Computerphile getting access to proper compute hardware.

Could Beast do something like that?

It could, yeah; almost all of our servers could. I'm just a bit lazy and haven't set them up to do so. But actually the code is quite easy to run: with the entry-level version of the code you literally can just call one Python function and it will produce you an image. I'm using code which is perhaps a little bit more detailed: it's got the full loop in it, and I can go in and inject things and change things so I can understand it better, and we'll talk through that next time.

The only other interesting thing about these neural networks is that the weights here, here and here are shared, so they are the same, because otherwise this one here would always be...

...the time to make one sandwich, but you've got two people doing it, so they make twice as many sandwiches each time they make a sandwich. Same with the computer: we could either make the computer's processor faster, or...
