How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile

Computerphile
4 Oct 2022 · 17:50

Summary

TL;DR: The video walks through how diffusion models generate images. Compared with traditional generative adversarial networks (GANs), diffusion models build images by iteratively adding and removing noise, which makes training more stable and easier to control. By combining text embeddings with classifier-free guidance, a diffusion model can generate images that match a text prompt, although the raw output may still need refinement. Training such models is expensive, but free resources such as Stable Diffusion exist, and users can experiment through platforms like Google Colab.

Takeaways

  • 🖼️ Diffusion models are a newer approach to image generation; unlike generative adversarial networks (GANs), they build an image by progressively removing noise.
  • 🔍 The core idea is to gradually turn an image into noise, then train a neural network to reverse that process and generate new images (see the formula after this list).
  • 🎨 During training, images are perturbed with varying amounts of noise; the plan for how much noise is added at each step is called the noise schedule.
  • 🤖 The neural network learns to predict the noise that was added to an image so it can be removed; running the trained network to generate images is called inference.
  • 📈 Training takes huge amounts of image data and compute, but free platforms such as Google Colab now let individual users experiment with pretrained models.
  • 📚 By feeding text embeddings in during training, a diffusion model can generate images that match a text description; this is called conditional generation.
  • 🌐 Diffusion models can produce high-quality images, but generation takes many iterations, each one moving a little closer to a clean image.
  • 🔧 The weights are shared across timesteps: the same network is applied at every denoising step, which keeps the model efficient.
  • 💡 A key technique is classifier-free guidance, which compares the noise predictions made with and without the text embedding and amplifies the difference to better match the description.
  • 🚀 Although technically involved, entry-level implementations can generate an image from a single Python function call.
  • 💸 Using diffusion models can be costly, but free resources make it possible for individual users to experiment and learn.
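
As a point of reference (standard DDPM notation, not a formula from the video itself): writing beta_t for the noise variance added at step t, the forward noising process has the closed form

    q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\, I\right),
    \qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)

which is what lets training jump straight to any timestep without simulating every intermediate step.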

Q & A

  • What is a diffusion model?

    - A diffusion model is a generative model that turns a data point (such as an image) into random noise by adding noise step by step, then learns to reverse that process so it can generate a clean data point starting from random noise. This approach has proved very effective at generating high-quality images.

  • How does a generative adversarial network (GAN) work?

    - A GAN has two parts: a generator and a discriminator. The generator's job is to produce images that look like real data, and the discriminator's job is to judge whether an image is real or was produced by the generator. The two networks compete during training: the generator tries to produce ever more convincing images, while the discriminator gets better at telling real from fake.
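
As a rough illustration of that adversarial loop, here is a toy sketch of one training step (not the video's code; the tiny two-layer networks and 2-D "images" are stand-ins):

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))  # noise -> "image"
    D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # "image" -> real/fake logit
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    real = torch.randn(32, 2) + 3.0          # stand-in for real training data
    fake = G(torch.randn(32, 16))            # generator output from random noise

    # Discriminator step: push real towards "1", fakes towards "0".
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator call the fakes real.
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()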

  • What advantages do diffusion models have over GANs?

    - Diffusion models are easier and more stable to train than GANs, and they can generate high-resolution images. GANs can suffer from problems such as mode collapse during training; because a diffusion model builds its image through many small denoising steps, it avoids these issues.

  • How is the amount of added noise decided in a diffusion model?

    - The amount of noise added follows a predefined schedule, which specifies how much noise is added at each step. It can be linear (the same amount each step) or non-linear, for example adding only a little noise at the start and ramping it up later.
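
A minimal sketch of such a schedule (assuming the standard DDPM-style notation; the 1e-4 to 0.02 range is the commonly cited linear schedule, not a value given in the video):

    import numpy as np

    T = 1000
    betas = np.linspace(1e-4, 0.02, T)    # noise variance added at each step (linear schedule)
    alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal fraction remaining at step t

    def noise_image(x0, t, rng=np.random.default_rng()):
        """Jump straight to timestep t: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
        eps = rng.standard_normal(x0.shape)
        return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps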

  • How does a diffusion model recover the original image from a noisy one?

    - The model is trained to predict the noise so that it can be removed: during training it learns to estimate, from an image carrying some amount of noise, the noise that was added to it. At generation time the model repeatedly predicts and removes noise; each iteration yields a slightly less noisy image, until a clean image finally emerges.
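
The loop that answer describes might be sketched like this (illustrative only: predict_noise stands in for the trained U-Net and just returns zeros so the sketch runs):

    import numpy as np

    rng = np.random.default_rng(0)
    T = 50
    alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
    predict_noise = lambda x_t, t: np.zeros_like(x_t)  # stand-in for the trained network

    x = rng.standard_normal((64, 64, 3))               # start from pure noise
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)                  # estimate ALL the noise in x
        # estimate of the clean image implied by that noise prediction
        x0_hat = (x - np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
        if t > 0:                                      # re-noise down to step t-1
            eps = rng.standard_normal(x.shape)
            x = np.sqrt(alpha_bars[t-1]) * x0_hat + np.sqrt(1 - alpha_bars[t-1]) * eps
        else:
            x = x0_hat                                 # final, clean estimate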

  • How do you make a diffusion model generate an image with specific content?

    - By conditioning the model. For example, a text description can be fed in as an extra input alongside the noisy image, and the model will generate an image that matches the description.
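
Mechanically, conditioning just means the noise predictor takes the embedding as a third input next to the noisy image and the timestep (a sketch; the names and the 77x768 CLIP-style embedding shape are illustrative assumptions, not the Stable Diffusion API):

    import numpy as np

    def predict_noise(x_t, t, text_emb):
        # A real U-Net would attend to text_emb through cross-attention layers;
        # this stand-in returns zeros so the sketch is runnable.
        return np.zeros_like(x_t)

    text_emb = np.zeros((77, 768))  # e.g. a CLIP-style embedding of "frogs on stilts"
    x_t = np.random.default_rng(0).standard_normal((64, 64, 3))
    eps_hat = predict_noise(x_t, t=10, text_emb=text_emb)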

  • What is classifier-free guidance?

    - Classifier-free guidance is a technique for making the generated image match the text description more closely. The image is passed through the model twice, once with the text embedding and once without; the difference between the two noise predictions is then amplified, steering the generation process towards the description.
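
In code, the combination step is essentially one line (a sketch; the default scale of 7.5 is a value commonly used in Stable Diffusion front-ends, not one stated in the video):

    def guided_noise(predict_noise, x_t, t, text_emb, empty_emb, scale=7.5):
        eps_uncond = predict_noise(x_t, t, empty_emb)  # prediction with no text information
        eps_cond = predict_noise(x_t, t, text_emb)     # prediction conditioned on the prompt
        # amplify the difference between the two predictions
        return eps_uncond + scale * (eps_cond - eps_uncond)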

  • How expensive is it to train a diffusion model?

    - Training is relatively expensive, because it needs enormous amounts of image data and compute. However, free platforms such as Google Colab provide access to pretrained models.

  • How complex is the code for a diffusion model?

    - It can be quite simple: some versions generate an image with a single Python function call. For deeper understanding and customisation, more detailed code exposes the full iteration loop and the injection of conditioning inputs.
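
For instance, the Hugging Face diffusers library wraps the whole loop behind one call (the model name and arguments below follow its public documentation and are an assumption, not code shown in the video):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe("a frog standing on stilts",
                 num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("frog_on_stilts.png")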

  • How are weights shared in a diffusion model?

    - The same network, with the same weights, is used at every denoising step; the timestep is passed in as an extra input so that a single network can handle every noise level. This keeps the model compact and efficient.

  • What are possible practical uses of diffusion models?

    - Diffusion models can generate high-quality images such as artwork and synthetic photographs. They can also serve as image-processing tools, for example for noise removal or image enhancement.

Outlines

00:00

🎨 Generating images with diffusion models

This section introduces the idea of generating images with diffusion models. Compared with traditional generative adversarial networks (GANs), a diffusion model adds noise to images step by step and trains a network to reverse the process. This avoids problems such as mode collapse in GANs and simplifies image generation into small, iterative steps.

05:00

🔍 Noise schedules and the training process

This section discusses strategies for adding noise in diffusion models, including adding it linearly or ramping it up, and explains how the network is trained to predict and remove the noise, with the timestep and noise amount guiding the training. It also covers why predicting the noise, rather than generating the clean image directly, simplifies training.
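
One training step as described here might be sketched as follows (assumed DDPM-style notation; the zero-returning predictor is a stand-in for the U-Net):

    import numpy as np

    rng = np.random.default_rng(0)
    T = 1000
    alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
    predict_noise = lambda x_t, t: np.zeros_like(x_t)  # stand-in for the U-Net

    x0 = rng.standard_normal((64, 64, 3))  # a training image (normalised around zero)
    t = rng.integers(0, T)                 # random timestep
    eps = rng.standard_normal(x0.shape)    # the noise we will ask the network for
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    loss = np.mean((predict_noise(x_t, t) - eps) ** 2)  # MSE against the true noise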

10:01

📝 Conditional generation and text embeddings

This section describes how conditioning steers a diffusion model towards specific content, for example by feeding in a text embedding so the network generates images of a particular subject. It also introduces classifier-free guidance, which amplifies the network's response to the text so the output matches the description more closely.

15:02

💻 Accessibility and cost of diffusion models

The final section discusses accessibility: although these models are expensive to train, free resources such as Stable Diffusion exist and can be run on platforms like Google Colab. The presenter describes his own experience running the code on Colab, and notes that sharing weights across timesteps keeps the compute requirements manageable.

Keywords

💡 Generative adversarial network (GAN)

A generative adversarial network (GAN) is a deep learning model made up of a generator and a discriminator. The generator produces data, and the discriminator judges whether data is real or generated. In the video, GANs are the previous standard method for generating images such as faces and landscapes. Training repeatedly improves both networks until the generator can produce images good enough to fool the discriminator.

💡 Diffusion models

A diffusion model is a generative model that simulates a diffusion process by gradually adding noise to data, then learns the reverse process, gradually removing noise, to generate new data. In the video, diffusion models generate images by iterating from noise towards a clean picture. Compared with GANs, training is more stable, and high-resolution images are easier to achieve.

💡 Noise

In the video, noise refers to random perturbation added to an image, a key ingredient of diffusion models. By adding different amounts of noise to images, the model learns how to recover the original image from its noisy version. This adding and removing of noise is the foundation of generating new images.

💡 Iteration

Iteration is how a diffusion model generates an image: a repeated loop in which the model removes some of the noise from the current image and estimates what the original looks like. Over many iterations the model gets closer and closer to a clean final image.

💡 Conditional generation

Conditional generation means feeding extra information (such as text or labels) into a generative model to steer what it produces. In the video, embedding text into the model guides the diffusion process to generate images matching a description, such as "a frog standing on stilts".

💡 Text embedding

A text embedding converts text into a numeric vector that captures its semantic content. In the video, text embeddings turn descriptions such as "frogs on stilts" into a form the model can understand, so they can guide image generation during conditioning.

💡 Classifier-free guidance

Classifier-free guidance is a technique for improving how closely generated images follow the conditioning information (such as a text description). It compares the results of running the model with and without the conditioning information and amplifies the difference. In the video it is used to make the generated images match the text much more closely.

💡 Google Colab

Google Colab is a cloud platform that lets users run Python code in the browser, with free access to GPU resources. In the video it is used to run diffusion models, so users can experience image generation without expensive hardware.

💡 Mode collapse

Mode collapse is a failure mode of generative models (GANs in particular) in which the model concentrates on producing one or a few specific outputs and ignores the rest. In the video's example, this would mean a face generator producing overly similar or repeated faces with little diversity.

💡 Time step

In a diffusion model, a time step marks a stage in the noise-adding and noise-removing process. The model predicts and removes noise at each time step, gradually recovering the original image. How time steps are chosen and scheduled strongly affects training and the quality of the generated images.

💡 Embedding

In the video, an embedding is the conversion of text (or other data) into a numeric vector that a neural network can process. These vectors capture the key features of the original data and serve as model inputs to generate images matching those features.

Highlights

Introduces the technique of generating images with diffusion models.

Reviews generative adversarial networks (GANs) as the standard method for image generation.

Explains how the generator network in a GAN produces images.

Describes the role of the discriminator network in a GAN.

Discusses the mode-collapse problem in GAN training.

Introduces how diffusion models simplify the image-generation process.

Explains noise-adding strategies (schedules) in diffusion models.

Discusses how a network is trained to reverse the noise-adding process.

Describes the iterative process in diffusion models, removing noise step by step.

Explains how conditioning the network generates images with specific content.

Introduces how text embeddings influence the generated image.

Discusses how classifier-free guidance improves the image output.

Mentions free access to diffusion models through Google Colab.

Highlights the weight-sharing mechanism across timesteps.

Notes the ease of use: a single Python function call can generate an image.

Transcripts

00:00

Generating images using diffusion: what is that? Right, so I should probably find out. It's things like DALL-E and DALL-E 2, yeah, Imagen from Google, Stable Diffusion now as well. I've spent quite a long time messing about with Stable Diffusion and I'm having quite a lot of fun with it. So what I thought I'd do is download the code, read the paper, work out what's going on, and then we can talk about it.

00:25

I delved into this code and realised there's actually quite a lot to these things. It's not so much that they're complicated; it's just that there are a lot of moving parts. So let's have a quick reminder of generative adversarial networks, which were, I suppose until now, the standard way of generating images, and then we can talk about how diffusion is different and why we're doing it this way.

00:45

Having a deep network trained to just produce the same image over and over again is not very interesting, so we have some random noise that makes the output different each time, and we have a very large generator network, which I'm going to treat as a black box: a big neural network that turns that noise into an image that hopefully looks like the thing we're trying to produce: faces, landscapes, people. (Is that how those anonymous faces on This Person Does Not Exist work?) Yeah, that's exactly how they work; I think that one uses StyleGAN, and it's that exact idea. It's trained on a large corpus of faces and it just generates faces at random, or at least mostly at random.

01:29

The way we train this is we have millions and millions of pictures of the thing we're trying to produce. We give the generator noise, it produces an image, and we have to tell it whether that image is good or bad; otherwise it's not going to train. So we have another network, which is sort of the opposite: it says, is this a real image or a fake one? Half the time we give it fake images and half the time real faces, so it trains and gets better at discriminating between the fakes produced by the generator and the real images from the training set. In doing so, the generator has to get better at faking them, and so on and so forth, and the hope is that they both just keep getting better.

02:14

Now, that kind of works; the problem is that they're very hard to train. You have a lot of problems with things like mode collapse, where the generator just produces the same face: if it produces a face that fools the discriminator every time, there's not a lot of incentive for it to do anything more interesting, because it has solved the problem. If you're not careful with your training process, these kinds of things can happen. And intuitively, it's quite difficult to go from a bit of noise straight to a really beautiful, high-resolution image without some oddities, some things that go a bit wrong. So what we're going to do in diffusion models is try to simplify this process into a kind of iterative, small-step situation, where the work the network has to do at each step is slightly smaller, and you just run it a number of times to build up the image.

03:03

We'll start again on the paper so we can clean things up a bit. We've got an image; let's say it's an image of a rabbit. We add some noise, so we've got the same rabbit with some noise on it (it's not really speckly noise, but I can't draw Gaussian noise). Then we add another bit of noise: same shape of rabbit, a bit more noise. We keep going, and eventually we end up with just noise; it looks like nonsense. So the question is: how do we craft a training algorithm, and the inference to go with it, so we can deploy a network that undoes this process?

03:43

The first question is how much noise to add. Why not just add loads of noise in one go, delete all the intermediate images, and say "give me back the original"? Then you've got a pair of training examples you could use. The answer is that it will kind of work, but that's a very difficult job, and you're back in the same problem as the GAN: you're trying to do everything in one go. The intuition is that it's maybe slightly easier to go from one step back to the previous one, removing just a little bit of noise at a time. (Well, in traditional image processing there are noise-removal techniques; it's not that difficult, is it?) It's difficult in the sense that you don't know what the original image was. What we're trying to do is train a network to undo this process. If we can do that, we can start with random noise, a bit like the GAN did, and just iterate the process to produce an image. There are a lot of missing parts here, so we'll start building up the complexity a little bit.

04:40

So let's go back to our question of how much noise to add. We could add a small amount of noise, then the same amount again, and again, and keep adding it until we have essentially random noise. That would be what we'd call a linear schedule: the same amount of noise each time. It's not very interesting, but it works. The other thing you could do is add very little noise at the beginning and then ramp up the amount you add later; there are different strategies depending on which paper you read about the best approach for adding noise, but it's called the schedule. The idea is you have a schedule that says: this is the image at time t = 0, this is t = 1, and so on, up to some capital T, which is the final number of steps. That last step represents essentially all noise, the intermediate ones some amount of noise, and you can change how much each step gets. The nice thing is that, because Gaussians add together very nicely, you can say "I want t = 7" and jump straight there: you don't have to produce all the intermediate images, you just add exactly the right amount of noise and hand that to the network.

05:54

So when you train this, you can give it random images from your training set with random amounts of noise added according to the schedule, varying randomly between 1 and T, and say: here's a really noisy image, undo it; here's a slightly less noisy image, undo it. You take your noisy image at some time, let's say t = 5 (I'm going to keep going with this rabbit; it's taller than it was before). You have a giant U-Net-shaped network; we've talked about encoder-decoder networks before, and there's nothing particularly surprising about this one. And you also put in the time, because if you're running a schedule where different times carry different amounts of noise, you need to tell the network where it is, so it knows whether it has to remove a lot of noise this time or just a little.

06:44

What do we produce as output? We could go the whole hog and just produce the original rabbit image, but then you've got a situation where the network has to go from heavy noise all the way back to the rabbit, which is a bit difficult. Mathematically it works out a little easier if we just try to predict the noise: what is the noise that was added to this image, which you could use to get back to the original? That's all the noise from t = 1, 2, 3, 4 and 5, so ideally you get just noise out, with no rabbit in it; that's the hope. Theoretically you could then subtract that from the noisy image and get the rabbit back. If you did that from a very noisy image you'd find it's a little bit iffy, because predicting all the noise back to the rabbit is quite difficult; from a less noisy image it's not so hard. We want to predict the noise, so what we could do is predict, at, say, t = 5, the noise that takes us back to t = 4, then t = 3, then t = 2. The problem if you do that is that you're stuck doing the exact timesteps of the schedule you trained with: if you trained with a thousand timesteps, you've now got to run a thousand timesteps at inference, and you can't speed it up. So what we try to do instead is say: whatever timestep you're at, you've got some amount of noise; predict all the noise in the image and give me back the noise I can take away to get back to the original image. And that's what we do.

08:07

During training we pick a random source image, we pick a random timestep, and we add the scheduled amount of noise for that timestep. So we have a noisy image and a timestep t; we put those into the network and we ask: what was the noise that we just added to that image? We haven't given it the original image; that's what's difficult about this. We have the clean original, which we're not showing it; we added some noise, and we want that noise back. We can do that very easily, because we've got millions, or billions, of images in our dataset: we add random bits of noise and ask "what was that noise?", and over time the network starts to build up a picture of what that noise is.

08:53

(So it sounds like a really good plug-in for Photoshop or something: a noise-removal plug-in. How does that turn into creating new images?) In some sense that's the clever bit: how we use this network that predicts noise to undo the noise. We've got a network which, given an image with some noise added and a timestep that represents roughly how much noise that is, or where we are in the noising process, produces an estimate of what that noise is in total. Theoretically, if we take that noise away, we get back to the original image. Now, that is not a perfect process: the network isn't going to be perfect, and if you give it an incredibly noisy image and take away what it predicts, you'll get maybe a vague shape. So what we want to do is take it a little more slowly. We take the predicted noise and subtract it from our image to get an estimate of the original image at t = 0. It's not going to look very good the first time. But then we add a bunch of the noise back and get to a t slightly less than before: maybe this was t = 10, we add back about nine tenths of the noise, and we get to roughly t = 9. Now we have a slightly less noisy image and we can repeat the process: put the slightly less noisy image in, predict how to get back to t = 0, add back most but not all of the noise, and repeat. Each time we loop, we get a little bit closer to the original image. It was very difficult to predict the noise at t = 10; it's slightly easier at t = 9, and very easy at t = 1, because by then it's mostly image with a little bit of noise on it. So if we just feel our way towards it, taking off little bits of noise at a time, we can actually produce an image.

11:01

So you start off with a noisy image, you predict all the noise and remove it, and then you add back most of it. At each step you have an estimate of what the original image was, and a next image that's just a little less noisy than the one before, and you loop this a number of times. That's basically how the image-generation process works: you take your noisy image, you loop, and you gradually remove noise until you end up at what the network thinks the original image was. And you're doing this by predicting the noise and taking it away, rather than spitting out an image with less noise directly; that works out mathematically a lot easier to train, and a lot more stable, than a GAN.

11:43

(There's an elephant in the room here: you're talking about how to make random images, effectively. How do we direct this?) That's where the complexity starts ramping up. We've got a structure where we can train a network to produce random images, but it's not guided: there's no way of saying "I want a frog-rabbit hybrid" (which I've done, and it's very weird). So how do we do that? The answer is that we condition this network; that's the word we'd use. We basically give it access to the text as well.

12:10

So let's actually run inference on an image, on my piece of paper (bearing in mind the output is going to be hand-drawn by me, so it's going to be terrible). You start off with a random noise image, generated as random Gaussian noise. Mathematically this is centred around zero, so you have negative and positive values; you don't go from 0 to 255, because it's just easier for the network to train that way. You put in your timestep: say you're going to do 50 iterations, so you put in a timestep right at the end of the schedule, timestep 50, our most-noised image. You pass it through the network and say: estimate the noise. We also take our string, "frogs on stilts" (I'll have to try that later; we could spend, let's say, another 20 or 30 minutes producing frogs on stilts). We embed it using a GPT-style transformer embedding, and we stick that in as well, and the network produces an estimate of how much noise it thinks is in that image.

13:16

That estimate at t = 50 is going to be a bit average: it's not going to produce a frog-on-stilts picture, it's going to produce something like a grey image or a brown image, because that is a very, very difficult problem to solve. However, if you subtract that noise from the image, you get your first estimate of what the final image is; and when you add back a bunch of noise, you get to t = 49. So now we've got slightly less noise, and maybe the vaguest outline of a frog on a stilt. At t = 49 you put the image in, and the embedding as well, and you get another, maybe slightly better, estimate of the noise in the image. And then we loop; it's a for loop, we've done those before. You take the output, subtract it, add noise back, and repeat the process, and you keep feeding in the text embedding.

14:10

Now, there's one final trick they use to make things a little better. If you just do this, you'll get a picture that maybe looks slightly frog-like, maybe with a stilt in it, but it won't look anything like the images you see on the internet produced by these tools, because they do another trick to tie the output even more closely to the text: something called classifier-free guidance. You actually put the image through twice: once you include the embeddings of the text, and once you don't. The network is maybe slightly better at estimating the noise when it has the text. So you put in two versions: this one with the embedding, this one with no embedding. The no-embedding prediction is maybe slightly more random, and the embedded one slightly more frog-like, slightly moving towards the right thing. We can calculate the difference between those two noise predictions and amplify that signal, then feed it back. What we essentially ask is: if this network wasn't given any information about what was in the image, and then this version was, what's the difference between those two predictions, and can we amplify that as we loop, to really target this kind of output? The idea is that you're forcing the loop to point in the direction of the scene you want. That's classifier-free guidance. It is somewhat of a hack on the end of the network, but it does work. If you turn it off, which I've done, it produces vague sorts of structures that kind of look right; it's not terrible. I think I did "a Muppet cooking in the kitchen" and it just produced me a picture of a generic kitchen with no Muppet in it. But if you do this, then suddenly you're really targeting what you want.

15:56

(Standard question, got to ask it: is this something people can play with, without just going to one of these websites and typing some words?) Well, yeah. The thing is, it costs hundreds of thousands of dollars to train one of these networks, because of how many images and how much processing power they use. The good news is that there are ones like Stable Diffusion that are available to use for free, and you can use them through things like Google Colab. I did this through Google Colab and it works really, really well; maybe we'll talk about that in another video, where we delve into the code and see all of these bits happening within it. I blew through my free Google allowance very quickly and had to pay my eight pounds for premium Google access. (Eight pounds?!) Thank you, yeah. So never let it be said that I spare any expense getting Computerphile access to proper compute hardware.

17:03

(Could your machines do something like that?) They could, yeah; almost all of our servers could. I'm just a bit lazy and haven't set them up to do so. But actually the code is quite easy to run: with the entry-level version of the code, you can literally just call one Python function and it will produce you an image. I'm using code which is perhaps a little more detailed: it's got the full loop in it, and I can go in and inject things and change things so I can understand it better. We'll talk through that, perhaps next time.

17:30

The only other interesting thing about the current neural networks is that the weights here, and here, and here are shared, so they are the same, because otherwise this one here would always be...

17:40

...the time to make one sandwich, but you've got two people doing it, so they make twice as many sandwiches each time they make a sandwich. Same with the computer: we could either make the computer processor faster, or...
