Why Does Diffusion Work Better than Auto-Regression?

Algorithmic Simplicity
16 Feb 2024 · 20:18

Summary

TL;DR: This video script explores the concept of generative AI, focusing on how deep neural networks create images from text descriptions. It explains the process of prediction tasks and how they differ from generation tasks, highlighting the use of auto-regressors and diffusion models to generate images. The script delves into technical details like training neural nets on image completion, the importance of random sampling for diversity, and the use of text prompts to condition image generation. It concludes by emphasizing the underlying principle of generative AI as a form of curve fitting.

Takeaways

  • 🧠 Deep neural networks are the underlying technology for generative AI models that can create images, text, audio, code, and more.
  • 🔮 Traditional neural nets are used for prediction tasks, learning from examples to predict labels for new inputs.
  • 🎨 Generative models, however, are capable of creating novel content, which might seem beyond mere prediction but is still a form of curve fitting.
  • 🖼️ A naive image generator trains a predictor with an all-black dummy input and whole training images as labels, but this fails: the output is a blurry average of the training set.
  • 🔍 Predictors learn to output the average of all plausible labels; in classification that average is still meaningful, but the average of many images is just a blur.
  • 🔧 Training a neural net to predict a single missing pixel can work well because the average of plausible values for one pixel is still a meaningful color.
  • 🌟 The process of generating an image can involve training multiple neural nets to predict each pixel sequentially, starting from a black image.
  • 🔄 Introducing randomness in sampling from the probability distribution of possible labels can create diversity in the generated images.
  • 🔢 Auto-regressors are one of the oldest generative models, with the ability to generate text or images by predicting the next element based on the previous ones.
  • 🚀 Diffusion models improve on auto-regressors by adding noise to the image and training the model to predict the original image from increasingly noisy versions, requiring fewer evaluations.
  • 📝 Generative models can be conditioned on various inputs like text prompts or sketches, allowing for the creation of content that matches specific descriptions or styles.

Q & A

  • What is an artificial intelligence image generator?

    -An artificial intelligence image generator is a system that creates images from text descriptions using deep neural networks, capable of producing high-quality images of various scenes.

  • What types of generative AI models have been developed recently?

    -In recent years, generative AI models have been developed for text, audio, code, and soon videos, all based on deep neural networks.

  • How do deep neural networks solve prediction tasks?

    -Deep neural networks solve prediction tasks by being trained with examples of inputs and their labels, then predicting the label for new, unseen inputs.

  • What is the difference between prediction and generation in the context of AI models?

    -Prediction involves fitting a curve to a set of points to forecast outcomes based on existing data, while generation involves creating novel outputs, such as images, that were not part of the training data.

  • Why did the initial attempt to generate an image from a black image fail?

    -The initial attempt failed because predictors output the average of possible labels, which in the case of images results in a blurry mess instead of a clear, distinct image.

  • How can a neural net be trained to complete an image with a missing pixel?

    -A neural net can be trained on images with that pixel masked out, using the pixel's true color as the label. Its prediction is the average of plausible values, and since the average of colors is just another color, there is no blurring effect for a single pixel.

  • What is an auto-regressor in the context of generative models?

    -An auto-regressor is a generative model that uses a removal process to gradually remove information and a generation process to add back information, typically pixel by pixel, using neural nets to predict the next pixel based on the partially masked image.

  • Why are auto-regressors not commonly used to generate images anymore?

    -Auto-regressors are not commonly used for image generation because they require evaluating a neural net for every element, making the process computationally expensive and slow for large images with millions of pixels.

  • What is a denoising diffusion model and how does it work?

    -A denoising diffusion model is a type of generative model that adds noise to the entire image to remove information in a spread-out manner, allowing for fewer neural net evaluations while maintaining image quality. It predicts the original clean image from the noisy version in multiple steps.

  • How can generative models be conditioned on text prompts or other inputs?

    -Generative models can be conditioned on text prompts or other inputs by providing the neural net with the additional input at each step during the generation process. The models are trained on pairs of images and corresponding descriptions or inputs to ensure the generated output matches the prompt.

  • What is classifier free guidance and how does it improve conditional diffusion models?

    -Classifier free guidance is a technique where the model is trained to make predictions with and without the conditioning prompt. During the denoising process, the model runs twice, subtracting the prediction without the prompt from the one with it, to focus on details that align with the prompt, resulting in more accurate generations.

Outlines

00:00

🤖 Introduction to AI Image Generation

This paragraph introduces the concept of artificial intelligence image generation, explaining how AI can create images from text descriptions using deep neural networks. It discusses the capabilities of generative AI in producing various types of content beyond images, such as text, audio, and code. The explanation delves into how neural networks are trained for prediction tasks, converting datasets into points in space and fitting curves to make predictions. It challenges the notion that neural nets are only for prediction, suggesting they can also be creative through a process that seems like curve fitting.

05:00

🖼️ The Process of Image Generation in AI

This section explores the process of generating images using AI, starting with the idea of training a neural net with images as labels. It explains the failure of using a completely black image as an input and the concept that predictors learn to output the average of possible labels, leading to a blurry result when averaging images. The paragraph then discusses the success of training neural nets to predict single missing pixels and the strategy of using multiple neural nets to complete an image one pixel at a time. It introduces the concept of auto-regressors and touches on the limitations of this method, such as the lack of diversity in generated images and the time-consuming nature of the process.

10:02

🔄 Enhancing Creativity in AI Image Generation

The paragraph discusses enhancing the creativity of AI image generation by introducing random sampling to avoid generating the same image repeatedly. It explains how predictors output a probability distribution and how sampling from this distribution can introduce diversity. The text also covers the practical aspect of training neural nets without the need for manual image labeling by using unlabeled images from the internet. It further explains the trade-off between the number of pixels generated at once and the quality of the generated image, highlighting the limitations of predicting multiple pixels simultaneously due to the averaging effect.

15:03

🌐 Advanced Techniques in Generative AI

This section delves into advanced techniques for improving the efficiency and quality of generative AI models. It introduces the concept of removing information from pixels in a way that maximizes the spread, such as adding noise to the entire image, allowing for faster generation with fewer neural net evaluations. The paragraph explains the denoising diffusion model, which is more efficient than auto-regressors for generating high-quality images. It also covers important technical details for implementing these models, including the use of the same model for all generation steps, special neural net architectures for faster training, and the importance of predicting the original clean image at every step of the generation process.

📝 Conditioning Generative Models with Text Prompts

The final paragraph discusses the ability to condition generative models on text prompts to create images that match a given description. It explains how models are trained on pairs of images and text descriptions, allowing for the generation of images that correspond to the prompts. The paragraph also introduces a technique called classifier-free guidance, which improves the model's adherence to the text prompt by subtracting predictions made without the prompt from those made with it. This results in images that more closely follow the provided text description, showcasing the flexibility and potential of generative AI models.

Keywords

💡Artificial Intelligence Image Generator

An Artificial Intelligence Image Generator is a technology that uses AI to create images from textual descriptions. It is a form of generative AI that can produce high-quality images of various scenes. In the video, this concept is central to explaining how AI can generate content, not just predict it, and the script describes the process by which these generators work, emphasizing the underlying technology of deep neural networks.

💡Deep Neural Networks

Deep Neural Networks are a class of machine learning algorithms modeled loosely after the human brain that are particularly good at handling large amounts of data and recognizing patterns. In the context of the video, they are the foundational technology behind generative AI models, including image generators, and are explained as being responsible for the ability of these models to learn from examples and make predictions or generate new content.

💡Prediction Task

A prediction task, as mentioned in the script, is a machine learning problem where the model is trained to predict an outcome based on given data. The video uses the example of training a neural net to identify objects in images, highlighting how the neural net learns to predict the label of an image based on its content, which is crucial for understanding the process of how generative AI models are trained.

💡Curve-Fitting

Curve-fitting is a method used in the script to describe the process of training a neural network to find a function that best fits a set of data points. In the context of AI, it is used metaphorically to explain how prediction tasks are solved by finding a model that fits the training data, which is key to understanding the limitations of prediction tasks in terms of creativity and generation.

💡Generative Models

Generative Models are AI models capable of creating new content, such as images, text, or audio, from scratch. The video explains that while these models are based on the same technology as predictive models, they are used to produce novel outputs rather than just fitting a curve to existing data, which is a critical distinction in understanding the capabilities of AI beyond mere prediction.

💡Auto-Regressor

An Auto-Regressor, as discussed in the video, is a type of generative model that generates data by predicting the next value in a sequence based on previous values. It is one of the earliest forms of generative models and is still in use today, such as in the case of Chat-GPT for text generation. The script uses the concept of an auto-regressor to illustrate the process of generating images pixel by pixel.

💡Pixel Completion

Pixel Completion refers to the process of training a neural net to predict the color of a missing pixel in an image. The script uses this as an example to explain how generative models can be trained to fill in missing parts of an image, which is a simplified version of the process used in more complex image generation.

💡Random Sampling

Random Sampling is a technique used in the script to introduce diversity and creativity into the outputs of generative models. By sampling from the probability distribution output by the model, rather than always choosing the most likely outcome, the model can produce a variety of different images, thus avoiding the generation of the same image each time.

💡Denoising Diffusion Model

A Denoising Diffusion Model is a type of generative model that works by gradually adding noise to an image and then training a neural net to reverse this process, predicting the original image from the noisy version. The video explains how this model allows for fewer neural net evaluations while maintaining high-quality image generation, making it more efficient than auto-regressors.

💡Classifier Free Guidance

Classifier Free Guidance is a technique mentioned in the script for improving the performance of conditional diffusion models. It involves training the model with and without the conditioning prompt, and then during the generation process, using the difference between the two predictions to enhance the details that come from the prompt, leading to more accurate and relevant image generation.

Highlights

Artificial intelligence can generate high-quality images from text descriptions using deep neural networks.

Generative AI models have expanded beyond images to include text, audio, code, and potentially videos.

Deep neural networks excel in prediction tasks by learning from input-output examples.

Prediction tasks are essentially curve-fitting exercises within a data space.

Generative models creatively produce novel outputs despite being based on prediction and curve fitting.

An initial attempt to train a neural net with a black dummy input and whole images as labels resulted in blurry outputs.

Predictors learn the average of possible labels when multiple labels apply to the same input.

A neural net can be trained to complete images with missing pixels, starting with a single pixel.

The process of image completion can be extended to fill in missing pixels one by one.

Introducing random sampling into the prediction process can increase the diversity of generated images.

Auto-regressors are a type of generative model that predict the next step based on the current state.

Auto-regressors can be inefficient for generating large images due to the number of evaluations required.

Denoising diffusion models add noise to images and train neural nets to predict the original clean image.

Denoising diffusion models can generate high-quality images with fewer neural net evaluations than auto-regressors.

Technical details such as using the same model for all steps and causal architectures can improve efficiency.

Conditional diffusion models use text prompts to guide the generation process towards specific descriptions.

Classifier-free guidance is a technique to improve the adherence of generated images to text prompts.

Generative AI, despite its creative outputs, is fundamentally a process of curve fitting in machine learning.

Transcripts

00:00

This is an artificial intelligence image generator. Given a text description of a picture, it will create, out of nothing, an image matching that description. As you can see, it is capable of generating high-quality images of all kinds of different scenes. And it's not just images: in recent years generative AI models have been developed that can generate text, audio, code, and soon, videos too. All of these models are based on the same underlying technology, namely deep neural networks.

00:38

In a few of my previous videos, I've explained how and why deep neural networks work so well. But I only explained how neural nets can solve prediction tasks. In a prediction task, the neural net is trained with a bunch of examples of inputs and their labels, and tries to predict what the label will be for a new input which it hasn't seen before. For example, if you trained a neural net on images labelled with the type of object appearing in each image, that neural net would learn to predict which object a human would say is in an image, even for new images which it hasn't seen before. Under the hood, the way that prediction tasks are solved is by converting the training dataset into a set of points in a space, and then fitting a curve through those points, so prediction tasks are also known as curve-fitting tasks.

01:29

And while prediction is certainly cool and very useful, it's not generation. Right? This model is just fitting a curve to a set of points. It can't produce new images. So where does the creativity of these generative models come from, if neural nets can only do curve fitting? Well, all of these generative models are in fact just predictors. Yep, it turns out that the process of producing novel works of art can be reduced to a curve fitting exercise. And in this video, you'll learn exactly how.

02:05

Suppose that we have a training dataset consisting of a bunch of images. We want to train a neural net to create new images which are similar in style to these training images.

The first thing you might try is to simply use the images as labels to train the predictor. Here we don't care about the mapping from inputs to outputs, so we can just use anything we like for the inputs, for example a completely black image. Predictors learn to map inputs to outputs according to their training data. So, this predictor, once trained, should be able to map the dummy all-black image to new images, like those seen in the training set, right? Err, ok, maybe not quite. That didn't work so well: instead of producing a nice, beautiful picture, we just got this blurry mess.

This demonstrates a very important fact about predictors. If there are multiple possible labels for the same input, the predictor will learn to output the average of those labels. For traditional classification tasks, this isn't really a problem, because the average of multiple class labels can still be a meaningful label. For example, this image could plausibly be given two different labels: both cat and dog would be valid labels. In this case a classifier would learn to output the average of those labels, which means you end up with a score of 0.5 cat and 0.5 dog, which is still a useful label. In fact it's arguably a better label than either of the original ones. On the other hand, when you average a bunch of images together you do not get a meaningful image out, you just get a blurry mess.

03:51

Let's try something a bit easier this time. How about, instead of generating a new image from scratch, we try to complete an image which has a part of it missing? In fact, let's make this really easy and suppose there is only one missing pixel, say, the bottom-right pixel.

Can we train a neural net to predict the value of this one missing pixel? Well, as before, the neural net is going to output the average of plausible values that the missing pixel can take. But since it's only one pixel that we're predicting, the average value is still meaningful. The average of a bunch of colors is just another color; there's no blurring effect. So, this model works perfectly fine! And we can use the value predicted by this neural net to complete images which are missing the bottom-right pixel.

Great, so we can complete images with 1 missing pixel… What about 2? Well, we can do the same thing again: train another neural net on images with 2 missing pixels, using the value of the second missing pixel as the label, and then use this neural net to fill in the second missing pixel. Now we have an image with just 1 missing pixel, and so we can use the first neural net to fill that in. Great.

And we can do this for every pixel in the image: train a neural net to predict the color of that pixel when it and all of the subsequent pixels are missing. Now we can "complete" an image starting from a fully black image, filling in one pixel at a time. Crucially, each neural net only predicts one pixel, and so there's no blurring effect.
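
As a minimal sketch of this one-pixel-at-a-time scheme (a hypothetical `pixel_predictors` interface with one trained net per pixel position, not code from the video):

```python
import numpy as np

def generate_image(pixel_predictors, height=28, width=28):
    """Complete an all-black image one pixel at a time, in raster order.

    pixel_predictors[i] is assumed to be a trained net whose .predict(image)
    returns a color for pixel i, given that pixels i, i+1, ... are still blank.
    """
    image = np.zeros((height, width))
    for i in range(height * width):        # one neural net evaluation per pixel
        row, col = divmod(i, width)
        image[row, col] = pixel_predictors[i].predict(image)
    return image
```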

05:38

And there we have it: we have just generated a plausible image, out of nothing… There's just one small problem. If we run this model again, it will generate exactly the same image… Not very creative, is it? But not to worry, we can fix this by introducing a bit of random sampling. You see, all predictors actually output a probability distribution over possible labels. Usually, we just take the label with the largest probability as the predicted value. But if we want diversity in our outputs, we can instead randomly sample a value from this probability distribution. This way, each time the model is run, it will sample different values at each step, which therefore changes the prediction for subsequent steps, and we get a completely different image each time. Now we have an interesting image generator.
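
A sketch of the difference between greedy prediction and random sampling, assuming the net outputs a probability distribution over 256 possible pixel values (the 256 is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng()

def pick_pixel_value(probs, sample=True):
    """probs: the model's probability distribution over 256 pixel values."""
    if sample:
        return int(rng.choice(256, p=probs))  # diverse: a different image each run
    return int(np.argmax(probs))              # greedy: the same image every run
```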

06:32

But still, at the end of the day, this model is made of predictors. They take as input a partially masked image, and predict the value of the next pixel. The only difference between this and a traditional image classifier is the label we used for training. The labels for our generator happen to be pixel colors which come from the original image itself, and not a human labeller. This is a very important point in practice: it means we don't need humans to manually label images for this model, we can just scrape unlabelled images off the internet. But from the point of view of the neural net, it doesn't know, nor does it care, that the label came from the original image. As far as it's concerned, this is just a curve fitting exercise, like any other.

07:17

The generative model we've just created is called an auto-regressor. We have a removal process, which removes pixels one at a time, and we train neural nets to undo this process, generating and adding back in pixels one at a time. This is actually one of the oldest generative models: the very earliest use of auto-regression dates back to 1927, when it was used to model the timing of sunspots. But auto-regressors are still in use today. Most notably, Chat-GPT is an auto-regressor. Chat-GPT generates text by using a transformer classifier to output a probability distribution over possible next words, given a partial piece of text. However, auto-regressors are not used to generate images anymore. And the reason is that, while they can generate very realistic images, they take too long to run.

In order to generate a sample with an auto-regressor, we need to evaluate a neural net once for every element. This is fine for generating a few thousand words to make a piece of text, but large images can have tens of millions of pixels. How can we get away with fewer neural net evaluations?

08:33

For our auto-regressor, we removed one pixel at a time. But we don't have to remove only one pixel: we could, for example, remove a 4 by 4 patch of pixels at a time, and train the neural net to predict all 16 missing pixels at once. This way, when we use our model to generate an image, it can produce 16 pixels per evaluation, and so generation is 16 times as fast.

But there is a limit to this. We can't generate too many pixels at the same time. In the extreme case, if we try to generate every pixel in the image at once, then we're back to the original problem: there are many possible labels that get averaged together into a blurry mess.

To be clear, the reason why the image quality degrades is that, when we predict a bunch of pixels at the same time, the model has to decide on the values for all of them at once. There are lots of plausible ways that this missing patch could be filled in, and so the model outputs the average of those. The model isn't able to make sure that the generated values are consistent with each other. In contrast, when we predict one pixel at a time, the model gets to see the previously generated pixels, and so it can change its prediction for this pixel to make it consistent with what has already been generated. This is why there's a trade-off: the more pixels we generate at once, the less computation we need to use, but the worse the quality of the generated images will be.
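
To make the trade-off concrete, a quick back-of-the-envelope count of neural net evaluations (illustrative numbers beyond the video's 4-by-4 example):

```python
def num_evaluations(num_pixels, pixels_per_step):
    """Each evaluation generates pixels_per_step pixels at once."""
    return (num_pixels + pixels_per_step - 1) // pixels_per_step

print(num_evaluations(1_000_000, 1))    # 1,000,000 evaluations, one pixel at a time
print(num_evaluations(1_000_000, 16))   # 62,500 evaluations with 4x4 patches
```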

10:08

Although, this problem only arises if the values we are predicting are related to each other. Suppose that the values were statistically independent of each other, that is, knowing one of them does not help to predict any others. In this case, the model doesn't need to look at the previously generated values, since knowing what they were wouldn't change its prediction for the next value anyway, and so you can predict all of them at the same time without any loss in quality.

So, that means, ideally, we want our model to generate a set of pixels that are unrelated to each other. For natural images, nearby pixels are the most strongly related, because they are usually part of the same object. Knowing the value of one pixel very often gives you a good idea of what color nearby pixels will be. This means that removing pixels in contiguous chunks is actually the worst way to do it. Instead, we should be removing pixels that are far away from each other, and hence more likely to be unrelated. So if, in each step, we remove a random set of pixels and predict values for those, then we can remove more pixels in each step for the same loss in image quality, compared to contiguous chunks.

11:27

In order to minimize the number of steps needed for generation, we want the pixels we remove in each step to be as spread out as possible. Removing pixels in a random order is a pretty good way of maximizing the average spread, but there is an even better way.

We can think of our generative model as two processes: a removal process that gradually removes information from the input, until nothing is left, and a generation process that uses neural nets to undo the removal process, generating and adding back in information. So far, we have been completely removing pixels. But rather than completely removing a pixel, we could instead remove only some of the information from a pixel by, for example, adding a small amount of random noise to it. This means we don't know exactly what the original pixel value was, but we do know it was somewhere close to the noisy value. Now, instead of removing a bunch of pixels in each step, we can add noise to the entire image. This way, we can remove information from every pixel in the image in a single step, which is the most spread-out way of removing information. And since it's more spread out, you can remove more information in each step, for the same loss in generation quality.

12:50

There is one small problem with this, though. When we want to generate a new image, we need to start the neural net off with some initial blank image. When we were removing pixels, every image eventually ended up as a completely black image, so of course that's where we start the generation process from. But now that we're adding noise, the values just keep getting larger and larger, never converging to anything. So where do we start the generation process from?

We can avoid this problem by changing our noising step slightly, so that we first scale down the original value and then add the noise. This ensures that, when we repeat this noising step many times, information from the original image will disappear, and the result will be equivalent to a pure random sample from the noise distribution. So we can start our generation process from any such noise sample.
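
A minimal sketch of such a scaled noising step, using the standard variance-preserving choice of scale factors (an assumption; the video does not state explicit formulas):

```python
import numpy as np

rng = np.random.default_rng()

def noising_step(x, beta=0.02):
    """Scale the image down, then add noise. Repeated many times, x converges
    to a pure sample from a standard normal distribution, whatever the image was."""
    noise = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
```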

13:44

And there we have it: this is known as a denoising diffusion model. The overall form is identical to an auto-regressor; the only difference is the way in which we remove information at each step. By adding noise, we can spread the removal of information out across the whole image, which makes the predicted values as independent of each other as possible, allowing you to use fewer neural net evaluations. Empirically, diffusion models can produce high-quality photo-realistic images in about a hundred steps, where auto-regressors would take millions.
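
A sketch of the corresponding generation loop, assuming a trained `model(x, t)` that predicts the clean image from the noisy input at step t (hypothetical interface; a constant-beta schedule for simplicity):

```python
import numpy as np

rng = np.random.default_rng()

def generate(model, shape, steps=100, beta=0.02):
    """Start from pure noise; at each step predict the clean image,
    then re-apply the noising process at a slightly lower noise level."""
    x = rng.standard_normal(shape)                 # pure noise sample
    for t in reversed(range(1, steps + 1)):
        x0_hat = model(x, t)                       # predicted clean image
        if t == 1:
            return x0_hat
        alpha_bar = (1.0 - beta) ** (t - 1)        # cumulative scaling at step t-1
        noise = rng.standard_normal(shape)
        x = np.sqrt(alpha_bar) * x0_hat + np.sqrt(1.0 - alpha_bar) * noise
```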

14:20

Now that we understand how these generative models work at a conceptual level, if you are ever going to implement these models in practice, there are a few important technical details that you should be aware of.

14:31

First, in the procedure I described for auto-regression, I used a different neural net in each step of the process. This is certainly the best way to get the most accurate predictions, but it's also very inefficient, since we need to train a whole bunch of different neural nets. In practice, you would just use the same model to do every step. This gives slightly worse predictions, but the savings in computation time more than make up for it.

In order to train a single neural net to perform all of the generation steps, you would remove a random number of pixels from each input, and train the neural net to predict the corresponding next pixel of each input. Additionally, you can give the number of pixels removed as an input to the neural net, so that it knows which pixel it's supposed to be generating. Now this one neural net can be used for all generation steps.

15:25

In the setup I just described, for each training image, the neural net is trained on only one generation step for that image. But ideally, we would like to train it on every generation step of every image; we can get more use out of our training data that way. If you did this the naïve way, you would have to evaluate the neural net once for every generation step, which means a lot more computation.

Fortunately, there exist special neural net architectures, known as causal architectures, that allow you to train on all of these generation steps while only evaluating the neural net once. There exist causal versions of all of the popular neural net architectures, such as causal convolutional neural nets and causal transformers. Causal architectures actually give slightly worse predictions, but in practice, auto-regression is almost always done with causal architectures because the training is so much faster. The generation process for causal architectures is still exactly the same, though.

For diffusion models, you can't use causal architectures, and so you do have to just train with each data point at a random generation step.

16:37

I described the diffusion model as predicting the slightly less noisy image from the previous step. However, it's actually better to predict the original, completely clean image at every step. The reason is that it makes the job of the neural net easier. If you make it predict the noisy next-step image, then the neural net needs to learn how to generate images at all different noise levels, which means the model will waste some of its capacity learning to produce noisy versions of images. If you instead just have the neural net always predict the clean image, then the model only needs to learn how to generate clean images, which is all we care about. You can then take the predicted clean image and reapply the noising process to it to get the next step of the generation process.

Except that, when you predict the clean image, at the early steps of the generation process the model has only pure noise as input, so the original clean image could have been anything, and you get a blurry mess again. To avoid this, we can train the neural net to predict the noise which was added to the image. Once we have a predicted value for the noise, we can plug it into the noising equation to get a prediction for the original clean image. So we are still predicting the original clean image, just in a round-about way. The advantage of doing it this way is that, now, the model output is uncertain at the later stages of the generation process, since any noise could have been added to the clean image. So the model outputs the average of a bunch of different noise samples, which is still valid noise.
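
In the standard parameterisation (an assumption; the transcript refers to an on-screen equation that isn't reproduced here), a noisy image is x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, so the model's noise prediction can be rearranged into a clean-image prediction:

```python
import numpy as np

def predict_clean_image(x_t, eps_hat, alpha_bar):
    """Invert x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps,
    given the model's noise prediction eps_hat."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)
```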

18:19

So far we've just been generating images from nothing, but most image generators actually allow you to provide a text prompt describing the image you want to make. The way that this works is exactly the same: you just give the neural net the text as an additional input at each step. These models are trained on pairs of images and their corresponding text descriptions, usually scraped from image alt-text tags found on the internet. This ensures that the generated image is something for which the text prompt could plausibly be given as a description.

In principle, you can condition generative models on anything, not just text, so long as you can find appropriate training data. For example, here is a generative model that is conditioned on sketches.

19:08

Finally, there's a technique that makes conditional diffusion models work better, called classifier-free guidance. For this, during training the model will sometimes be given the text prompt as additional input, and sometimes it won't. This way, the same model learns to make predictions with or without the conditioning prompt as input. Then, at each step of the denoising process, the model is run twice: once with the prompt, and once without. The prediction without the prompt is subtracted from the prediction with the prompt, which removes details that are generated without the prompt, thus leaving only details that came from the prompt, leading to generations which more closely follow the prompt.
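
A sketch of that guidance step. The video describes a plain subtraction; the weighted form below, with guidance weight w, is the common generalisation (w and its default value here are assumptions):

```python
def guided_prediction(model, x, t, prompt, w=7.5):
    """Classifier-free guidance: amplify the details the prompt adds.

    w = 1 recovers the plain conditional prediction; larger w pushes the
    output further towards details that come from the prompt."""
    cond = model(x, t, prompt)     # prediction with the prompt
    uncond = model(x, t, None)     # prediction without the prompt
    return uncond + w * (cond - uncond)
```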

19:49

In conclusion, generative AI, like all machine learning, is just curve fitting.

And that's all for this video. If you enjoyed it, please like and subscribe. And if you have any suggestions for topics you'd like me to cover in a future video, leave a comment below.


Related Tags
AI Image Generation · Deep Neural Networks · Generative AI · Auto-Regressors · Denoising Diffusion · Image Prediction · Text Prompts · Neural Nets · Machine Learning · Innovation