Why Does Diffusion Work Better than Auto-Regression?
Summary
TLDR: This video script explores the concept of generative AI, focusing on how deep neural networks create images from text descriptions. It explains the process of prediction tasks and how they differ from generation tasks, highlighting the use of auto-regressors and diffusion models to generate images. The script delves into technical details like training neural nets on image completion, the importance of random sampling for diversity, and the use of text prompts to condition image generation. It concludes by emphasizing the underlying principle of generative AI as a form of curve fitting.
Takeaways
- 🧠 Deep neural networks are the underlying technology for generative AI models that can create images, text, audio, code, and more.
- 🔮 Traditional neural nets are used for prediction tasks, learning from examples to predict labels for new inputs.
- 🎨 Generative models, however, are capable of creating novel content, which might seem beyond mere prediction but is still a form of curve fitting.
- 🖼️ A first attempt at an image generator, training a predictor to map a dummy black image to the training images, fails: the output is a blurry average of the training set rather than a new image.
- 🔍 Predictors tend to output the average of possible labels, which can lead to a blurry result when averaging images, unlike classification tasks.
- 🔧 Training a neural net to predict a single missing pixel can work well because the average of plausible values for one pixel is still a meaningful color.
- 🌟 The process of generating an image can involve training multiple neural nets to predict each pixel sequentially, starting from a black image.
- 🔄 Introducing randomness in sampling from the probability distribution of possible labels can create diversity in the generated images.
- 🔢 Auto-regressors are one of the oldest generative models, with the ability to generate text or images by predicting the next element based on the previous ones.
- 🚀 Diffusion models improve on auto-regressors by adding noise to the image and training the model to predict the original image from increasingly noisy versions, requiring fewer evaluations.
- 📝 Generative models can be conditioned on various inputs like text prompts or sketches, allowing for the creation of content that matches specific descriptions or styles.
Q & A
What is an artificial intelligence image generator?
-An artificial intelligence image generator is a system that creates images from text descriptions using deep neural networks, capable of producing high-quality images of various scenes.
What types of generative AI models have been developed recently?
-In recent years, generative AI models have been developed for text, audio, code, and soon videos, all based on deep neural networks.
How do deep neural networks solve prediction tasks?
-Deep neural networks solve prediction tasks by being trained with examples of inputs and their labels, then predicting the label for new, unseen inputs.
What is the difference between prediction and generation in the context of AI models?
-Prediction involves fitting a curve to a set of points to forecast outcomes based on existing data, while generation involves creating novel outputs, such as images, that were not part of the training data.
Why did the initial attempt to generate an image from a black image fail?
-The initial attempt failed because predictors output the average of possible labels, which in the case of images results in a blurry mess instead of a clear, distinct image.
How can a neural net be trained to complete an image with a missing pixel?
-A neural net can be trained on images with one pixel removed, using the true value of that pixel as the label. It learns to output the average of the plausible values for that pixel, and since the average of a set of colors is just another color, there is no blurring effect.
What is an auto-regressor in the context of generative models?
-An auto-regressor is a generative model that uses a removal process to gradually remove information and a generation process to add back information, typically pixel by pixel, using neural nets to predict the next pixel based on the partially masked image.
Why are auto-regressors not commonly used to generate images anymore?
-Auto-regressors are not commonly used for image generation because they require evaluating a neural net for every element, making the process computationally expensive and slow for large images with millions of pixels.
What is a denoising diffusion model and how does it work?
-A denoising diffusion model is a type of generative model that adds noise to the entire image to remove information in a spread-out manner, allowing for fewer neural net evaluations while maintaining image quality. It predicts the original clean image from the noisy version in multiple steps.
How can generative models be conditioned on text prompts or other inputs?
-Generative models can be conditioned on text prompts or other inputs by providing the neural net with the additional input at each step during the generation process. The models are trained on pairs of images and corresponding descriptions or inputs to ensure the generated output matches the prompt.
What is classifier free guidance and how does it improve conditional diffusion models?
-Classifier free guidance is a technique where the model is trained to make predictions with and without the conditioning prompt. During the denoising process, the model runs twice, subtracting the prediction without the prompt from the one with it, to focus on details that align with the prompt, resulting in more accurate generations.
Outlines
🤖 Introduction to AI Image Generation
This paragraph introduces the concept of artificial intelligence image generation, explaining how AI can create images from text descriptions using deep neural networks. It discusses the capabilities of generative AI in producing various types of content beyond images, such as text, audio, and code. The explanation delves into how neural networks are trained for prediction tasks, converting datasets into points in space and fitting curves to make predictions. It challenges the notion that neural nets are only for prediction, suggesting they can also be creative through a process that seems like curve fitting.
🖼️ The Process of Image Generation in AI
This section explores the process of generating images using AI, starting with the idea of training a neural net with images as labels. It explains the failure of using a completely black image as an input and the concept that predictors learn to output the average of possible labels, leading to a blurry result when averaging images. The paragraph then discusses the success of training neural nets to predict single missing pixels and the strategy of using multiple neural nets to complete an image one pixel at a time. It introduces the concept of auto-regressors and touches on the limitations of this method, such as the lack of diversity in generated images and the time-consuming nature of the process.
🔄 Enhancing Creativity in AI Image Generation
The paragraph discusses enhancing the creativity of AI image generation by introducing random sampling to avoid generating the same image repeatedly. It explains how predictors output a probability distribution and how sampling from this distribution can introduce diversity. The text also covers the practical aspect of training neural nets without the need for manual image labeling by using unlabeled images from the internet. It further explains the trade-off between the number of pixels generated at once and the quality of the generated image, highlighting the limitations of predicting multiple pixels simultaneously due to the averaging effect.
🌐 Advanced Techniques in Generative AI
This section delves into advanced techniques for improving the efficiency and quality of generative AI models. It introduces the concept of removing information from pixels in a way that maximizes the spread, such as adding noise to the entire image, allowing for faster generation with fewer neural net evaluations. The paragraph explains the denoising diffusion model, which is more efficient than auto-regressors for generating high-quality images. It also covers important technical details for implementing these models, including the use of the same model for all generation steps, special neural net architectures for faster training, and the importance of predicting the original clean image at every step of the generation process.
📝 Conditioning Generative Models with Text Prompts
The final paragraph discusses the ability to condition generative models on text prompts to create images that match a given description. It explains how models are trained on pairs of images and text descriptions, allowing for the generation of images that correspond to the prompts. The paragraph also introduces a technique called classifier-free guidance, which improves the model's adherence to the text prompt by subtracting predictions made without the prompt from those made with it. This results in images that more closely follow the provided text description, showcasing the flexibility and potential of generative AI models.
Keywords
💡Artificial Intelligence Image Generator
💡Deep Neural Networks
💡Prediction Task
💡Curve-Fitting
💡Generative Models
💡Auto-Regressor
💡Pixel Completion
💡Random Sampling
💡Denoising Diffusion Model
💡Classifier Free Guidance
Highlights
Artificial intelligence can generate high-quality images from text descriptions using deep neural networks.
Generative AI models have expanded beyond images to include text, audio, code, and potentially videos.
Deep neural networks excel in prediction tasks by learning from input-output examples.
Prediction tasks are essentially curve-fitting exercises within a data space.
Generative models creatively produce novel outputs despite being based on prediction and curve fitting.
An initial attempt to train a neural net with a black image as the input and the training images as labels resulted in blurry outputs.
Predictors learn the average of possible labels when multiple labels apply to the same input.
A neural net can be trained to complete images with missing pixels, starting with a single pixel.
The process of image completion can be extended to fill in missing pixels one by one.
Introducing random sampling into the prediction process can increase the diversity of generated images.
Auto-regressors are a type of generative model that predict the next step based on the current state.
Auto-regressors can be inefficient for generating large images due to the number of evaluations required.
Denoising diffusion models add noise to images and train neural nets to predict the original clean image.
Denoising diffusion models can generate high-quality images with fewer neural net evaluations than auto-regressors.
Technical details such as using the same model for all steps and causal architectures can improve efficiency.
Conditional diffusion models use text prompts to guide the generation process towards specific descriptions.
Classifier-free guidance is a technique to improve the adherence of generated images to text prompts.
Generative AI, despite its creative outputs, is fundamentally a process of curve fitting in machine learning.
Transcripts
This is an artificial intelligence image generator. Given a text description of a picture,
it will create, out of nothing, an image matching that description. As you can see,
it is capable of generating high quality images of all kinds of different scenes.
And it’s not just images, in recent years generative AI models have been
developed that can generate text, audio, code, and soon, videos too. All of these
models are based on the same underlying technology, namely deep neural networks.
In a few of my previous videos, I’ve explained how and why deep neural networks work so well.
But I only explained how neural nets can solve prediction tasks. In a prediction task,
the neural net is trained with a bunch of examples of inputs and their labels, and
tries to predict what the label will be for a new input which it hasn’t seen before. For example,
if you trained a neural net on images labelled with the type of object appearing in each image,
that neural net would learn to predict which object a human would say is in an image,
even for new images which it hasn’t seen before. Under the hood, the way that prediction tasks
are solved is by converting the training dataset into a set of points in a space, and then fitting
a curve through those points, so prediction tasks are also known as curve-fitting tasks.
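To make the curve-fitting picture concrete, here is a minimal sketch of a predictor in that sense. The data, the straight-line model, and the names `x_train`, `y_train` and `predict` are all made up for illustration; a real neural net fits a far more flexible curve, but the idea is the same.

```python
import numpy as np

# Toy training set: inputs x with noisy labels y (made-up data for illustration).
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 50)
y_train = 2.0 * x_train + 0.5 + rng.normal(scale=0.1, size=x_train.shape)

# "Training" here means choosing the line that best fits the points
# in the least-squares sense. np.polyfit returns the highest degree first.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(x_new):
    """Predict a label for an input that was not in the training set."""
    return slope * x_new + intercept

print(predict(0.25))
```

The fitted curve is the whole model: predicting a label for a new input just means reading off the curve at that point.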
And while prediction is certainly cool and very useful, it’s not generation. Right? This model
is just fitting a curve to a set of points. It can’t produce new images. So where does
the creativity of these generative models come from, if neural nets can only do curve fitting?
Well, all of these generative models are in fact just predictors. Yep,
it turns out that the process of producing novel works of art can be
reduced to a curve fitting exercise. And in this video, you’ll learn exactly how.
Suppose that we have a training dataset consisting of a bunch of images. We want
to train a neural net to create new images which are similar in style to these training images.
The first thing you might try is to simply use the images as labels to
train the predictor. Here we don’t care about the mapping from inputs to outputs,
so we can just use anything we like for the inputs, for example a completely
black image. Predictors learn to map inputs to outputs according to their training data. So,
this predictor, once trained, should be able to map the dummy all-black image to new images,
like those seen in the training set, right? Err, ok maybe not quite. That didn’t work so well,
instead of producing a nice, beautiful picture, we just got this blurry mess.
This demonstrates a very important fact about predictors. If there are multiple possible
labels for the same input, the predictor will learn to output the average of those
labels. For traditional classification tasks, this isn’t really a problem,
because the average of multiple class labels can still be a meaningful label. For example,
this image could plausibly be given two different labels, both cat and dog would be
valid labels. In this case a classifier would learn to output the average of those labels,
which means you end up with a score of 0.5 cat and 0.5 dog. Which is still a useful label. In fact
it’s arguably a better label than either of the original ones. On the other hand, when you average
a bunch of images together you do not get a meaningful image out, you just get a blurry mess.
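Here is a small sketch of why the average wins. It assumes the predictor is trained with a squared-error loss (an assumption, the loss is not named above) and uses two made-up 4 by 4 "images" as the equally valid labels for the same dummy input.

```python
import numpy as np

# Two equally plausible 4x4 "images" for the same dummy input:
# one with a bright left half, one with a bright right half.
left = np.zeros((4, 4))
left[:, :2] = 1.0
right = np.zeros((4, 4))
right[:, 2:] = 1.0
labels = np.stack([left, right])

def mse(candidate):
    """Mean squared error of one fixed output against all labels."""
    return np.mean((labels - candidate) ** 2)

# The per-pixel average of the labels: every pixel is 0.5, a uniform grey.
average = labels.mean(axis=0)

# The grey average scores better than either real image,
# so that is what a squared-error predictor is pushed towards.
print(mse(average), mse(left), mse(right))
```

Under this assumed loss, the uniform grey image has a lower error than either of the sharp images, even though it looks like neither of them.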
Let’s try something a bit easier this time. How about, instead of generating a new image from
scratch, we try to complete an image which has a part of it missing. In fact, let’s make this
really easy and suppose there is only one missing pixel, say, the bottom right pixel.
Can we train a neural net to predict the value of this one missing pixel? Well,
as before, the neural net is going to output the average of plausible values
that the missing pixel can take. But since it’s only one pixel that we’re predicting,
the average value is still meaningful. The average of a bunch of colors is just another color,
there’s no blurring effect. So, this model works perfectly fine!
And we can use the value predicted by this neural
net to complete images which are missing the bottom-right pixel.
Great, so we can complete images with 1 missing pixel… What about 2?
Well, we can do the same thing again, train another neural net on images with 2 missing
pixels, using the value of the second missing pixel as the label. And then use this neural
net to fill in the second missing pixel. Now we have an image with just 1 missing pixel,
and so we can use the first neural net to fill in that. Great.
And we can do this for every pixel in the image; train a neural net to predict the color of that
pixel when it and all of the subsequent pixels are missing. Now we can “complete”
an image starting from a fully black image, and filling in one pixel at a time. Crucially,
each neural net only predicts one pixel, and so there’s no blurring effect.
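The generation loop can be sketched like this. The function `predict_pixel` is a hypothetical stand-in for the trained neural nets (here it is just a toy rule that continues a gradient), and raster order from top-left to bottom-right is assumed, so that the bottom-right pixel is filled in last.

```python
import numpy as np

def predict_pixel(image, known_count):
    """Hypothetical stand-in for the net trained for position `known_count`.

    A real model would be a neural net. This toy rule just makes each pixel
    slightly brighter than the previous one so the loop runs end to end.
    """
    if known_count == 0:
        return 0.0
    flat = image.ravel()
    return min(1.0, flat[known_count - 1] + 0.05)

def generate(height, width):
    image = np.zeros((height, width))  # start from a fully black image
    flat = image.ravel()               # a view into `image`, in raster order
    for i in range(height * width):
        # Pixels 0..i-1 are already filled in; pixel i and later are missing.
        flat[i] = predict_pixel(image, i)
    return image

sample = generate(4, 4)
print(sample)
```

Note that each call only ever produces a single pixel value, and every call sees all of the pixels generated before it.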
And there we have it, we have just generated a plausible image, out of nothing… There’s just
one small problem. If we run this model again, it will generate exactly the same image… Not very
creative, is it? But not to worry, we can fix this by introducing a bit of random sampling. You see,
all predictors actually output a probability distribution over possible labels. Usually,
we just take the label with the largest probability as the predicted value. But
if we want diversity in our outputs, we can instead randomly sample a value from
this probability distribution. This way, each time the model is run, it will sample
different values at each step, which therefore changes the prediction for subsequent steps,
and we get a completely different image each time. Now we have an interesting image generator.
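As a rough sketch of the difference, suppose the predictor has output the following distribution over four possible pixel intensities. The numbers and the names `values` and `probs` are invented for illustration.

```python
import numpy as np

# Hypothetical predicted distribution over 4 possible pixel intensities.
values = np.array([0.0, 0.33, 0.66, 1.0])
probs = np.array([0.1, 0.5, 0.3, 0.1])

def most_likely():
    """Always returns the same value: the one with the largest probability."""
    return values[np.argmax(probs)]

def sample(rng):
    """Returns a random value, drawn in proportion to its probability."""
    return rng.choice(values, p=probs)

rng = np.random.default_rng(0)
draws = [sample(rng) for _ in range(1000)]

print(most_likely())            # the same answer on every run
print(sorted(set(draws)))       # several different values show up
```

Taking the most likely value gives one answer forever, while sampling gives different values on different runs, weighted by how plausible the model thinks each one is.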
But still, at the end of the day, this model is made of predictors. They take as input a partially
masked image, and predict the value of the next pixel. The only difference between this and a
traditional image classifier is the label we used for training. The labels for our generator happen
to be pixel colors which come from the original image itself, and not a human labeller. This is
a very important point in practice: it means we don’t need humans to manually label images
for this model, we can just scrape unlabelled images off the internet. But from the point of
view of the neural net, it doesn’t know, nor does it care, that the label came from the
original image. As far as it’s concerned this is just a curve fitting exercise, like any other.
The generative model we’ve just created is called an auto-regressor. We have a removal process,
which removes pixels one at a time, and we train neural nets to undo this process, generating and
adding back in pixels one at a time. This is actually one of the oldest generative models,
the very earliest use of auto-regression dates back to 1927, where it was used to model the
timing of sunspots. But auto-regressors are still in use today. Most notably,
ChatGPT is an auto-regressor. ChatGPT generates text by using a transformer classifier to output
a probability distribution over possible next words, given a partial piece of text. However,
auto-regressors are not used to generate images anymore. And the reason is that,
while they can generate very realistic images, they take too long to run.
In order to generate a sample with an auto-regressor,
we need to evaluate a neural net once for every element. This is fine for
generating a few thousand words to make a piece of text, but large images can
have tens of millions of pixels. How can we get away with fewer neural net evaluations?
For our auto-regressor, we removed one pixel at a time. But we don’t have to remove only one pixel,
we could, for example, remove a 4 by 4 patch of pixels at a time. And
train the neural net to predict all 16 missing pixels at once.
This way, when we use our model to generate an image,
it can produce 16 pixels per evaluation, and so generation is 16 times as fast.
But there is a limit to this. We can’t generate too many pixels at the same time. In the extreme
case, if we try to generate every pixel in the image at once, then we’re back to the
original problem: there are many possible labels that get averaged together into a blurry mess.
To be clear, the reason why the image quality degrades is that,
when we predict a bunch of pixels at the same time, the model has to decide on the values for
all of them at once. There are lots of plausible ways that this missing patch could be filled in,
and so the model outputs the average of those. The model isn’t able to
make sure that the generated values are consistent with each other. In contrast,
when we predict one pixel at a time, the model gets to see the previously generated pixels,
and so the model can change its prediction for this pixel to make it consistent with what has
already been generated. This is why there’s a trade-off, the more pixels we generate at once,
the less computation we need to use, but the worse the quality of the generated images will be.
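A toy example of the consistency problem, under made-up assumptions: two pixels that are always equal in the training data, both dark or both bright with equal probability. Generating both at once is modelled here as drawing each pixel separately without seeing the other; generating one at a time lets the second prediction see the first.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_both_at_once():
    """Each pixel is decided without seeing the other (p = 0.5 each)."""
    return rng.integers(0, 2), rng.integers(0, 2)

def sample_one_at_a_time():
    """The second pixel is predicted after seeing the first, so it can match."""
    first = rng.integers(0, 2)
    second = first  # given the first pixel, only one value is plausible
    return first, second

at_once = [sample_both_at_once() for _ in range(1000)]
sequential = [sample_one_at_a_time() for _ in range(1000)]

# Count pairs that never occur in the (imagined) training data.
inconsistent_at_once = sum(a != b for a, b in at_once)
inconsistent_sequential = sum(a != b for a, b in sequential)

print(inconsistent_at_once, inconsistent_sequential)
```

Roughly half of the pairs generated at once are combinations that never appear in the data, while the one-at-a-time version never produces one.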
Although, this problem only arises if the values we are predicting are related to
each other. Suppose that the values were statistically independent of each other,
that is, knowing one of them does not help to predict any others. In this case,
the model doesn’t need to look at the previously generated values,
since knowing what they were wouldn’t change its prediction for the next value anyway. In
this case you can predict all of them at the same time without any loss in quality.
So, that means, ideally, we want our model to generate a set of pixels that are unrelated to
each other. For natural images, nearby pixels are the most strongly related,
because they are usually part of the same object. Knowing the value of one pixel very
often gives you a good idea of what color nearby pixels will be. This means that removing pixels
in contiguous chunks is actually the worst way to do it. Instead, we should be removing pixels that
are far away from each other, and hence more likely to be unrelated. So if in each step,
we remove a random set of pixels, and predict values for those, then we can remove more pixels
in each step for the same loss in image quality, compared to contiguous chunks.
In order to minimize the number of steps needed for generation, we want the pixels
we remove in each step to be as spread out as possible. Removing pixels in a random order is
a pretty good way of maximizing the average spread, but there is an even better way.
We can think of our generative model as two processes: a removal process that gradually
removes information from the input, until nothing is left. And a generation process
that uses neural nets to undo the removal process, generating and adding back in information. So far,
we have been completely removing pixels. But rather than completely removing a pixel,
we could instead remove only some of the information from a pixel, by, for example,
adding a small amount of random noise to it. This means we don’t know exactly what the
original pixel value was, but we do know it was somewhere close to the noisy value. Now,
instead of removing a bunch of pixels in each step, we can add noise to the entire image.
This way, we can remove information from every pixel in the image in a single step,
which is the most spread-out way of removing information. And since it’s more spread out,
you can remove more information in each step, for the same loss in generation quality.
There is one small problem with this though. When we want to generate a new image, we need to start
the neural net off with some initial blank image. When we were removing pixels, then every image
eventually ends up as a completely black image, so of course that’s where we start the generation
process from. But now that we’re adding noise, the values just keep getting larger and larger,
never converging to anything. So where do we start the generation process from?
We can avoid this problem by changing our noising step slightly, so that we
first scale down the original value and then add the noise. This ensures that,
when we repeat this noising step many times, information from the original
image will disappear, and the result will be equivalent to a pure random sample from
the noise distribution. So we can start our generation process from any such noise sample.
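Here is a minimal sketch of such a noising step. The exact scaling is an assumption: it uses the common variance-preserving choice of multiplying by sqrt(1 - beta) and adding Gaussian noise scaled by sqrt(beta). The value of `beta`, the number of steps, and the constant test image are all made up.

```python
import numpy as np

def noising_step(x, beta, rng):
    """Scale the current values down, then add Gaussian noise.

    The sqrt(1 - beta) and sqrt(beta) factors are an assumed choice,
    borrowed from the common variance-preserving formulation.
    """
    return np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)

rng = np.random.default_rng(0)
image = np.full((64, 64), 5.0)  # stand-in "image" with a large constant value

x = image
for _ in range(1000):
    x = noising_step(x, beta=0.02, rng=rng)

# After many steps the original value of 5.0 has been forgotten:
# the result looks like a sample of standard Gaussian noise.
print(x.mean(), x.std())
```

With this scaling the values no longer grow without bound. The mean ends up near 0 and the standard deviation near 1, whatever image we started from.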
And there we have it, this is known as a denoising diffusion model. The overall form is identical
to an auto-regressor, the only difference is the way in which we remove information at each
step. By adding noise, we can spread out the removal of information all across the image,
which makes the predicted values as independent of each other as possible, allowing you to use
fewer neural net evaluations. Empirically, diffusion models can produce high-quality
photo-realistic images in about a hundred steps, where auto-regressors would take millions.
Now that we understand how these generative models
work at a conceptual level, if you are ever going to implement these models in practice,
there are a few important technical details that you should be aware of.
First, in the procedure I described for auto-regression, I used a different neural
net in each step of the process. This is certainly the best way to get the most accurate predictions,
but it’s also very inefficient, since we need to train a whole bunch of different
neural nets. In practice, you would just use the same model to do every step. This gives
slightly worse predictions, but the savings in computation time more than make up for it.
In order to train a single neural net to perform all of the generation steps,
you would remove a random number of pixels from each input, and train the
neural net to predict the corresponding next pixel of each input. Additionally,
you can also give the number of pixels removed as an input to the neural net, so that it knows which
pixel it’s supposed to be generating. Now this one neural net can be used for all generation steps.
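The construction of one training example could look roughly like this. The function name `make_training_example`, the use of zero for removed pixels, and the raster ordering are illustrative assumptions.

```python
import numpy as np

def make_training_example(image, rng):
    """Build one (input, pixels removed, label) triple from an unlabelled image."""
    flat = image.ravel()
    n = flat.size
    removed = int(rng.integers(1, n + 1))  # how many trailing pixels to remove
    first_missing = n - removed
    masked = flat.copy()
    masked[first_missing:] = 0.0           # removed pixels are set to black
    label = flat[first_missing]            # the next pixel the net must generate
    return masked.reshape(image.shape), removed, label

rng = np.random.default_rng(0)
# A made-up 4x4 image with distinct non-zero values so the mask is easy to see.
image = np.arange(1, 17, dtype=float).reshape(4, 4) / 16.0

masked, removed, label = make_training_example(image, rng)
print(removed, label)
print(masked)
```

The net would receive `masked` and `removed` as inputs and be trained to output `label`. No human labelling is involved: the label comes straight from the original image.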
In the setup I just described, for each training image, the neural net
is trained on only one generation step for that image. But ideally,
we would like to train it on every generation step of every image, we can get more use out
of our training data that way. If you did this the naïve way you would have to evaluate
the neural net once for every generation step. Which means a lot more computation.
Fortunately, there exist special neural net architectures, known as causal architectures,
that allow you to train on all of these generation steps while only evaluating
the neural net once. There exist causal versions of all of the popular neural net architectures,
such as causal convolutional neural nets, and causal transformers. Causal
architectures actually give slightly worse predictions, but in practice,
auto-regression is almost always done with causal architectures because the training is so much
faster. The generation process for causal architectures is still exactly the same though.
For diffusion models, you can’t use causal architectures and so you do
have to just train with each data point at a random generation step.
I described the diffusion model as predicting the slightly less noisy image from the previous
step. However, it’s actually better to predict the original, completely clean image at every
step. The reason for this is it makes the job of the neural net easier. If you make
it predict the noisy next step image, then the neural net needs to learn how to generate images
at all different noise levels. This means the model will waste some of its capacity
learning to produce noisy versions of images. If you instead just have the neural net always
predict the clean image, then the model only needs to learn how to generate clean images,
which is all we care about. You can then take the predicted clean image and reapply
the noising process to it to get the next step of the generation process.
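A rough sketch of that generation loop is below. Everything specific in it is an assumption: `predict_clean` is a hypothetical stand-in for the trained net, the noise level is tracked by a signal fraction `alpha_bar` that goes from 0 (pure noise) to 1 (clean), and the re-noising uses the usual square-root weighting.

```python
import numpy as np

def predict_clean(x_noisy, alpha_bar):
    """Hypothetical stand-in for the trained net: a clean-image guess.

    A real model would be a neural net that also uses `alpha_bar`.
    """
    return np.clip(x_noisy, 0.0, 1.0)

def generate(shape, alpha_bars, rng):
    """alpha_bars: increasing signal levels, ending at 1.0 (no noise left)."""
    x = rng.normal(size=shape)  # start from pure noise
    current = 0.0               # signal level of the current x
    for nxt in alpha_bars:
        x0_hat = predict_clean(x, current)  # guess the clean image
        eps = rng.normal(size=shape)
        # Re-apply the noising process, but only up to the next, lower noise level.
        x = np.sqrt(nxt) * x0_hat + np.sqrt(1.0 - nxt) * eps
        current = nxt
    return x

rng = np.random.default_rng(0)
sample = generate((8, 8), alpha_bars=np.linspace(0.1, 1.0, 10), rng=rng)
print(sample.min(), sample.max())
```

Each step predicts a fully clean image and then deliberately adds some noise back, a little less each time, until the last step adds none.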
Except that when you predict the clean image then, at the early steps of the generation
process the model has only pure noise as input, so the original clean image could have been anything,
and so you get a blurry mess again. To avoid this, we can train the neural net to predict
the noise which was added to the image. Once we have a predicted value for the noise, we can plug
it into this equation to get a prediction for the original clean image. So we are still predicting
the original clean image, just in a round-about way. The advantage of doing it this way is that,
now, the model output is uncertain at the later stages of the generation process,
since any noise could have been added to the clean image. So the model outputs the
average of a bunch of different noise samples, which is still valid noise.
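The equation itself is shown on screen and is not reproduced in this transcript. As a sketch under that caveat, the code below assumes the commonly used closed form in which a noisy image is sqrt(alpha_bar) times the clean image plus sqrt(1 - alpha_bar) times the noise. The names `add_noise`, `clean_from_noise` and `alpha_bar` are illustrative.

```python
import numpy as np

def add_noise(x0, alpha_bar, eps):
    """Assumed forward process: mix the clean image with Gaussian noise."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def clean_from_noise(x_t, alpha_bar, eps_pred):
    """Invert the mixing above to turn a noise prediction into a clean image."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.uniform(size=(8, 8))    # made-up clean image
eps = rng.normal(size=x0.shape)  # the noise that was actually added
x_t = add_noise(x0, alpha_bar=0.3, eps=eps)

# With a perfect noise prediction we recover the clean image exactly.
x0_hat = clean_from_noise(x_t, alpha_bar=0.3, eps_pred=eps)
print(np.allclose(x0_hat, x0))
```

In practice the noise prediction is imperfect, so the recovered clean image is only an estimate, which is then re-noised for the next step.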
So far we’ve just been generating images from nothing, but most image generators actually
allow you to provide a text prompt describing the image you want to make. The way that this
works is exactly the same, you just give the neural net the text as an additional input at
each step. These models are trained on pairs of images and their corresponding text descriptions,
usually scraped from image alt text tags found on the internet. This ensures that the generated
image is something for which the text prompt could plausibly be given as a description of that image.
In principle, you can condition generative models on anything,
not just text, so long as you can find appropriate training data. For example,
here is a generative model that is conditioned on sketches.
Finally, there’s a technique to make conditional diffusion models work better,
called classifier free guidance. For this, during training the model will sometimes
be given the text-prompts as additional input, and sometimes it won’t. This way,
the same model learns to do predictions with or without the conditioning prompt as input. Then,
at each step of the denoising process, the model is run twice, once with the prompt,
and once without. The prediction without the prompt is subtracted from the prediction with
the prompt, which removes details that are generated without the prompt, thus leaving
only details that came from the prompt, leading to generations which more closely follow the prompt.
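In code, the combination step might look like the sketch below. It uses the commonly used form in which the difference between the two predictions is scaled up and added back to the prediction made without the prompt. The `guidance_scale` parameter and the toy numbers are assumptions, not something specified above.

```python
import numpy as np

def guided_prediction(pred_with_prompt, pred_without_prompt, guidance_scale):
    """Amplify whatever the prompt changed about the prediction."""
    # What the prompt contributed: prediction with it minus prediction without it.
    difference = pred_with_prompt - pred_without_prompt
    return pred_without_prompt + guidance_scale * difference

# Made-up predictions from running the same model twice at one denoising step.
cond = np.array([1.0, 2.0, 3.0])    # with the prompt
uncond = np.array([1.0, 1.0, 1.0])  # without the prompt

print(guided_prediction(cond, uncond, guidance_scale=3.0))  # [1. 4. 7.]
```

With a scale of 1 this just returns the prediction with the prompt. Larger scales push the output further in the direction that the prompt suggested.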
In conclusion, generative AI, like all machine learning, is just curve fitting.
And that’s all for this video. If you enjoyed it, please like and subscribe. And if you have
any suggestions for topics you’d like me to cover in a future video, leave a comment below.