Stanford CS25: V1 | Transformers in Language: The development of GPT Models, GPT-3
Summary
TLDR: This video presents a comprehensive overview of the remarkable progress in neural language modeling, culminating in the development of GPT-3, a powerful 175-billion-parameter autoregressive transformer model. It explores the evolution of language models, from n-gram models to recurrent neural networks, LSTMs, and transformers, highlighting their increasing coherence and ability to generate realistic text. The script delves into the unsupervised learning approach, demonstrating how models like GPT-3 can perform various tasks, from reading comprehension to translation, without explicit fine-tuning. It also showcases the versatility of transformers in modeling other modalities like images and code, with impressive results in tasks like image generation and code writing.
Takeaways
- 🤖 Progress in neural language modeling has been rapid, driven by work on unsupervised learning in language.
- 🧠 Autoregressive modeling with transformers is a universal approach that can yield strong results even in domains with strong inductive biases like images or text-to-image generation.
- 📝 GPT models were not initially focused on language modeling itself, but rather on pushing the boundaries of unsupervised learning in the language domain.
- 🔢 Scaling up model parameters and pretraining on large unlabeled datasets allows for zero-shot and few-shot capabilities to emerge in language, image, and code generation tasks.
- 🖼️ Transformer models can be applied to model different modalities like images by treating them as sequences of pixels and using a next-pixel prediction objective.
- 🖥️ Generating diverse samples from language models through techniques like increasing temperature and re-ranking by mean log probability can significantly improve performance on tasks like code generation.
- 💻 Fine-tuning GPT-3 on code data and further supervised fine-tuning on function input-output examples can produce strong code-generating models like Codex.
- 🧪 Evaluating code generation models using functional correctness metrics like pass rates on unit tests is more informative than traditional match-based metrics like BLEU.
- 🌐 Transformer models can jointly model different modalities like text and images by training on concatenated sequences of text and image data.
- ⚠️ While powerful, code generation models still have limitations like variable binding issues and difficulties with composition of operations.
Q & A
What was the main motivation behind GPT at OpenAI?
-The GPT models were not originally created to push language modeling itself, but rather as a result of work on unsupervised learning in language.
How does GPT-3 differ from earlier GPT models in terms of performance?
-With GPT-3's much larger scale (175 billion parameters vs 1.5 billion for GPT-2), even just sampling the first completion often produces results comparable to taking the best of multiple samples from GPT-2.
What is the key insight that allowed GPT-3 to perform well on different tasks?
-The training process can be interpreted as meta-learning over a distribution of tasks, allowing GPT-3 to quickly adapt to new tasks based on the given prompt during inference.
How were the GPT models evaluated on tasks like reading comprehension and summarization?
-The prompts were framed in a natural language format, allowing zero-shot evaluation by having the model continue generating text based on the provided context.
What is the key advantage of using transformers for modeling different modalities like images?
-Transformers can ingest any sequence of bytes, allowing them to model various data modalities like images, audio, or video represented as sequences on computers.
How does DALL-E demonstrate the capability of transformers to model multiple modalities?
-DALL-E was trained on the joint distribution of text captions and images, allowing it to generate images conditioned on text captions or perform zero-shot multi-modal transformations.
What was the main motivation behind Codex, the code generation model?
-GPT-3 already showed rudimentary ability to write Python code, so the researchers wanted to explore training a model specifically on code data to enhance this capability.
What is the key advantage of the evaluation metric used for Codex over standard metrics like BLEU?
-The pass@k metric based on unit tests provides a ground truth evaluation of functional correctness, which BLEU and other match-based metrics cannot capture effectively for code.
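The pass@k idea described above can be written down directly. Below is a minimal sketch of the standard unbiased estimator (assuming n samples are generated per problem, of which c pass the unit tests); the numbers in the example call are illustrative only:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of
    k samples, drawn without replacement from n generated samples of
    which c are correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 20 correct, estimate pass@10.
print(pass_at_k(200, 20, 10))
```

Averaging this quantity over all problems in the benchmark gives the reported pass@k score.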
What is the 'unreasonable effectiveness of sampling' observed with Codex?
-Sampling many solutions from the model and reranking them significantly improves the pass rate, showing that the model composes different approaches rather than simply resampling the same approach.
What are some of the key limitations of the current code generation models?
-The models can struggle with maintaining proper variable bindings across complex operations and have difficulty composing multiple simple operations into more complex ones.
Outlines
📚 Evolution of Language Models
This paragraph discusses the progression of language models from the era of n-gram models to the development of neural network-based models, focusing on recurrent neural networks (RNNs), long short-term memory (LSTM) models, and the groundbreaking shift to transformer-based architectures like GPT-2 and GPT-3. It illustrates how each innovation contributed to improving the coherence and relevance of generated text, from producing largely incoherent gibberish to creating text that is not only coherent across multiple sentences but also maintains thematic consistency, albeit with occasional errors or nonsensical phrases. This evolution showcases the language models' growing ability to understand and generate human-like text, culminating in examples where GPT-2 and GPT-3 can produce impressively coherent stories and explanations.
🔍 Improving Coherence and Realism in Generated Text
The second paragraph delves into how advances in language models, specifically with GPT-3, have led to the generation of text that not only achieves greater coherence but also mimics the stylistic and thematic nuances of specific genres, such as novels. It addresses questions about the size and complexity of GPT-3 compared to its predecessors, emphasizing the significant increase in parameters (from 1.5 billion in GPT-2 to 175 billion in GPT-3) and how this scale contributes to the model's nuanced understanding and generation capabilities. The paragraph also touches on the concept of neural scaling laws, suggesting that the improvement in language models' performance can be anticipated based on the scaling of model size, training data, and computational resources.
🌐 From Supervised to Unsupervised Learning in Language
This section explores the shift from supervised learning approaches to unsupervised learning in the context of language modeling, highlighting the vast potential of leveraging the internet's extensive repository of unlabeled data. It outlines the challenges associated with unsupervised learning, such as the absence of direct objective alignment with desired downstream tasks, but also emphasizes the optimism in the language domain due to the availability of large amounts of text data. The paragraph elaborates on the utility of generative models, especially autoregressive models, in understanding and generating language by synthesizing diverse and coherent samples.
🚀 Leveraging Unsupervised Learning for Language Tasks
The fourth paragraph showcases how GPT-2, by being trained on large swaths of the internet, capitalizes on unsupervised learning to perform a variety of language tasks without task-specific fine-tuning. It demonstrates the concept of zero-shot learning through examples like reading comprehension, summarization, and translation, illustrating how GPT-2 can understand and respond to prompts in a context-aware manner. The discussion extends to the role of model scaling in enhancing zero-shot capabilities and the importance of finding effective measures to evaluate translation quality, underscoring the limitations of current metrics like BLEU.
🔬 Autoregressive Models and Their Applications Beyond Language
This paragraph discusses the application of autoregressive models, specifically the GPT architecture, beyond language tasks to other domains such as images, through a project called DALL-E. It highlights the flexibility of the transformer architecture in modeling different data modalities by converting images into a 'language' of pixels and then generating images based on text descriptions in a zero-shot fashion. The section emphasizes the universality of the autoregressive modeling approach and its effectiveness in handling tasks even where strong inductive biases exist, as demonstrated by successes in both text-to-image generation and code generation with Codex.
📈 Codex: Specializing GPT for Code Generation
The concluding paragraphs focus on Codex, a model specifically trained to generate code by fine-tuning GPT-3 on a large dataset of programming code. It details the motivation behind creating a model focused on code, the unique challenges of evaluating functional correctness in generated code, and the introduction of a new metric, pass@k, for this purpose. The discussion showcases how Codex significantly outperforms previous models in generating functionally correct code, highlighting the importance of sampling strategies and the potential for further improvements by integrating reranking techniques based on meaning rather than probability. The section closes with acknowledgments and reflections on the limitations of current models, pointing towards areas for future exploration and enhancement.
Mindmap
Keywords
💡Language Modeling
💡Unsupervised Learning
💡Autoregressive Modeling
💡Transformer Architecture
💡Zero-Shot Learning
💡Few-Shot Learning
💡Multimodal Learning
💡Code Generation
💡Neural Scaling Laws
💡Sampling
Highlights
The GPT model was not developed specifically for language modeling, but rather as a result of work on pushing unsupervised learning in language.
Autoregressive modeling is universal and can yield strong results even in domains with strong inductive biases, like images or text-to-image tasks.
By fine-tuning GPT-3 on code data and employing sampling, strong code-generating models can be produced, with the 'unreasonable effectiveness of sampling' significantly boosting model performance.
GPT-3 already had a rudimentary ability to write Python code from docstrings or descriptive method names, despite not being trained on much code data.
A new evaluation dataset called HumanEval was created, consisting of handwritten programming problems with function names, docstrings, solutions, and unit tests.
The 'pass@k' metric was introduced, measuring the average probability that at least one out of k samples passes the unit tests for a given problem.
Techniques like compressing runs of white space and fine-tuning from GPT-3 models were employed to make training more efficient.
Sampling at different temperatures affects the pass@k rate, with higher temperatures allowing for more diverse samples at the cost of lower individual sample quality.
Ranking samples by their mean log probability (mean log-p) rather than sampling probability can approximate the 'oracle sampling' performance without access to unit tests.
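This reranking heuristic is simple to sketch. The candidate completions and per-token log-probabilities below are invented for illustration:

```python
def rerank_by_mean_logp(samples):
    """Rank candidate completions by mean token log-probability.
    Using the mean rather than the sum avoids biasing the ranking
    toward short completions. `samples` is a list of
    (text, per_token_logprobs) pairs."""
    return sorted(samples, key=lambda s: sum(s[1]) / len(s[1]), reverse=True)

# Hypothetical candidates with made-up per-token log-probs.
cands = [
    ("return sum([a, b])", [-0.5, -0.9, -0.4, -0.7, -0.6]),
    ("return a + b", [-0.1, -0.2, -0.1]),
]
best_text, _ = rerank_by_mean_logp(cands)[0]
print(best_text)  # → return a + b
```

In practice one would sample many completions at a relatively high temperature (for diversity), then keep the top-ranked one.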
Fine-tuning Codex on additional data sources like competitive programming problems and projects with continuous integration tests further improved performance.
Generative models like Codex can struggle with variable binding and composition of simple operations, limitations that human programmers do not face.
Progress in neural language modeling has been rapid, driven by advances in unsupervised learning.
Autoregressive models can model any sequence of bytes, making them applicable to various modalities like images and audio.
Techniques like using contrastive loss during pretraining and scaling up model size significantly improved unsupervised learning capabilities.
The ability to distinguish between real and fake samples decreased as model size increased, approaching random chance for large language models like GPT-3.
Transcripts
Great.
OK, perfect.
So a sample from this model looks like this.
"They also point to ninety nine point
six billion dollars from two hundred four
oh six three percent."
It's a bunch of kind of gibberish.
So the sentence isn't too coherent,
but at least the words do seem to be somewhat related,
like they come from the same space.
Now, jumping forwards to the beginning of the deep learning
boom in 2011, we have language modeling with neural networks
now, and in particular with recurrent neural networks.
So you can get rid of this giant lookup table
from the n-gram models.
And instead, we can have our inputs be these tokens
and let this kind of recurrent cell remember
some persistent state.
So if we set up a neural model like this,
we get a sample as shown below.
"The meaning of life is the tradition
of the ancient human reproduction--
it is less favorable to the good boy for when to remove bigger."
So again, this doesn't really make any sense,
but it kind of starts to have the flow of a real sentence.
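The "giant lookup table" of n-gram models mentioned above can be sketched as a toy bigram model: a table counting which word follows which word, sampled from proportionally to the counts. This is illustrative Python, not from the lecture:

```python
from collections import Counter, defaultdict
import random

def train_bigram(text):
    """An n-gram model (here n=2) really is a giant lookup table:
    for each word, count which words follow it."""
    table = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table

def sample_next(table, prev):
    # Sample the next word proportionally to how often it followed `prev`.
    counts = table[prev]
    return random.choices(list(counts), weights=counts.values())[0]

tbl = train_bigram("the cat sat on the mat the cat ran")
print(sample_next(tbl, "the"))  # one of: "cat", "mat"
```

A recurrent network replaces this explicit table with a hidden state that summarizes the whole preceding context.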
Yeah, so jumping forward even more to 2016,
we have LSTM models.
And of course, LSTMs are an architectural innovation
on top of RNNs.
And they have better gradient flow,
so they can better model long-term dependencies.
And so with an LSTM model, we get a sample like this.
"With even more new technologies coming onto the market
quickly during the past three years,
an increasing number of companies
must tackle the ever changing and ever
changing environmental challenges online."
So this sentence is starting to make a little bit of sense,
though there are clear artifacts, like the repetition
of the phrase ever changing.
Now, starting in 2018, we have our first autoregressive
transformer based language models,
which are even better at modeling these very
long-term dependencies.
And here, what I'm showing is an example of a completion.
So in a completion, the user supplies the prompt.
In this case, it's this text, Wings Over Kansas.
And the model will continue from this prompt.
So you can see that this completion
is coherent across multiple sentences
now, though there are notable spelling mistakes.
So you see this whatever "daknfi" is.
So it doesn't kind of make sense.
And now we arrive at GPT-2, which is a 1.5 billion
parameter transformer model.
And I copied in what I personally
found was the most compelling conclusion from GPT-2.
And in contrast with the last slide, what this does
is it sets up a clearly fake prompt.
So we have something about finding unicorns and scientists
in South America.
And so the model has probably not
seen this exact prompt before.
It has to make up something that's consistent.
So the thing I find most impressive is it does so,
and it's coherent across multiple paragraphs.
It invents this fictional Dr. Perez,
and it persists Perez throughout multiple paragraphs.
And I think it's very aptly named.
You have him from University of La Paz.
And yeah, we just have barely coherent completions
at this point.
So it's worth disclosing that this
was the best of 10 samples.
So we still had to sample multiple times
to get a sample like this.
And finally, to end this section--
I'm sorry.
Can I interrupt?
Yeah, for sure.
We're not just thinking of examples of the failing,
the worst of the text.
I can post them up, yes.
[INAUDIBLE] what's bad and what's [INAUDIBLE]..
Yes, yes, yes, yes.
[INAUDIBLE]
[LAUGHTER]
Wait, sorry.
One last question.
When you have these 10-- you said
we took the best of the 10.
Best in what sense?
Yeah, so this is human-judged.
And I'll probably expand a little bit
on that more today, yeah.
So I want to end this kind of fly by overview with GPT-3.
And since GPT-2 already produces such coherent text,
how do you characterize GPT-3?
And I would say that the best way
to do so is to say you took the best out of five
or ten completions from GPT-2.
That would be kind of your first completion from GPT-3.
And of course, best is kind of a personal metric here.
So here, I'm showing a completion from the book
Three-Body Problem.
And you can see that the impressive things
about this completion are that it really stays
true to the style of the novel.
I think the second thing that kind of impressed
me was just how poetic like the metaphors and similes that it
produces are.
So you have this stuff like blood
was seeping through a jacket and a dark red flower
was blooming on her chest, like these kind of very, very
poetic and stylistic sentences.
So it definitely understands it's part of a novel,
and it's trying to generate this kind of prose
in the same style.
So as generated text becomes more and more coherent,
I think one of the really--
[INAUDIBLE] how much bigger is it in terms of the parameters,
is GPT-3?
Yeah, yeah, so it's 175 billion parameters versus GPT-2,
which is around 1.5 billion.
[INAUDIBLE]
Do you feel like that very subtle increase in accuracy
is the root cause of how much difference [INAUDIBLE]??
Yeah, that's a very good question.
So there's kind of stuff-- maybe we
can dive into it a little bit after,
but there is work on neural scaling laws.
And so the idea is like, can you predict the performance
of a larger model from a series of smaller models?
And so I would rather characterize the increase
in performance not by the small gain in perplexity,
but whether it lines up with the projections.
And in that sense, GPT-3 does.
So yeah, that's some intuition for--
yeah.
I think personally, I hope OpenAI would have stopped
the experiment if it didn't.
So yeah.
No,
I just think it's interesting for, this
is more of a general thing.
[INAUDIBLE]
In machine learning, you see people
pushing for like an extra 1% to probably 5% accuracy,
but the models are increasing at a scale that's exponential.
Right.
So I wonder sometimes whether it's worth it
and where you should stop [INAUDIBLE]..
Right.
Yeah, I think maybe this slide will get to it a little bit.
But there's also some sense in which
like as you reach kind of like the entropy floor of modeling,
every halving kind of gives you--
if you think about accuracy, it's not on a linear scale.
A 1% early on isn't the same as that last 1%.
And so those last bits really do help you squeeze
a little bit out of that.
That's obvious.
Yep.
Sorry.
[INAUDIBLE] the [INAUDIBLE] access too?
Oh yes.
Sorry, this is accuracy [INAUDIBLE]..
I will explain this slide.
Cool.
So as generated text becomes more and more realistic,
I think one very natural question to ask
is whether humans can still distinguish
between real and fake attempts, right?
And here we have--
this is, of course, a very set up scenario.
In all cases, the models wouldn't trick humans.
But this is for news articles, we kind of
presented GPT-3 generated samples
against real news articles.
And you can tell as the number of parameters increases,
the ability of humans to distinguish
between the real and fake articles--
that ability goes down to near random chance.
And, oh, yes?
How did you generate the news articles?
What prompts did you use?
Oh, I'm actually not completely sure.
So I didn't do this work particularly,
but I think one possible approach would
be to prime with a couple of news articles and then
just to have a delimiter and just
have it start generating news articles from there.
Yeah?
Any other quick questions?
Great.
So even with all of these impressive results,
I think it's worth taking a step back at this point and asking,
what do we really care about language modeling for?
And what is it actually useful for?
I think one can make the argument that it is actually
a fairly narrow capability.
Why would you just want some system that
just continues text for you?
And you could argue that there's more important tasks
to solve, like summarization or translation.
And I think most researchers at OpenAI
would agree with this point of view.
And in fact, GPT was not really a project
that was focused on language modeling as an end goal,
but mostly as a tool to solve a problem called
unsupervised learning, which I'm going to go through
in the next couple of slides.
So I want to do a history of language modeling at OpenAI
and hopefully motivate why we ended up
at the GPT series of models, and kind of how we arrived there.
And hopefully it will become much more intuitive
after this section.
So the deep learning boom started in 2012
with AlexNet, which was a system that
could take images and labels, and it could classify images
to their labels.
And what we found with AlexNet was these systems
were able to generalize surprisingly well.
You could take data sets that weren't necessarily
in the training distribution, and you'd still
get pretty good features on them.
And since then, this kind of supervised approach
has been really, really powerful, right?
We've been able to train models in many different domains
to classify very accurately.
And you can even have some guarantees
that supervised learning will work well.
So there's empirical risk minimization.
But the problem with supervised learning
is that oftentimes the labels are scarce, right,
especially in language tasks.
There aren't really that many texts
paired with their summaries, or that many pairs
across languages, for instance.
So collecting a lot of data can be not too hard, but actually
scalably labeling all of that data,
it could be very time consuming, and expensive.
So the main question of unsupervised learning
is, can we also learn from unlabeled data?
And this is a lot scarier, because, all of a sudden,
we're starting to optimize an objective, which
isn't the one we care about downstream, right?
So a lot of the guarantees that we used to have,
we no longer have.
And we can only kind of hope that we
learn some features that are adaptable to a wide variety
of downstream tasks.
But nevertheless, there's a reason
to be very optimistic in language.
And the reason is that there is a huge trove of unlabeled data.
And it's called the internet.
And so the real question is, can we
leverage all this available data from the internet
to solve language tasks where we don't really
have that much data?
And the hope is that if we kind of pretrain
this model on the internet, it will
see all of these words used in different settings,
kind of understand the relationships,
and they'll be able to leverage this kind of understanding
for any kind of task du jour.
So now that we've established why language
is such a good domain to try unsupervised learning in, let's
talk about why use generative models for it,
and also why use autoregressive generative models.
And I do want to stress that a lot of the guarantees we have
with supervised learning are no longer there for unsupervised
learning.
So some of these arguments will be
a little bit kind of intuitive.
And so the first argument I want to present is this quote
by Richard Feynman which is pretty widespread,
"What I cannot create, I do not understand."
And there's the inverse of this idea, which we call analysis
by synthesis.
And it's "What I can create, I can also understand."
And this has been studied by Josh Tenenbaum.
There's definitely some kind of biological motivation
as well for it.
But the idea here is that if you're
able to create a language model which can generate
diverse samples that are coherent, then
it must also build up representations
that can help you solve language understanding tasks.
And then the next question is, why do we
use autoregressive models?
You might argue that autoregressive models
are a kind of local objective.
You're just predicting the next words.
You could do really well with some n-gram approximation.
Why would it be good at solving things
that allow you to summarize an entire piece of text?
And so, an intuitive argument here
could be, say that you wanted to do very well on language
modeling for a mystery novel.
And there's this grand reveal at the end,
like, oh, the culprit was--
and then you want to predict that next token.
And to do really well at that task,
you really need to have a good understanding
of what happened in the story along with all the twists
and turns, and maybe even some of this kind of like deductive
reasoning built in.
So the first sign of life--
did you have a question?
[INAUDIBLE]
Oh, yeah.
So the first sign of life we had at OpenAI
was in the task of predicting whether Amazon reviews were
positive or negative.
And this was worked on in 2017.
So instead of training a classifier
in the kind of typical supervised way, what we did
was we trained an LSTM model just
to predict the next character in Amazon reviews.
And when we trained a linear model on the features
from this LSTM, what we found, surprisingly,
was one of these cells or one of these neurons
was firing in terms of predicting sentiment.
And positive activations for this neuron
corresponded to positive reviews,
and negative activations to negative reviews.
And this was despite not seeing any of the labels
at training time.
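This sentiment-neuron finding can be mimicked in miniature. The snippet below uses synthetic activations standing in for the LSTM's hidden states, with a signal deliberately planted at neuron index 7 (all data here is invented for illustration), and recovers that neuron by correlating each activation with held-out sentiment labels:

```python
import numpy as np

# Hypothetical setup: `h` holds a hidden-state vector per review
# (n_reviews x n_neurons) and `y` holds 0/1 sentiment labels that the
# LSTM never saw during training.
rng = np.random.default_rng(0)
n, d = 200, 16
y = rng.integers(0, 2, size=n)
h = rng.normal(size=(n, d))
h[:, 7] += 3.0 * (2 * y - 1)   # plant a "sentiment neuron" at index 7

# Correlate each neuron's activation with the sentiment label.
corr = np.array([abs(np.corrcoef(h[:, j], y)[0, 1]) for j in range(d)])
print(int(corr.argmax()))  # → 7, the planted sentiment neuron
```

In the actual work, a linear classifier was fit on the full hidden state, and one coordinate turned out to carry most of the predictive weight.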
So you can even track kind of what this neuron
value is across a sample.
So it's a little bit hard to read,
but these are reviews where maybe someone says,
oh, I really liked this film, but I didn't like this part.
And you can kind of see the sentiment switching as you
go from positive to negative.
So yeah, just predicting the next character
resulted in-- oh yeah?
Was there any sort of [INAUDIBLE] architecture
to encourage this?
No, this was just a pure LSTM.
OK.
So you guys came up with all the neurons,
saw which ones were closest?
Yeah, in the hidden state.
Yeah.
So you train a linear classifier on top of that.
And one neuron is firing with--
yeah, just outsized predictive power.
Yeah, great.
So next, GPT-1 was one of the first demonstrations
that this kind of approach could work broadly for text.
So GPT-1 was trained on the internet, not on Amazon reviews
anymore.
And it was fine tuned on a bunch of different downstream tasks.
And one thing to stress here was, kind of to your point
that the fine tuning was very--
I guess minimally-- you're not kind
of bashing the architecture apart and kind of repurposing
a new module.
So it's just a new head that classifies for your task.
And this showed that you can use this approach
not just for sentiment analysis, but also for entailments,
and semantic similarity, and getting SotAs
on a lot of these benchmarks downstream.
So I've already presented GPT-2 from the point
of view of a very powerful language model.
And now, I think it's worth revisiting from the viewpoint
of unsupervised learning.
So like GPT-1, GPT-2 was trained on a large chunk
of the internet.
And it's only trained to predict the next token
or word from previous words.
But the key insight of GPT-2 is that many downstream tasks
can be expressed naturally as language modeling tasks.
And yeah, so GPT-2 explores how well
we can perform on downstream tasks
simply by using this method without any fine tuning, right?
So let me start with a couple of examples.
So let's say you want to solve some reading comprehension
benchmark.
And this is usually set up as a prompt, which
is some passage you have to read,
and then a bunch of questions, which you have to answer.
So you can literally just stick the entire passage in context.
You put a question, colon, you write out
the question, answer, colon.
And then have the model complete from there.
And this gives you Zero-Shot reading comprehension.
We can also use it for other tasks, like summarization.
For instance, here's the beginning
of a CNN article about kind of some archaeological finding.
And you can just put TLDR after you see this passage.
And the model, hopefully, if it's good enough,
will produce good summaries.
And the final example I want to show
is that you can do Zero-Shot translation as well.
So the way you would do this is if you wanted to convert,
let's say, a French sentence into English,
you could set up a prompt like the sentence,
insert the French sentence, "translated from French
to English means," and then the model will complete.
And they can sometimes do this well.
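The prompt formats just described can be sketched as simple string templates. These are paraphrases of the formats from the talk, not the exact prompts used:

```python
def reading_comprehension_prompt(passage: str, question: str) -> str:
    # Passage in context, then "Question: ... Answer:" for the model to fill.
    return f"{passage}\n\nQuestion: {question}\nAnswer:"

def summarization_prompt(article: str) -> str:
    # Appending "TL;DR:" after the passage cues the model to summarize.
    return article + "\n\nTL;DR:"

def translation_prompt(french_sentence: str) -> str:
    # Zero-shot translation framed as text continuation.
    return (f'The sentence "{french_sentence}" '
            "translated from French to English means")

print(translation_prompt("Je pense, donc je suis"))
```

In every case the "task" is expressed entirely in the prompt, and the model's continuation is read off as the answer.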
And one kind of critical thing to note
here is here's a chart of performance
as you increase the number of parameters.
And all these models are trained on the same data sets,
so the only kind of confounding variable is scale.
And you can see that as we scale up the models,
these kind of Zero-Shot capabilities
emerge and kind of smoothly get better.
So the role of scale is important here.
And I think these are starting to approach--
I guess they're not great benchmarks, but at least
respectable benchmarks.
[INAUDIBLE]
Yeah, exactly.
It's not going to be great in a lot of cases.
And to be honest, the BLEU metric
used for translation is actually often--
thank you very much.
It's not a great metric.
What it does is it takes a reference solution.
And basically, it does some kind of like n-gram comparison.
So it is a big problem to have good translation
metrics in NLP.
And yeah, I think when I talk about code,
I'll talk a little more about [INAUDIBLE]..
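That n-gram comparison can be illustrated with a toy version of BLEU's core quantity, clipped n-gram precision (real BLEU combines several n-gram orders and adds a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Clipped n-gram precision: the fraction of candidate n-grams that
    also appear in the reference, with counts clipped to the reference."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

print(ngram_precision("the cat is on the mat",
                      "there is a cat on the mat"))  # → 0.4
```

The problem for code (and often for translation) is that two outputs can mean exactly the same thing while sharing very few n-grams, which is why functional tests are a better signal than surface overlap.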
So let's finally talk about how GPT-3 fits into this picture.
So the primary insight of GPT-3 is
that the training process itself can be interpreted
in the context of metalearning, which is kind of like learning
over a distribution of tasks.
And during training, what the model
is doing is it's developing certain kind of capabilities,
it's picking up some set of skills in terms
of modeling certain passages.
And during inference time, what it's doing,
it's kind of quickly picking up on what a task is based on what
the prompt is so far, and adapting to that task
to predict the next token.
So you can kind of view this as an outer loop of all the SGD
steps that you're doing during training,
and this inner loop of kind of picking up
on what the task is, and then modeling the next token.
So you can imagine a lot of tasks being framed in this way.
For instance, on the left, you can have addition.
You have a lot of examples of addition in the context.
And hopefully, that would help you with a new addition
problem, or you can try to kind of unscramble
a word for instance.
And I'll explore results on these two kind of benchmarks
in the next slides.
So this setting, you can call few-shot arithmetic.
And just to explain what's going on,
you're taking the entire context slide of your transformer
and you're putting in as many examples as will fit.
And then finally, you put in the example
that you would like to solve.
So here, these examples could be these kind
of first three addition problems,
and then you have 31 plus 41 equals.
And you ask the model to complete.
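Building such a few-shot prompt is just string packing; here is a minimal sketch with made-up example problems:

```python
def few_shot_prompt(examples, query):
    """Few-shot prompting: pack solved examples into the context window,
    then append the unsolved query for the model to complete."""
    lines = [f"{a} + {b} = {c}" for a, b, c in examples]
    lines.append(f"{query[0]} + {query[1]} =")
    return "\n".join(lines)

prompt = few_shot_prompt([(12, 7, 19), (5, 8, 13), (30, 4, 34)], (31, 41))
print(prompt)
```

In practice, as the talk notes, you fit as many examples into the context window as will fit, and the model's completion after the final "=" is taken as its answer.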
So you notice that as the language model gets bigger,
it's better able to recognize this task.
And you can see that performance on addition, subtraction,
even some kind of multiplication tasks
increases sharply as you go towards 200 billion parameters.
And there does seem to be kind of some step function change
right here.
And looking at word unscrambling,
this is also true.
So we have parameters again on the x-axis, we have accuracy.
And each of these is a different kind of unscramble task.
So this blue line is you kind of do
a cyclic shift of the letters, and you wanted to uncycle.
And there's a lot of other transforms
you can do, like randomly inserting words for instance.
So the final point here is that this
is a pretty general phenomenon.
We didn't just test it on these two aforementioned tasks.
We tried an array of I think 40 plus tasks.
And here, you can see how the Zero-Shot, One-Shot,
and Few-Shot performance increases
as we scale the models.
So of course, they're all smoothly increasing.
But one thing to be aware of is that the gap between Zero-Shot
and Few-Shot is also improving as a function of scale.
Awesome.
So we've just seen that we can pretrain the--
Oh.
Go ahead.
Sorry, with few-shot learning, I was curious, [INAUDIBLE]..
One is the tasks themselves that were used.
Two is the number of parameters.
And then three, my understanding is also
the quantity of [INAUDIBLE].
I was curious, between those three, which ones--
you've shown a lot of examples.
The number of parameters definitely helps.
I was curious though if you had a sense of the degree
to which also the training tasks and the sophistication
of the tasks, as well as the quantity of [INAUDIBLE]
adjustments [INAUDIBLE].
Yeah.
So I guess I can dive--
maybe it's something to save for or after.
Yeah, let's dig into that after.
Yes?
Just a thought, [INAUDIBLE] a little bit, too, right?
I guess GPT-2 and 3 aren't different.
GPT-1 just has an extra classification head
for certain tasks here.
Great, yeah.
Good questions.
So yeah, we've just seen that we can use a transformer
in this kind of pretrain-then-fine-tune setup,
where we have a lot of unlabeled data in the pretraining
setting.
And we have just a little bit of data
in the fine-tuned settings.
And we can solve a lot of language tasks in this way.
And I would say this has become the dominant paradigm
in language over the last couple of years.
So there are follow-up objectives, like BERT and T5,
which have done extremely well at pushing the state of the art.
But there's nothing really that says
that these transformer models have to be applied to language.
The transformer is a sequence model.
And as such, it can just ingest any sequence of bytes
and model them.
And when you think about this, all of the data
that we consume, like videos or audio,
they're represented on our computers
as sequences of bytes, right?
And so we might think, oh, could this approach
be used to just model whatever modality we want?
And I think this kind of paradigm
is at least very interesting when we don't really
have good inductive biases.
We don't necessarily know how to design them.
But one question to ask is, does it even
work when you do have really strong inductive biases?
So I'm going to present some work that
suggests that the answer is yes, it still
works fairly well in this case in the domain of images, where
convolutions are already so popular and proven out.
And I'm going to show a second result very briefly here,
which is DALL-E, which shows that it's strong enough
to even ingest two different modalities
and be able to jointly model them.
So the first question is, how would you apply GPT to images?
And there's a few things you have to do.
You have to modify this autoregressive next word
prediction objective.
So the natural analog is you can think of images
as a very strange language, where the words are pixels
instead.
And instead, you need to predict the next pixel at each point.
And so we can just change the objective for the next word
prediction to next pixel prediction.
And of course, we want this kind of large-- yeah?
[INAUDIBLE]
Oh, yeah.
So you just unroll it as a sequence.
It's the same way it's stored on a computer.
You just have a sequence of pixels, yeah, yeah.
Good question.
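To make the unrolling concrete, here is a minimal sketch of flattening an image into a pixel sequence and forming the next-pixel training pair. The sizes are toy-scale, and the real iGPT also reduces the color palette, which is omitted here:

```python
import random

# Toy 4x4 grayscale "image" unrolled into a 1D pixel sequence, the way
# iGPT treats pixels as words (the real model works on 32x32 images
# with a reduced color palette, omitted in this sketch).
H = W = 4
img = [[random.randrange(256) for _ in range(W)] for _ in range(H)]
seq = [px for row in img for px in row]  # row-major unroll, same order as storage

# Autoregressive training pair: predict each pixel from its prefix.
inputs, targets = seq[:-1], seq[1:]
```

The key point is that there is no 2D structure left in `seq`; the model has to rediscover the notion of a row from the data.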
So in the language setting, we pretrain
on this large unlabeled data set on the internet,
and we fine tuned on question answering
or these other benchmarks.
In images, one good analog of the situation
is you can pretrain on ImageNet without the labels.
If you have, let's say, a low resource-- a low data,
sorry, setting like CIFAR.
And you can try to attack CIFAR classification.
And of course, in both settings, you can do fine tuning.
In GPT, you can do Zero-Shot.
And I would say the standard eval
on images is you do linear probes,
so you take features from your model.
The model is frozen.
You pass CIFAR through the model,
get some features.
And you see how predictive these features
are of the CIFAR classes.
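A minimal sketch of the linear-probe idea, with placeholder functions standing in for the real model: the feature extractor stays frozen, and only the linear weights on top would be trained:

```python
# Sketch of a linear probe. `frozen_features` is a stand-in for a
# forward pass through the frozen pretrained model; it is never updated.
def frozen_features(x):
    return [x, x * x, 1.0]  # placeholder "features" for a scalar input

def probe(feats, weights):
    # The only trainable part: a linear map on top of frozen features.
    return sum(f * w for f, w in zip(feats, weights))

# "Training" would fit only `weights`; here we just evaluate one probe.
score = probe(frozen_features(2.0), [0.5, 0.25, -1.0])
```

The quality of `weights` you can fit this way measures how linearly predictive the frozen features are of the CIFAR classes.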
Is it kind of like PixelCNN, which basically
asks a model to predict the next pixel given the [INAUDIBLE]?
Yeah.
So PixelCNN is an instantiation of an autoregressive image
generation model.
So what we're asking here is, can we actually take
the same transformer architecture that we
use in language, don't make any modifications at all,
and just throw--
so there's no kind of 2D prior on this, yeah.
So yeah, I'll call this model that we trained Image GPT,
or iGPT for short.
And here, you can see actually what some completions
from the model look like.
So on the left column what I'm feeding in
is the pixels of the first half of the image.
In the next four columns, what you're seeing
is different model generated completions.
In the right column here is the original reference image.
And you can actually see that the model
is kind of doing some interesting things, right?
If you look at the last two rows,
it's not coming up with kind of magically the same completion
every single time.
It's like putting these birds in different settings,
sometimes adding reflections.
It's putting this lighthouse in grassy areas
and like watery areas for instance.
So if you buy into this philosophy of analysis
by synthesis, we definitely have some hint
of the synthesis part.
So I don't have time to go through all of the results with
you, but I just want to say that it is fairly successful in this
CIFAR setting where you don't have much labelled data.
If you train a linear model on top of the features,
you get better results than if you do the same approach
with a ResNet trained on ImageNet with labels.
So that's the typical approach in the paper.
You train some ResNet on ImageNet,
you get the features-- oh, yeah?
[INAUDIBLE]
Oh, yeah.
And if you compare to this approach, a generative model
on ImageNet without the labels, take the features.
It's actually better predictive of [INAUDIBLE]..
Yeah, [INAUDIBLE].
What if the architecture for this is the same [INAUDIBLE]??
Oh, yeah.
[INAUDIBLE]
Exactly, yeah.
[INAUDIBLE]
Yeah, yeah, yeah.
It's the GPT architecture, yeah, yeah.
So you can modify GPT to have like a 2D bias.
Like you can do 2D position embeddings.
We would be able to do that.
We just want to see can you use the same exact approach.
Yeah?
So earlier, you said the data's just sequential.
But there's also metadata
about how that sequence should be reconstructed at the end.
So what's the width, for example.
Oh, can you explain?
Yeah.
Sorry if I didn't say that well.
So the data on this [INAUDIBLE]?
Yes.
OK.
But when you want to transform this sequence into an image,
you have metadata that will say something
like-- just like in NumPy arrays, it'll say,
here's the stride.
So you're just going to rearrange it [INAUDIBLE]..
I see.
What I'm curious to know, is does
GPT, before it's given an image, at least given
this metadata [INAUDIBLE]?
I see.
Yeah, that's an extremely good question.
Because I don't know how this problem is solved.
Yeah.
In this case, all the images have the same shape.
Oh, OK.
OK, cool.
But we don't tell it like the concept of row
within the model, yeah.
But if all images are the same?
Yeah, so it needs to learn it from the data.
But yeah, the data looks the same.
Got it.
[INAUDIBLE] variable image shapes,
then they can just submit [INAUDIBLE]..
Yeah.
Mhm.
Aren't there a lot more pixels than there are
[INAUDIBLE] sizes [INAUDIBLE]?
Yes.
This is pretty low resolution images.
Yeah, so we can actually-- the models
we're comparing against are trained on kind
of high resolution images.
So I think that makes it even more impressive.
Yeah, we're just training on the 32 by 32 res images, yeah.
Cool.
So if we fine tune these models for CIFAR classification,
we can get 99% accuracy, which matches GPipe.
GPipe, for instance, is a system which
is pretrained on ImageNet with labels and then also fine-tuned
with labels.
So yeah, it just kind of shows you,
even this approach which doesn't really know about convolutions
can do well.
I think you're going to hear more about that next week
with Lucas' talk.
So by now, it shouldn't be surprising at all
that you can model a lot of different modalities
with transformers.
So in DALL-E, we just ask, what about throwing
two different modalities at the model
and seeing if it can learn how to condition on text
to produce an image.
And for instance, one thing you might want it to do
is like you provide one of these text captions,
and you want it to generate some image like the one below.
And the easy way to do this is just
train a transformer on the concatenation
of a caption and an image.
And of course, in a lot of these situations,
the idea is very simple, but the implementation and execution
is where the difficulty is.
And I'm not going to talk too much about that.
I think the focus today is on language.
But you can refer to the paper for a lot of those details.
Could you describe what you do if you have a variable-length
caption?
Okay, yeah, so you have a max caption length,
and you just kind of cut it off at that length.
And you can pad up to that length.
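A small sketch of that truncate-and-pad step on tokenized captions; the token IDs, pad token, and maximum length here are made up:

```python
# Sketch of fitting a tokenized caption to a fixed length, as described
# for DALL-E (illustrative values; the real vocabulary and lengths differ).
MAX_LEN, PAD = 8, 0

def fit_caption(token_ids, max_len=MAX_LEN, pad=PAD):
    token_ids = token_ids[:max_len]                        # cut off long captions
    return token_ids + [pad] * (max_len - len(token_ids))  # pad short ones

print(fit_caption([5, 9, 3]))
```

The fixed-length caption is then concatenated with the image tokens into one sequence for the transformer.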
So you can see that it can generate fairly good samples.
So if you want like a storefront with the word OpenAI on it,
it's not perfect, but at least it's
kind of like reverse OCR problem, where you
take some text and render it.
And it's kind of typically rendering it
in like office-looking places.
So that's one encouraging sign.
But I do think my favorite results here are zero-shot
image-to-image translation.
So what's going on here, is, for instance,
if your prompt is "the exact same cat
on the top as a sketch on the bottom,"
and you feed in the top half of this image, which is a cat,
and you ask it to complete the rest of the image,
then it'll render the top cat actually as like a sketch.
And you can do the same thing with flipping over
photos for instance.
You can zoom in to a photo.
Of course, they're not perfect, but it
has some understanding of what the text is trying to do, yeah.
In the caption originally like in the training set,
do they have wording such as extreme closeup view?
I think that-- there probably are some examples like that,
and that's probably where it's picking up
some of this knowledge from, though we
don't seek out these examples.
It's just--
[INAUDIBLE]
Yeah exactly.
[INAUDIBLE]
OK, perfect.
This is just-- we just go and do a massive web scrape.
We're not trying to find examples like this, right?
And so you can also do things like colorization, right?
You can take the cat and color it red.
And this has to kind of recognize what
the object is in the figure.
And yeah, here, you can do stuff like semantic transformations,
like adding sunglasses into the cat.
And you can put it on postage for instance.
So it's just remarkable that you can
do a lot of these like transform Zero-Shot.
It wasn't trained to do these things specifically.
Cool, so moving on, the last section of my talk today
is on Codex, which is our most recently released code-writing
model.
And the first question you should rightly ask here
is, why train a model on code at all?
At this point, isn't it just another modality?
And what is the novelty that there is at this point, right?
So let me give you a couple of reasons.
So first is that GPT-3 had a rudimentary ability
to write Python code already from a docstring
or a descriptive method name.
And we actually didn't train it on much code data.
Actually, I think there might have been active filtering
to get rid of code data.
And so we were surprised that there
is this capability anyway.
So we thought if we actually purposed a model
and trained it on the large amount of code
that we can find, maybe something interesting
will happen there.
Next, what sets apart code from other modalities
is that there is a kind of ground truth
correctness of a sample.
And functions can be tested with unit tests and an interpreter.
So this is very different from language,
where to get a ground truth eval,
you might need a human to come in.
And even then, sometimes humans won't agree.
Like, this is the better example or this
isn't the better sample.
Last thing is I used to dabble in competitive programming
myself, and I really wanted to create a model that could solve
problems that I couldn't.
Go ahead.
[INAUDIBLE]
Is this the same thing [INAUDIBLE] get up on this?
[INAUDIBLE]
Yeah, exactly.
[INAUDIBLE]
Yeah, we wrote a paper on it, too, so, yeah.
So I recognize that you use kind of a high level programming
language where it's basically similar to like
our human language.
Have you guys ever tried to predict some even lower
level operations like CPP, or--
Yeah, I think there's follow-up work where we just
train on a bunch of different languages.
And I don't know the metrics off the top of my head,
but I have seen some assembly writing models, cool.
So I guess, yeah, continue on the [INAUDIBLE]..
So we have this setting where we have unit test and interpreter.
So how do we actually evaluate these models
in a way that's kind of aware of these two concepts?
So the first thing we did was we have a data set, a new data
set, which is 164 handwritten programming problems.
And these kind of have the format shown here.
There's a function name, a docstring, there's a solution,
and there's an average of around eight unit tests per problem.
And why is it important that we hand wrote these?
Well, the thing is we're training
on such a large part of GitHub.
If you said, OK, I'm going to take like some LeetCode
problems, and I'm going to turn them into an evaluation.
That's not going to work, because there's
just so many GitHub repos that are like, oh, here's
the solution to this LeetCode problem.
So while this doesn't kind of guarantee
that this problem isn't duplicated,
at least someone wrote it without copying it
from another source.
So here's some kind of examples of a unit test
that you would evaluate the previous function on.
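As a sketch of this ground-truth signal, a candidate completion either passes its unit tests or it doesn't. This is heavily simplified, with no sandboxing, and the function and tests are illustrative rather than taken from HumanEval verbatim:

```python
# Sketch of judging a candidate solution by unit tests (simplified:
# the real harness executes untrusted completions in a sandbox).
def incr_list(lst):
    """Increment all elements of a list by 1."""
    return [x + 1 for x in lst]

def passes(fn):
    # The candidate gets full credit only if every test passes.
    try:
        assert fn([1, 2, 3]) == [2, 3, 4]
        assert fn([]) == []
        return True
    except Exception:
        return False

print(passes(incr_list))
```

Any exception, wrong output, or crash counts as a failure, which is what makes this a binary, objective metric.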
I think it should be fairly clear that we
should be using this metric.
This is the correct kind of ground truth metric to use.
I mean, humans do use unit tests to evaluate code.
And I would say if you're familiar with competitive
programming, you can't manually judge
all like tens of thousands of submissions that are coming in.
You need the unit tests.
And that is a fairly good replacement.
So one interesting point here was
we had to create a sandbox environment
to run these kind of generated solutions in.
Because when you train on GitHub,
there's a bunch of malicious code,
there's a bunch of kind of insecure code.
You don't want your model to be sampling
that and kind of running that on your environment.
Cool.
So now that we have an evaluation
data set, let's define a metric on them.
And so the metric we're going to use is called pass @ K.
And the definition is the average probability
over all the problems that at least 1 out of K samples
passes the unit tests.
So if we evaluate this metric by just taking every problem
and exactly generating k samples,
there's high variance just kind of sampling it that way.
Imagine the pass rate of a particular problem is around 1
over k.
This is kind of like an all-or-nothing metric.
So what we do instead is we generate a much larger set
of samples, n greater than k--
most of the time, it's greater than 5k.
And we count the number that are correct,
and we compute this unbiased estimator.
And it looks more complicated than it actually is.
It's just complementary counting.
You take the number of combos where all of them fail
and subtract that out.
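That estimator can be written directly. This is the formula from the Codex paper, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems; the paper uses a numerically stable product form, but for these sizes `math.comb` works fine:

```python
from math import comb

# Unbiased pass@k estimator: generate n >= k samples per problem,
# count c correct, then subtract the probability that a random
# size-k subset contains only failures (complementary counting).
def pass_at_k(n, c, k):
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 10, 1))  # 1 - 190/200 = 0.05
```

Averaging `pass_at_k` over all 164 problems gives the reported benchmark number.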
Cool.
So then we train our model.
And like I alluded to earlier, there's
about 160 gigabytes of code which is collected
from 54 million repositories.
For efficient training, what we did
was we fine tuned from GPT-3 models of various sizes.
And this isn't actually strictly necessary.
We find that we can get to roughly the same final loss
in performance without this, but it is slower
to do it without this pretraining step.
And so we already have these models;
why not just fine tune them?
And one extra trick to make training much faster here is--
in code, there's a lot of runs of spaces, right,
and those don't get compressed efficiently in language
because you just don't see them very often.
So they typically get broken up into like many separate tokens.
So we introduce additionally some tokens that
compress runs of white space.
And that makes training maybe like 30% or 40% more efficient.
So the token [INAUDIBLE]?
Yeah, exactly, yeah.
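A sketch of the whitespace-compression idea: dedicated tokens for runs of spaces, so indentation doesn't fragment into many single tokens. The token names and maximum run length here are illustrative, not the actual Codex vocabulary:

```python
# Sketch of compressing runs of spaces into single tokens
# (token names and MAX_RUN are made up for illustration).
MAX_RUN = 8

def compress_spaces(line):
    out, i = [], 0
    while i < len(line):
        if line[i] == " ":
            j = i
            while j < len(line) and line[j] == " " and j - i < MAX_RUN:
                j += 1
            # Runs of 2+ spaces become one dedicated token.
            out.append(f"<space_{j - i}>" if j - i > 1 else " ")
            i = j
        else:
            out.append(line[i])  # non-space characters pass through
            i += 1
    return out

print(compress_spaces("        return x"))
```

An indented line that would otherwise cost eight space tokens now costs one, which is where the training-efficiency gain comes from.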
Great, so once we have these models,
we can go and revisit the HumanEval data set.
And I can share a couple of problems
to give you a sense of where the models are at and also
what kind of difficulty level the problems in the data set
are at.
So this is a 12 billion parameter model.
The pass rate is 90%, which means that 90% of the samples
will pass the unit test.
This is something like anyone kind
of doing a first day of Python would be able to do.
So you increment all the elements of a list by 1.
Here's a problem where the pass rate is 17%.
So this is the solution I gave-- that's the problem I
gave earlier.
So you are given a non-empty list of integers.
You want to return the sum of all odd elements that
are in even positions.
And this might not sound that much harder to you,
but models can often get confused about, oh,
is odd referring to positions or elements?
And so here, you can actually see that it's
doing the right thing.
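The task just described can be solved in one line. This is a plausible solution in the spirit of the benchmark, not necessarily the model's verbatim output:

```python
# Sum of the odd elements that sit at even (0-based) positions
# of a non-empty list of integers.
def solution(lst):
    return sum(x for i, x in enumerate(lst) if i % 2 == 0 and x % 2 == 1)

print(solution([5, 8, 7, 1]))  # 5 (index 0) + 7 (index 2) = 12
```

The subtlety the models trip over is exactly the one visible here: "odd" applies to the elements, "even" to the positions.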
And finally, this is an example of one
of the harder problems in the data set.
So the pass rate is under 1% here.
And what's going on here is actually
there's an encode function which takes a string.
It kind of chunks it up into groups of three characters.
And it does a cyclic shift on each character.
And you have to write a decoder, something
that reverses this operation.
So you can see that the model-- this is a real model
solution, so it chunks up the characters in the same way.
You can see that the cyclic shift is the opposite way.
So up there, it takes the first element of each group,
moves it to the end, and now takes
the last element of each group, moves it to the front.
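A reconstruction of the encode/decode pair being described, paraphrased; the actual HumanEval problem text may differ slightly:

```python
def encode_cyclic(s):
    # Chunk into groups of three characters and cycle each full group:
    # the first character moves to the end.
    groups = [s[i:i + 3] for i in range(0, len(s), 3)]
    return "".join(g[1:] + g[0] if len(g) == 3 else g for g in groups)

def decode_cyclic(s):
    # Reverse the shift: the last character of each full group
    # moves back to the front.
    groups = [s[i:i + 3] for i in range(0, len(s), 3)]
    return "".join(g[-1] + g[:-1] if len(g) == 3 else g for g in groups)

assert decode_cyclic(encode_cyclic("hello world")) == "hello world"
```

Writing `decode_cyclic` requires inverting an operation that is only specified implicitly by the encoder's code, which is why the pass rate is so low.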
Yeah?
OK, I'm wondering what's the effect
of-- so you had a couple of examples [INAUDIBLE]
in the comments.
So I'm wondering if the model will
be able to extrapolate what it's doing
by the examples [INAUDIBLE] underlying [INAUDIBLE]..
Right, yeah.
So some of our tasks, there are some examples in the docstring.
And some of them don't.
I think it's just to kind of match
the distribution of a real kind of task
we find in the real world.
In this case, it doesn't have it.
But definitely for the unit tests, none of those
appear within--
I'm just curious-- if you just give it the examples
and not give a description of the task [INAUDIBLE]..
Oh, I see, I see.
So can it do like pure induction, where you don't
tell the task at all, yeah.
I haven't tried it, to be honest.
I think it's worth a shot.
Yeah.
Thanks.
At this point, we've trained Codex models.
We've evaluated on this metric.
But the thing is, was it worth all this trouble, right?
You already had these metrics like BLEU
that are match-based in language.
Couldn't we have just used this to [INAUDIBLE]??
We don't need an interpreter.
We don't need to generate so many samples.
And it would be great if it kind of
like separated out like this.
But what we find is that this is--
if you take four random problems from HumanEval
and you plot the distribution of BLEU scores
for correct and wrong solutions, you actually
find a lot of distribution overlap, right?
It's hard to distinguish the green
from the blue distributions.
And so this suggests that BLEU actually
isn't a very good metric for gauging functional correctness
and that we actually do need this new kind of metric
and this new data set.
So now, let's explore the setting where in pass @ k,
k is greater than 1.
And so the first observation we have here
is that the temperature that you sample at,
it affects your pass @ k.
And just for some intuition, if you do temperature zero
sampling, you're going to get the same sample
every single time; you're doing argmax sampling.
So it doesn't matter how many samples you generate.
You're just going to get the same pass rate.
And if you want to generate 100 samples,
right, you can afford to make some mistakes.
You just want a very diverse set of samples.
So you can up the temperature.
And you can see that as you up the temperature, the slope
of the number of samples against pass rate
becomes steeper.
And so you can kind of take the upper hull of this
and you can find the optimal temperature
for each number of samples.
And so this brings me to personally my favorite result
of the paper, which I call the unreasonable
effectiveness of sampling.
And so let me explain what's going on here.
This is the number of parameters in the model.
And here, you have pass rate @ 1 and pass rate @ 100.
And the reason I use this term unreasonable effectiveness
is that I think there's a world where,
if the orange line and the blue line weren't that far apart,
I might not be that surprised.
At these scales, the model, it rarely makes syntactical errors
anymore.
If you run it, it'll run and produce some kind of output.
So you could imagine a world where basically the model
has some approach in mind.
It's just repeatedly sampling that approach.
And it's just either right or wrong.
But instead what we find is that the model is actually
composing different parts and producing
functionally different things.
And you get this huge boost from under 30% to over 70%
just by sampling a lot of samples from the model.
So unfortunately, knowing that one of your samples is correct
isn't that useful if you don't have access to the unit tests.
And now one practical setting where
you would care about this is say you're
creating an autocomplete tool, right,
and you generate 100 samples.
But you don't want to show your user 100 samples
and have them pick one, right?
You want to kind of try to prefilter,
but you don't have unit tests.
So can we kind of approximate this oracle sampling
with some other ranking heuristic?
So here, I'm showing a couple of different heuristics,
like if you randomly pick one.
But the one that seems most promising
is to rank by mean log probability.
And it's maybe not theoretically well-grounded,
but in language, this kind of heuristic
is fairly strong as well.
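A minimal sketch of reranking by mean log probability; the sample names and per-token log probabilities are made up:

```python
# Sketch of reranking sampled programs by mean token log-probability.
# Each entry is (sample_id, per-token logprobs); values are illustrative.
samples = [
    ("sample_a", [-0.2, -0.1, -0.4]),
    ("sample_b", [-1.5, -0.9]),
    ("sample_c", [-0.05, -0.3, -0.2, -0.1]),
]

def mean_logp(token_logps):
    # Mean rather than sum, so longer samples aren't penalized
    # just for having more tokens.
    return sum(token_logps) / len(token_logps)

best = max(samples, key=lambda s: mean_logp(s[1]))
print(best[0])
```

In practice you would generate many samples, score each this way, and surface only the top-ranked one to the user in place of the unit-test oracle.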
So recall that what we're doing is
we have this evaluation set where we have
kind of standalone functions.
We want to produce solutions to them.
But when we're doing training, there's
a lot of code that isn't relevant for this task.
For instance, there's a lot of classes that we're seeing.
There's actually data classes, too,
which aren't relevant often.
Actually, there's a lot of incorrect code on GitHub too.
So we might be modeling incorrect solutions as well as
correct ones.
So one thing we thought was, let's fine-tune Codex
further on a couple of data sets where
they are standalone functions and you
have kind of more guaranteed correct solutions to that.
So what we did was we found these problems
from a couple of sources.
So one is competitive programming problems.
You can go on these sites.
Oftentimes, they'll just give you the unit tests.
Sometimes, when they don't give you the unit tests,
you can submit incorrect solutions
and they'll tell you the first one you failed on.
And you can kind of keep just doing that.
[LAUGHTER]
So you can get a lot of competitive programming
problems.
And another source is projects where continuous integration
is enabled.
So why are these useful?
Because you can actually kind of do an execution tracing.
So when you run the integration tests,
you can get all the inputs to functions
that are called and their outputs as well.
And so you actually have the true function body.
You know what the test output is supposed to be,
so you know kind of the ground truth inputs and outputs.
And these are kind of like two orthogonal data sets.
One helps you with algorithmic kind of tasks.
And one is more kind of like trying
to manipulate command line utilities and [INAUDIBLE] that.
So this brings us to the main figure of the Codex paper.
So really what we're seeing is a progression of capabilities.
So with GPT-3 on this HumanEval data set, the pass rate @ 1
is 0 basically.
You can generate one or two lines
coherently but never really a whole program coherently.
Now, when you fine tune on code, which
is Codex, this orange line, you start
to see some non-negligible performance on this data set.
When you do this additional supervised fine-tuning--
that's this green line--
you get even better pass rates.
And then if you kind of generate 100 samples from this model,
rerank with mean logp, even better pass rates.
And finally, of course, we have tests in Oracle.
It gives you the best pass rates.
So one question here is, can you actually
use a reranking tool, like put it in the model?
Can you use it as a backprop signal?
Yeah, yeah, so we can explore that.
I don't know if I can say too much about those results.
Yeah, got it, got it.
But yeah.
And finally, I don't want to suggest
that these models are perfect.
They have a lot of limitations that human programmers
don't run into.
So one is like--
actually all generative models are--
autoregressive generative models,
we have some problems with binding.
So when there's a lot of variables going on,
like a lot of operations going on,
sometimes it's hard to figure out which operation
is binding to which variable.
So you can kind of see some examples of that on the left.
And one other kind of counterintuitive behavior
is composition.
So we can take a bunch of very simple building blocks,
like take a string and reverse it,
or delete every third character or something.
And a human, if you can chain two of these operations,
you could probably chain 10 of them.
But our models aren't able to do that yet.
Cool.
So moving on to the conclusion, we've
had four main points in today's talk.
So first, progress in neural language modeling
has been fairly rapid.
And GPT wasn't the result of a push on language modeling, more
of a result of work on pushing unsupervised learning
in language.
The third point is that autoregressive modeling
is universal.
And it can yield strong results, even when there
are strong inductive biases, like in images or in text
to image.
And finally, we can produce strong code generating models
by fine-tuning GPT-3 on code.
And sampling is an unreasonably effective way
to improve model performance.
Cool, and to end with some acknowledgments,
I want to thank my Codex primary co-authors, some mentors
at OpenAI, and the algorithms team, which
I've worked very closely with.
Great.
Thank you guys for your attention.