Stanford CS25: V1 I Transformers in Language: The development of GPT Models, GPT3

Stanford Online
11 Jul 2022 · 48:38

Summary

TLDR: This video presents a comprehensive overview of the rapid progress in neural language modeling, culminating in GPT-3, a powerful 175-billion-parameter autoregressive transformer model. It traces the evolution of language models from n-gram models to recurrent neural networks, LSTMs, and transformers, highlighting their increasing coherence and ability to generate realistic text. It explains the unsupervised learning approach behind the GPT series, showing how models like GPT-3 can perform tasks from reading comprehension to translation without explicit fine-tuning, and it demonstrates the versatility of transformers in other modalities, with impressive results in image generation and code writing.

Takeaways

  • 🤖 Progress in neural language modeling has been rapid, driven by work on unsupervised learning in language.
  • 🧠 Autoregressive modeling with transformers is a universal approach that can yield strong results even in domains with strong inductive biases like images or text-to-image generation.
  • 📝 GPT models were not initially focused on language modeling itself, but rather on pushing the boundaries of unsupervised learning in the language domain.
  • 🔢 Scaling up model parameters and pretraining on large unlabeled datasets allows for zero-shot and few-shot capabilities to emerge in language, image, and code generation tasks.
  • 🖼️ Transformer models can be applied to model different modalities like images by treating them as sequences of pixels and using a next-pixel prediction objective.
  • 🖥️ Generating diverse samples from language models through techniques like increasing temperature and re-ranking by mean log probability can significantly improve performance on tasks like code generation.
  • 💻 Fine-tuning GPT-3 on code data and further supervised fine-tuning on function input-output examples can produce strong code-generating models like Codex.
  • 🧪 Evaluating code generation models using functional correctness metrics like pass rates on unit tests is more informative than traditional match-based metrics like BLEU.
  • 🌐 Transformer models can jointly model different modalities like text and images by training on concatenated sequences of text and image data.
  • ⚠️ While powerful, code generation models still have limitations like variable binding issues and difficulties with composition of operations.

Q & A

  • What was the main motivation behind GPT at OpenAI?

    -The GPT models were not originally created to push language modeling itself, but rather as a result of work on unsupervised learning in language.

  • How does GPT-3 differ from earlier GPT models in terms of performance?

    -With GPT-3's much larger scale (175 billion parameters vs. 1.5 billion for GPT-2), even just sampling the first completion often produces results comparable to taking the best of multiple samples from GPT-2.

  • What is the key insight that allowed GPT-3 to perform well on different tasks?

    -The training process can be interpreted as meta-learning over a distribution of tasks, allowing GPT-3 to quickly adapt to new tasks based on the given prompt during inference.

  • How were the GPT models evaluated on tasks like reading comprehension and summarization?

    -The prompts were framed in a natural language format, allowing zero-shot evaluation by having the model continue generating text based on the provided context.

  • What is the key advantage of using transformers for modeling different modalities like images?

    -Transformers can ingest any sequence of bytes, allowing them to model various data modalities like images, audio, or video represented as sequences on computers.

  • How does DALL-E demonstrate the capability of transformers to model multiple modalities?

    -DALL-E was trained on the joint distribution of text captions and images, allowing it to generate images conditioned on text captions or perform zero-shot multi-modal transformations.

  • What was the main motivation behind Codex, the code generation model?

    -GPT-3 already showed rudimentary ability to write Python code, so the researchers wanted to explore training a model specifically on code data to enhance this capability.

  • What is the key advantage of the evaluation metric used for Codex over standard metrics like BLEU?

    -The pass@k metric based on unit tests provides a ground truth evaluation of functional correctness, which BLEU and other match-based metrics cannot capture effectively for code.

  • What is the 'unreasonable effectiveness of sampling' observed with Codex?

    -Sampling many solutions from the model and reranking them significantly improves the pass rate, showing that the model composes different approaches rather than simply resampling the same approach.

  • What are some of the key limitations of the current code generation models?

    -The models can struggle with maintaining proper variable bindings across complex operations and have difficulty composing multiple simple operations into more complex ones.

Outlines

00:00

📚 Evolution of Language Models

This paragraph discusses the progression of language models from the era of n-gram models to the development of neural network-based models, focusing on recurrent neural networks (RNNs), long short-term memory (LSTM) models, and the groundbreaking shift to transformer-based architectures like GPT-2 and GPT-3. It illustrates how each innovation contributed to improving the coherence and relevance of generated text, from producing largely incoherent gibberish to creating text that is not only coherent across multiple sentences but also maintains thematic consistency, albeit with occasional errors or nonsensical phrases. This evolution showcases the language models' growing ability to understand and generate human-like text, culminating in examples where GPT-2 and GPT-3 can produce impressively coherent stories and explanations.

05:03

🔍 Improving Coherence and Realism in Generated Text

The second paragraph delves into how advances in language models, specifically with GPT-3, have led to the generation of text that not only achieves greater coherence but also mimics the stylistic and thematic nuances of specific genres, such as novels. It addresses questions about the size and complexity of GPT-3 compared to its predecessors, emphasizing the significant increase in parameters (from 1.5 billion in GPT-2 to 175 billion in GPT-3) and how this scale contributes to the model's nuanced understanding and generation capabilities. The paragraph also touches on the concept of neural scaling laws, suggesting that improvements in language-model performance can be anticipated from the scaling of model size, training data, and computational resources.

10:03

🌐 From Supervised to Unsupervised Learning in Language

This section explores the shift from supervised learning approaches to unsupervised learning in the context of language modeling, highlighting the vast potential of leveraging the internet's extensive repository of unlabeled data. It outlines the challenges associated with unsupervised learning, such as the absence of direct objective alignment with desired downstream tasks, but also emphasizes the optimism in the language domain due to the availability of large amounts of text data. The paragraph elaborates on the utility of generative models, especially autoregressive models, in understanding and generating language by synthesizing diverse and coherent samples.

15:03

🚀 Leveraging Unsupervised Learning for Language Tasks

The fourth paragraph showcases how GPT-2, by being trained on large swaths of the internet, capitalizes on unsupervised learning to perform a variety of language tasks without task-specific fine-tuning. It demonstrates the concept of zero-shot learning through examples like reading comprehension, summarization, and translation, illustrating how GPT-2 can understand and respond to prompts in a context-aware manner. The discussion extends to the role of model scaling in enhancing zero-shot capabilities and the importance of finding effective measures to evaluate translation quality, underscoring the limitations of current metrics like BLEU.

20:04

🔬 Autoregressive Models and Their Applications Beyond Language

This paragraph discusses the application of autoregressive models, specifically the GPT architecture, beyond language tasks to other domains such as images, through a project called DALL-E. It highlights the flexibility of the transformer architecture in modeling different data modalities by converting images into a 'language' of pixels and then generating images based on text descriptions in a zero-shot fashion. The section emphasizes the universality of the autoregressive modeling approach and its effectiveness in handling tasks even where strong inductive biases exist, as demonstrated by successes in both text-to-image generation and code generation with Codex.

25:06

📈 Codex: Specializing GPT for Code Generation

The concluding paragraphs focus on Codex, a model specifically trained to generate code by fine-tuning GPT-3 on a large dataset of programming code. It details the motivation behind creating a model focused on code, the unique challenges of evaluating functional correctness in generated code, and the introduction of a new metric, pass@k, for this purpose. The discussion showcases how Codex significantly outperforms previous models in generating functionally correct code, highlighting the importance of sampling strategies and the potential for further improvements by integrating reranking techniques based on meaning rather than probability. The section closes with acknowledgments and reflections on the limitations of current models, pointing towards areas for future exploration and enhancement.

Keywords

💡Language Modeling

Language modeling is the process of building computational models that can understand and generate human language. In the video, language modeling is discussed as a key component of the advancements in natural language processing. Language models like GPT (Generative Pre-trained Transformer) are trained on large amounts of text data to understand and generate human-like text. The video traces the progress of language modeling from n-gram models to neural networks like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory) to the more recent transformer-based models like GPT.

💡Unsupervised Learning

Unsupervised learning is a category of machine learning where models are trained on unlabeled data to identify patterns and relationships. In the video, the motivation behind GPT is described as a push towards unsupervised learning in the language domain. Unlike supervised learning, where models are trained on labeled data, unsupervised learning aims to leverage the vast amounts of unlabeled data available, such as text from the internet. The video explains that the hope is for models like GPT to learn powerful representations from this unlabeled data, which can then be adapted to various downstream tasks.

💡Autoregressive Modeling

Autoregressive modeling is a technique used in language models where the model generates text one token (word or character) at a time, based on the previous tokens. The video discusses how autoregressive models like GPT work by predicting the next word or token based on the previous context. This allows the model to generate coherent text by capturing the dependencies between words in a sentence or paragraph. The video highlights the effectiveness of autoregressive modeling in various tasks, such as text completion, question answering, and even code generation.
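To make the next-token loop concrete, here is a minimal sketch of autoregressive sampling. It assumes a hypothetical `next_token_logits(context)` function that returns a score for every vocabulary token; it is an illustration of the idea, not GPT's actual implementation.

```python
import numpy as np

def sample_next_token(next_token_logits, context, temperature=1.0, rng=None):
    """Draw one token from p(next token | context) using temperature sampling."""
    rng = rng or np.random.default_rng()
    logits = next_token_logits(context) / temperature   # shape: (vocab_size,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                 # softmax
    return int(rng.choice(len(probs), p=probs))

def generate(next_token_logits, prompt_tokens, max_new_tokens=20, temperature=1.0):
    """Autoregressive generation: repeatedly append the sampled next token."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        context.append(sample_next_token(next_token_logits, context, temperature))
    return context
```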

💡Transformer Architecture

The transformer architecture is a type of neural network architecture that has been highly successful in natural language processing tasks. The video discusses how the transformer architecture, used in models like GPT, has enabled significant advancements in language modeling. The transformer uses an attention mechanism to capture long-range dependencies in sequences, making it well-suited for language tasks. The video traces the progression from earlier architectures like RNNs and LSTMs to the transformer, which has enabled the development of large, powerful language models like GPT-3.

💡Zero-Shot Learning

Zero-shot learning refers to the ability of a model to perform tasks without being explicitly trained on those tasks. In the context of the video, the discussion revolves around how large language models like GPT-2 can perform tasks like reading comprehension, summarization, and translation in a zero-shot setting. This means that the model, which was trained only on the task of predicting the next word, can generalize to solve these other language tasks without any additional fine-tuning. The video highlights how this zero-shot capability emerges as the models are scaled up in size.

💡Few-Shot Learning

Few-shot learning is a learning paradigm where a model is provided with a few examples or prompts and is expected to generalize to solve a task. The video discusses how GPT-3, with its massive scale, can perform few-shot learning on a variety of tasks, such as arithmetic and word unscrambling. By providing a few examples of the task in the prompt, the model can adapt and solve new instances of the task. The video demonstrates how few-shot performance improves as the model size increases, enabling the model to learn new tasks from just a handful of examples.

💡Multimodal Learning

Multimodal learning refers to the ability of a model to process and integrate information from multiple modalities, such as text and images. The video discusses the DALL-E model, which can jointly model text and images. DALL-E is trained on pairs of text descriptions and corresponding images, enabling it to generate images based on text prompts or perform tasks like image-to-image translation guided by text instructions. The video highlights the versatility of the transformer architecture in handling different modalities, showcasing its potential for multimodal tasks.

💡Code Generation

Code generation is the task of generating computer programs or code snippets using machine learning models. The video introduces Codex, a model fine-tuned on a large corpus of code, which can generate code based on natural language prompts or specifications. The video discusses the evaluation of Codex on a dataset called HumanEval, which contains programming problems with unit tests. The video highlights the effectiveness of sampling multiple solutions from Codex and ranking them based on metrics like mean log probability to improve the model's performance on code generation tasks.
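As a rough illustration of the functional-correctness checking that HumanEval-style evaluation relies on, the sketch below executes a candidate completion against its unit tests in a scratch namespace. The problem and tests are invented examples, and a real harness would sandbox execution and enforce timeouts.

```python
def passes_unit_tests(candidate_source: str, test_source: str) -> bool:
    """Return True if the generated code runs and all of its unit tests pass.

    A real evaluation harness would run this in an isolated process with
    resource limits; this sketch skips all of that.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)   # define the candidate function
        exec(test_source, namespace)        # asserts raise on failure
        return True
    except Exception:
        return False

# Hypothetical example in the HumanEval style: completion plus its tests.
candidate = """
def add(a, b):
    return a + b
"""
tests = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""
print(passes_unit_tests(candidate, tests))  # True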

💡Neural Scaling Laws

Neural scaling laws refer to the observed phenomenon that the performance of neural networks tends to improve predictably as the model size (number of parameters) increases. The video mentions neural scaling laws in the context of explaining the significant performance gains observed in GPT-3 compared to its predecessors. The idea is that by scaling up the model size, the model's capabilities, such as zero-shot and few-shot learning, improve in a way that can be predicted by scaling laws. This provides motivation for exploring larger and larger models to unlock more powerful language understanding and generation capabilities.
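The extrapolation idea ("predict a larger model from a series of smaller ones") can be illustrated by fitting a power law L(N) ≈ a · N^(-b) in log-log space. The numbers below are invented purely for illustration; they are not the published GPT-3 measurements or the published scaling-law constants.

```python
import numpy as np

# Invented (parameter count, validation loss) pairs; illustration only.
sizes  = np.array([1e8, 3e8, 1e9, 3e9])
losses = np.array([3.9, 3.6, 3.3, 3.05])

# Fit log(loss) = slope * log(N) + intercept, i.e. loss(N) ≈ a * N ** (-b).
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
b, a = -slope, np.exp(intercept)

def predicted_loss(n_params: float) -> float:
    return a * n_params ** (-b)

print(f"fitted exponent b = {b:.3f}")
print(f"extrapolated loss at 175e9 params: {predicted_loss(175e9):.2f}")
```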

💡Sampling

Sampling refers to the process of generating multiple outputs from a language model or any generative model. The video highlights the "unreasonable effectiveness of sampling" in the context of code generation with Codex. It demonstrates that by generating multiple samples from the model and ranking them using heuristics like mean log probability, the model's performance on code generation tasks can be significantly improved. This is because different samples from the model may capture different valid solutions, and sampling allows the model to explore this diversity. The video emphasizes the importance of sampling in extracting the full potential of large language models.
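A sketch of the rerank-by-mean-log-probability heuristic mentioned above, assuming a hypothetical `sample_with_logprobs(prompt)` that returns a sampled completion together with the per-token log probabilities the model assigned to it; this is a simplified illustration rather than the exact procedure used for Codex.

```python
def rerank_by_mean_logp(sample_with_logprobs, prompt, n_samples=100):
    """Draw many candidate completions and keep the one whose tokens the model
    was, on average, most confident about (highest mean log probability)."""
    scored = []
    for _ in range(n_samples):
        completion, token_logprobs = sample_with_logprobs(prompt)
        mean_logp = sum(token_logprobs) / max(len(token_logprobs), 1)
        scored.append((mean_logp, completion))
    best_score, best_completion = max(scored, key=lambda s: s[0])
    return best_completion
```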

Highlights

The GPT model was not developed specifically for language modeling, but rather as a result of work on pushing unsupervised learning in language.

Autoregressive modeling is universal and can yield strong results even in domains with strong inductive biases, like images or text-to-image tasks.

By fine-tuning GPT-3 on code data and employing sampling, strong code-generating models can be produced, with the 'unreasonable effectiveness of sampling' significantly boosting model performance.

GPT-3 already had a rudimentary ability to write Python code from docstrings or descriptive method names, despite not being trained on much code data.

A new evaluation dataset called HumanEval was created, consisting of handwritten programming problems with function names, docstrings, solutions, and unit tests.

The 'pass@k' metric was introduced, measuring the average probability that at least one out of k samples passes the unit tests for a given problem.
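For reference, the standard unbiased estimator of pass@k, given n samples for a problem of which c pass, is 1 − C(n−c, k)/C(n, k); a small numerically stable implementation is sketched below, with example numbers chosen only for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples passes the unit tests),
    given n total samples for the problem of which c passed."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a product to avoid huge factorials.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative example: 200 samples drawn, 15 passed, estimate pass@10.
print(round(pass_at_k(n=200, c=15, k=10), 3))
```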

Techniques like compressing runs of white space and fine-tuning from GPT-3 models were employed to make training more efficient.

Sampling at different temperatures affects the pass@k rate, with higher temperatures allowing for more diverse samples at the cost of lower individual sample quality.

Ranking samples by their mean log probability (mean log-p) rather than sampling probability can approximate the 'oracle sampling' performance without access to unit tests.

Fine-tuning Codex on additional data sources like competitive programming problems and projects with continuous integration tests further improved performance.

Generative models like Codex can struggle with variable binding and composition of simple operations, limitations that human programmers do not face.

Progress in neural language modeling has been rapid, driven by advances in unsupervised learning.

Autoregressive models can model any sequence of bytes, making them applicable to various modalities like images and audio.

Techniques like using contrastive loss during pretraining and scaling up model size significantly improved unsupervised learning capabilities.

The ability to distinguish between real and fake samples decreased as model size increased, approaching random chance for large language models like GPT-3.

Transcripts

play00:05

Great.

play00:06

OK, perfect.

play00:07

So a sample from this model looks like this.

play00:10

"They also point to ninety nine point

play00:12

six billion dollars from two hundred four

play00:13

oh six three percent."

play00:16

It's a bunch of kind of gibberish.

play00:18

So the sentence isn't too coherent,

play00:21

but at least the words do seem to be somewhat related,

play00:23

like they come from the same space.

play00:27

Now, jumping forwards to the beginning of the deep learning

play00:30

boom in 2011, we have language modeling with neural networks

play00:34

now, and in particular with recurrent neural networks.

play00:38

So you can get rid of this giant lookup table

play00:40

from the n-gram models.

play00:41

And instead, we can have our inputs be these tokens

play00:46

and let this kind of recurrent cell remember some state

play00:50

and persistent state.

play00:51

So if we set up a neural model like this,

play00:54

we get a sample as shown below.

play00:56

"The meaning of life is the tradition

play00:58

of the ancient human reproduction--

play00:59

it is less favorable to the good boy for when to remove bigger."

play01:03

So again, this doesn't really make any sense,

play01:06

but it kind of starts to have the flow of a real sentence.

play01:12

Yeah, so jumping forward even more to 2016,

play01:14

we have LSTM models.

play01:17

And of course, LSTMs are an architectural innovation

play01:21

on top of RNNs.

play01:22

And they have better gradient flow,

play01:23

so they can better model long-term dependencies.

play01:28

And so with an LSTM model, we get a sample like this.

play01:32

"With even more new technologies coming onto the market

play01:35

quickly during the past three years,

play01:37

an increasing number of companies

play01:39

must tackle the ever changing and ever

play01:41

changing environmental challenges online."

play01:43

So this sentence is starting to make a little bit of sense,

play01:45

though there are clear artifacts, like the repetition

play01:48

of the phrase ever changing.

play01:53

Now, starting in 2018, we have our first autoregressive

play01:56

transformer based language models,

play01:58

which are even better at modeling these very

play02:00

long-term dependencies.

play02:03

And here, what I'm showing is an example of a completion.

play02:06

So in a completion, the user supplies the prompt.

play02:10

In this case, it's this text, Wings Over Kansas.

play02:14

And the model will continue from this prompt.

play02:18

So you can see that this completion

play02:19

is coherent across multiple sentences

play02:22

now, though there are notable spelling mistakes.

play02:25

So you see this whatever "daknfi" is.

play02:28

So it doesn't kind of make sense.

play02:33

And now we arrive at GPT-2, which is a 1.5 billion

play02:37

parameter transformer model.

play02:39

And I copied in what I personally

play02:41

found was the most compelling conclusion from GPT-2.

play02:45

And in contrast with the last slide, what this does

play02:49

is it sets up a clearly fake prompt.

play02:51

So we have something about finding unicorns and scientists

play02:56

in South America.

play02:58

And so the model has probably not

play03:00

seen this exact prompt before.

play03:01

It has to make up something that's consistent.

play03:04

So the thing I find most impressive is it does so,

play03:08

and it's coherent across multiple paragraphs.

play03:10

It invents this fictional Dr. Perez,

play03:13

and it persists Perez throughout multiple paragraphs.

play03:17

And I think it's very aptly named.

play03:20

You have him from University of La Paz.

play03:23

And yeah, we just have barely coherent completions

play03:26

at this point.

play03:27

So it's worth disclosing that this

play03:29

was the best of 10 samples.

play03:31

So we still had to sample multiple times

play03:34

to get a sample like this.

play03:39

And finally, to end this section--

play03:41

I'm sorry.

play03:41

Can I interrupt?

play03:41

Yeah, for sure.

play03:42

We're not just thinking of examples of the failing,

play03:44

the worst of the text.

play03:45

I can post them up, yes.

play03:48

[INAUDIBLE] what's bad and what's [INAUDIBLE]..

play03:50

Yes, yes, yes, yes.

play03:51

[INAUDIBLE]

play03:52

[LAUGHTER]

play03:55

Wait, sorry.

play03:55

One last question.

play03:56

When you have these 10-- you said

play03:58

we took the best of the 10.

play03:59

Best in what sense?

play04:00

Yeah, so this is human-judged.

play04:02

And I'll probably expand a little bit

play04:03

on that more today, yeah.

play04:05

So I want to end this kind of fly by overview with GPT-3.

play04:11

And since GPT-2 already produces such coherent text,

play04:15

how do you characterize GPT-3?

play04:17

And I would say that the best way

play04:19

to do so is this: say you took the best out of five

play04:24

or ten completions from GPT-2.

play04:26

That would be kind of your first completion from GPT-3.

play04:29

And of course, best is kind of a personal metric here.

play04:34

So here, I'm showing a completion from the book

play04:38

Three-Body Problem.

play04:39

And you can see that the impressive things

play04:43

about this completion are that it really stays

play04:45

true to the style of the novel.

play04:49

I think the second thing that kind of impressed

play04:50

me was just how poetic like the metaphors and similes that it

play04:54

produces are.

play04:55

So you have this stuff like blood

play04:57

was seeping through a jacket and a dark red flower

play04:59

was blooming on her chest, like these kind of very, very

play05:02

poetic and stylistic sentences.

play05:04

So it definitely understands it's part of a novel,

play05:06

and it's trying to generate this kind of prose

play05:09

in the same style.

play05:13

So as generated text becomes more and more coherent,

play05:16

I think one of the really--

play05:17

[INAUDIBLE] how much bigger is it in terms of the parameters,

play05:20

is GPT-3?

play05:21

Yeah, yeah, so it's 175 billion parameters versus GPT-2,

play05:24

which is around 1 billion.

play05:26

[INAUDIBLE]

play05:31

Do you feel like that very subtle increase in accuracy

play05:34

is the root cause of how much difference [INAUDIBLE]??

play05:36

Yeah, that's a very good question.

play05:40

So there's kind of stuff-- maybe we

play05:41

can dive into it a little bit after,

play05:43

but there is work on neural scaling laws.

play05:45

And so the idea is like, can you predict the performance

play05:47

of a larger model from a series of smaller models?

play05:50

And so I would rather characterize the increase

play05:52

in performance not by the small gain in perplexity,

play05:55

but whether it lines up with the projections.

play05:58

And in that sense, GPT-3 does.

play06:00

So yeah, that's some intuition for--

play06:03

yeah.

play06:04

I think personally, I hope OpenAI would have stopped

play06:06

the experiment if it didn't.

play06:07

So yeah.

play06:08

No,

play06:09

I just think it's interesting for, this

play06:11

is more of a general thing.

play06:12

[INAUDIBLE]

play06:14

In machine learning, you see people

play06:15

pushing for like an extra 1% to probably 5% accuracy,

play06:20

but the models are increasing at a scale that's exponential.

play06:24

Right.

play06:24

So I wonder sometimes whether it's worth it

play06:27

and where you should stop [INAUDIBLE]..

play06:30

Right.

play06:31

Yeah, I think maybe this slide will get to it a little bit.

play06:33

But there's also some sense in which

play06:35

like as you reach kind of like the entropy floor of modeling,

play06:39

every halving kind of gives you--

play06:43

if you think about accuracy, it's not on a linear scale.

play06:47

A 1% early on isn't the same as that last 1%.

play06:50

And so those last bits really do help you squeeze

play06:55

a little bit out of that.

play06:56

That's obvious.

play06:58

Yep.

play06:58

Sorry.

play06:59

[INAUDIBLE] the [INAUDIBLE] axis too?

play07:00

Oh yes.

play07:01

Sorry, this is accuracy [INAUDIBLE]..

play07:03

I will explain this slide.

play07:04

Cool.

play07:06

So as generated text becomes more and more realistic,

play07:09

I think one very natural question to ask

play07:11

is whether humans can still distinguish

play07:13

between real and fake attempts, right?

play07:16

And here we have--

play07:18

this is, of course, a very set up scenario.

play07:21

In all cases, the models wouldn't trick humans.

play07:24

But this is for news articles, we kind of

play07:27

presented GPT-3 generated samples

play07:29

against real news articles.

play07:31

And you can tell as the number of parameters increases,

play07:34

the ability of humans to distinguish

play07:36

between the real and fake articles--

play07:39

that ability goes down to near random chance.

play07:41

And, oh, yes?

play07:44

How did you generate the news articles?

play07:46

What prompts did you use?

play07:49

Oh, I'm actually not completely sure.

play07:52

So I didn't do this work particularly,

play07:55

but I think one possible approach would

play07:57

be to prime with a couple of news articles and then

play08:00

just to have a delimiter and just

play08:01

have it start generating news articles from there.

play08:04

Yeah?

play08:07

Any other quick questions?

play08:11

Great.

play08:12

So even with all of these impressive results,

play08:14

I think it's worth taking a step back at this point and asking,

play08:18

what do we really care about language modeling for?

play08:21

And what is it actually useful for?

play08:23

I think one can make the argument that it is actually

play08:26

a fairly narrow capability.

play08:28

Why would you just want some system that

play08:29

just continues text for you?

play08:32

And you could argue that there's more important tasks

play08:34

to solve, like summarization or translation.

play08:37

And I think most researchers at OpenAI

play08:39

would agree with this point of view.

play08:41

And in fact, GPT was not really a project

play08:44

that was focused on language modeling as an end goal,

play08:47

but mostly as a tool to solve a problem called

play08:50

unsupervised learning, which I'm going to go through

play08:53

in the next couple of slides.

play08:54

So I want to do a history of language modeling at OpenAI

play08:59

and hopefully motivate why we ended up

play09:01

at the GPT series of models, and kind of how we arrived there.

play09:05

And hopefully it will become much more intuitive

play09:08

after this section.

play09:11

So the deep learning boom started in 2012

play09:13

with AlexNet, which was a system that

play09:15

could take images and labels, and it could classify images

play09:19

to their labels.

play09:20

And what we found with AlexNet was these systems

play09:24

were able to generalize surprisingly well.

play09:26

You could take data sets that weren't necessarily

play09:28

in the training distribution, and you just

play09:30

get pretty good features on it.

play09:32

And since then, this kind of supervised approach

play09:34

has been really, really powerful, right?

play09:36

We've been able to train models in many different domains

play09:39

to classify very accurately.

play09:42

And you can even have some guarantees

play09:44

that supervised learning will work well.

play09:46

So there's empirical risk minimization.

play09:49

But the problem with supervised learning

play09:51

is that oftentimes the labels are scarce, right,

play09:54

especially in language tasks.

play09:56

There isn't really that many kind of text

play09:58

paired with their summaries, or too many pairs

play10:00

across languages for instance.

play10:02

So collecting a lot of data can be not too hard, but actually

play10:07

scalably labeling all of that data,

play10:09

it could be very time consuming, and expensive.

play10:12

So the main problem with unsupervised learning

play10:15

is can we also learn from unlabeled data?

play10:18

And this is a lot scarier, because, all of a sudden,

play10:21

we're starting to optimize an objective, which

play10:23

isn't the one we care about downstream, right?

play10:25

So a lot of the guarantees that we used to have,

play10:28

we no longer have.

play10:30

And we can only kind of hope that we

play10:33

learn some features that are adaptable to a wide variety

play10:36

of downstream tasks.

play10:39

But nevertheless, there's a reason

play10:41

to be very optimistic in language.

play10:43

And the reason is that there is a huge trove of unlabeled data.

play10:47

And it's called the internet.

play10:48

And so the real question is, can we

play10:50

leverage all this available data from the internet

play10:53

to solve language tasks where we don't really

play10:55

have that much data?

play10:57

And the hope is that if we kind of pretrain

play11:00

this model on the internet, it will

play11:01

see all of these words used in different settings,

play11:03

kind of understand the relationships,

play11:05

and they'll be able to leverage this kind of understanding

play11:08

for any kind of task du jour.

play11:12

So now that we've established why language

play11:14

is such a good domain to try unsupervised learning in, let's

play11:17

talk about why use generative models for it,

play11:19

and also why use autoregressive generative models.

play11:23

And I do want to stress that a lot of the guarantees we have

play11:26

with supervised learning are no longer there for unsupervised

play11:28

learning.

play11:29

So some of these arguments will be

play11:30

a little bit kind of intuitive.

play11:34

And so the first argument I want to present is this quote

play11:37

by Richard Feynman which is pretty widespread,

play11:40

"What I cannot create, I do not understand."

play11:43

And there's the inverse of this idea, which we call analysis

play11:46

by synthesis.

play11:47

And it's "What I can create, I can also understand."

play11:49

And this has been studied by Josh Tenenbaum.

play11:53

There's definitely some kind of biological motivation

play11:57

as well for it.

play12:00

But the idea here is that if you're

play12:02

able to create a language model which can generate

play12:05

diverse samples that are coherent, then

play12:07

it must also build up representations

play12:09

that can help you solve language understanding tasks.

play12:13

And then the next question is, why do we

play12:15

use autoregressive models?

play12:17

You might argue that autoregressive models

play12:19

are a kind of local objective.

play12:22

You're just predicting the next words.

play12:23

You could do really well with some n-gram approximation.

play12:27

Why would it be good at solving things

play12:30

that allow you to summarize an entire piece of text?

play12:33

And so, an intuitive argument here

play12:35

could be, say that you wanted to do very well on language

play12:39

modeling for a mystery novel.

play12:41

And there's this grand reveal at the end,

play12:43

like, oh, the culprit was--

play12:45

and then you want to predict that next token.

play12:47

And to do really well at that task,

play12:49

you really need to have a good understanding

play12:51

of what happened in the story along with all the twists

play12:54

and turns, and maybe even some of this kind of like deductive

play12:56

reasoning built in.

play13:00

So the first sign of life--

play13:02

did you have a question?

play13:05

[INAUDIBLE]

play13:06

Oh, yeah.

play13:08

So the first sign of life we had at OpenAI

play13:10

was in the task of predicting whether Amazon reviews were

play13:13

positive or negative.

play13:15

And this was worked on in 2017.

play13:17

So instead of training a classifier

play13:19

in the kind of typical supervised way, what we did

play13:22

was we trained an LSTM model just

play13:24

to predict the next character in Amazon reviews.

play13:28

And when we trained a linear model on the features

play13:30

from this LSTM, what we found, surprisingly,

play13:33

was one of these cells or one of these neurons

play13:36

was firing in terms of predicting sentiment.

play13:40

And positive activations for this neuron

play13:43

corresponded to positive reviews,

play13:44

and negative activations to negative reviews.

play13:47

And this was despite not seeing any of the labels

play13:50

at training time.
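A rough sketch of the setup just described: fit a linear classifier on frozen LSTM features and look for individual units with outsized weight. Here `lstm_features` is a stand-in for whatever function extracts the hidden state for a review; this is not the actual code from the sentiment-neuron work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_sentiment_unit(lstm_features, reviews, labels):
    """Fit a linear probe on frozen LSTM hidden states and report the single
    hidden unit carrying the most weight for the sentiment prediction."""
    X = np.stack([lstm_features(text) for text in reviews])  # (n, hidden_dim)
    y = np.array(labels)                                     # 1 = positive, 0 = negative
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    unit = int(np.argmax(np.abs(probe.coef_[0])))            # most predictive neuron
    return probe, unit
```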

play13:52

So you can even track kind of what this neuron

play13:55

value is across a sample.

play13:57

So it's a little bit hard to read,

play13:58

but these are reviews where maybe someone says,

play14:00

oh, I really liked this film, but I didn't like this part.

play14:03

And you can kind of see the sentiment switching as you

play14:05

go from positive to negative.

play14:09

So yeah, just predicting the next character

play14:12

resulted in-- oh yeah?

play14:14

Was there any sort of [INAUDIBLE] architecture

play14:17

to encourage this?

play14:20

No, this was just a pure LSTM.

play14:22

OK.

play14:22

So you guys came up with all the neurons,

play14:24

saw which ones were closest?

play14:25

Yeah, in the hidden state.

play14:26

Yeah.

play14:26

So you train a linear classifier on top of that.

play14:28

And one neuron is firing with--

play14:30

yeah, just outsized predictive power.

play14:33

Yeah, great.

play14:34

So next, GPT-1 was one of the first demonstrations

play14:37

that this kind of approach could work broadly for text.

play14:40

So GPT-1 was trained on the internet, not on Amazon reviews

play14:43

anymore.

play14:44

And it was fine tuned on a bunch of different downstream tasks.

play14:48

And one thing to stress here was, kind of to your point

play14:52

that the fine tuning was very--

play14:55

I guess minimally-- you're not kind

play14:58

of bashing the architecture apart and kind of repurposing

play15:03

a new module.

play15:04

So it's just a new head that classifies for your task.

play15:09

And this showed that you can use this approach

play15:11

not just for sentiment analysis, but also for entailments,

play15:15

and semantic similarity, and getting SotAs

play15:17

on a lot of these benchmarks downstream.

play15:21

So I've already presented GPT-2 from the point

play15:24

of view of a very powerful language model.

play15:26

And now, I think it's worth revisiting from the viewpoint

play15:29

of unsupervised learning.

play15:31

So like GPT-1, GPT-2 was trained on a large chunk

play15:34

of the internet.

play15:36

And it's only trained to predict the next token

play15:38

or word from previous words.

play15:40

But the key insight of GPT-2 is that many downstream tasks

play15:44

can be expressed naturally as a language modeling task.

play15:48

And yeah, so GPT-2 explores how well

play15:51

we can perform on downstream tasks

play15:52

simply by using this method without any fine tuning, right?

play15:56

So let me start with a couple of examples.

play15:58

So let's say you want to solve some reading comprehension

play16:01

benchmark.

play16:02

And this is usually set up as a prompt, which

play16:04

is some passage you have to read,

play16:05

and then a bunch of questions, which you have to answer.

play16:08

So you can literally just stick the entire passage in context.

play16:11

You put a question, colon, you write out

play16:13

the question, answer, colon.

play16:15

And then have the model complete from there.

play16:17

And this gives you Zero-Shot reading comprehension.

play16:22

We can also use it for other tasks, like summarization.

play16:25

For instance, here's the beginning

play16:28

of a CNN article about kind of some archaeological finding.

play16:33

And you can just put TLDR after you see this passage.

play16:37

And the model, hopefully, if it's good enough,

play16:39

will produce good summaries.

play16:44

And the final example I want to show

play16:45

is that you can do Zero-Shot translation as well.

play16:48

So the way you would do this is if you wanted to convert,

play16:52

let's say, a French sentence into English,

play16:54

you could set up a prompt like the sentence,

play16:56

insert the French sentence, "translated from French

play16:58

to English means," and then the model will complete.

play17:02

And they can sometimes do this well.
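The three zero-shot prompt formats just described reduce to string templates like the ones below (illustrative wording, not necessarily the exact prompts used in the GPT-2 experiments); the model simply continues the text after the final cue.

```python
def reading_comprehension_prompt(passage: str, question: str) -> str:
    return f"{passage}\nQuestion: {question}\nAnswer:"

def summarization_prompt(article: str) -> str:
    return f"{article}\nTL;DR:"

def translation_prompt(french_sentence: str) -> str:
    return f'"{french_sentence}" translated from French to English means'
```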

play17:04

And one kind of critical thing to note

play17:06

here is here's a chart of performance

play17:09

as you increase the number of parameters.

play17:12

And all these models are trained on the same data sets,

play17:17

so the only kind of confounding variable is scale.

play17:20

And you can see that as we scale up the models,

play17:22

these kind of Zero-Shot capabilities

play17:24

emerge and kind of smoothly get better.

play17:28

So the role of scale is important here.

play17:32

And I think these are starting to approach--

play17:35

I guess they're not great benchmarks, but at least

play17:37

respectable benchmarks.

play17:39

[INAUDIBLE]

play17:41

Yeah, exactly.

play17:42

It's not going to be great in a lot of cases.

play17:44

And to be honest, the BLEU metric

play17:47

used for translation is actually often--

play17:49

thank you very much.

play17:51

It's not a great metric.

play17:52

What it does is it takes a reference solution.

play17:55

And basically, it does some kind of like n-gram comparison.

play17:59

So it is a big problem to have good translation

play18:04

metrics in NLP.

play18:08

And yeah, I think when I talk about code,

play18:10

I'll talk a little more about [INAUDIBLE]..

play18:16

So let's finally talk about how GPT-3 fits into this picture.

play18:20

So the primary insight of GPT-3 is

play18:22

that the training process itself can be interpreted

play18:25

in the context of metalearning, which is kind of like learning

play18:28

over a distribution of tasks.

play18:30

And during training, what the model

play18:31

is doing is it's developing certain kind of capabilities,

play18:34

it's picking up some set of skills in terms

play18:39

of modeling certain passages.

play18:41

And during inference time, what it's doing,

play18:45

it's kind of quickly picking up on what a task is based on what

play18:48

the prompt is so far, and adapting to that task

play18:51

to predict the next token.

play18:53

So you can kind of view this as an outer loop of all the SGD updates

play18:56

that you're doing during training,

play18:58

and this inner loop of kind of picking up

play19:00

on what the task is, and then modeling the next token.

play19:04

So you can imagine a lot of tasks being framed in this way.

play19:07

For instance, on the left, you can have addition.

play19:10

You have a lot of examples of addition in the context.

play19:13

And hopefully, that would help you with a new addition

play19:16

problem, or you can try to kind of unscramble

play19:19

a word for instance.

play19:20

And I'll explore results on these two kind of benchmarks

play19:23

in the next slides.

play19:25

So this setting, you can call few-shot arithmetic.

play19:28

And just to explain what's going on,

play19:30

you're taking the entire context window of your transformer

play19:33

and you're putting in as many examples as will fit.

play19:36

And then finally, you put in the example

play19:39

that you would like to solve.

play19:40

So here, these examples could be these kind

play19:45

of first three addition problems,

play19:47

and then you have 31 plus 41 equals.

play19:50

And you ask the model to complete.
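Packing the context window as described, with as many solved examples as fit followed by the unsolved one, looks roughly like this; the examples mirror the ones mentioned above, and the character-based budget check is a simplification of real token counting.

```python
def few_shot_arithmetic_prompt(solved, query, max_chars=2048):
    """Concatenate solved addition examples, then the query for the model to finish.
    A real implementation would count tokens, not characters."""
    lines = [f"{a} + {b} = {a + b}" for a, b in solved]
    prompt = "\n".join(lines) + f"\n{query[0]} + {query[1]} ="
    return prompt[-max_chars:]   # crude truncation so the query stays at the end

print(few_shot_arithmetic_prompt([(12, 7), (48, 76), (9, 30)], (31, 41)))
```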

play19:52

So you notice that as the language model gets bigger,

play19:55

it's better able to recognize this task.

play19:58

And you can see that performance on addition, subtraction,

play20:02

even some kind of multiplication tasks

play20:04

increases sharply as you go towards 200 billion parameters.

play20:08

And there does seem to be kind of some step function change

play20:10

right here.

play20:12

And looking at word unscrambling,

play20:15

this is also true.

play20:16

So we have parameters again on the x-axis, we have accuracy.

play20:20

And each of these is a different kind of unscramble task.

play20:23

So this blue line is you kind of do

play20:25

a cyclic shift of the letters, and you wanted to uncycle.

play20:28

And there's a lot of other transforms

play20:30

you can do, like randomly inserting words for instance.

play20:36

So the final point here is that this

play20:39

is a pretty general phenomenon.

play20:40

We didn't just test it on these two aforementioned tasks.

play20:45

We tried an array of I think 40 plus tasks.

play20:48

And here, you can see how the Zero-Shot, One-Shot,

play20:50

and Few-Shot performance increases

play20:52

as we scale the models.

play20:54

So of course, they're all smoothly increasing.

play20:56

But one thing to be aware of is that the gap between Zero-Shot

play21:00

and Few-Shot is also improving as a function of scale.

play21:07

Awesome.

play21:09

So we've just seen that we can pretrain the--

play21:11

Oh.

play21:11

Go ahead.

play21:12

Sorry, with few-shot learning, I was curious, [INAUDIBLE]..

play21:18

One is the tasks themselves that were used.

play21:22

Two is the number of parameters.

play21:24

And then three, my understanding is also

play21:26

the quantity of [INAUDIBLE].

play21:28

I was curious, between those three, which ones--

play21:31

you've shown a lot of examples.

play21:32

The number of parameters definitely helps.

play21:34

I was curious though if you had a sense of the degree

play21:36

to which also the training tasks and the sophistication

play21:39

of the tasks, as well as the quantity of [INAUDIBLE]

play21:41

adjustments [INAUDIBLE].

play21:43

Yeah.

play21:44

So I guess I can dive--

play21:46

maybe it's something to save for or after.

play21:49

Yeah, let's dig into that after.

play21:51

Yes?

play21:51

Just a thought, [INAUDIBLE] a little bit, too, right?

play21:55

I guess GPT-2 and 3 aren't different.

play21:58

GPT-1 just has an extra classification head

play22:01

for certain tasks here.

play22:05

Great, yeah.

play22:06

Good questions.

play22:08

So yeah, we've just seen that we can use a transformer

play22:10

in this kind of pretrained the [INAUDIBLE] setups,

play22:13

where we have a lot of unlabeled data in the pretraining

play22:17

setting.

play22:17

And we have just a little bit of data

play22:19

in the fine-tuned settings.

play22:20

And we can solve a lot of language tasks in this way.

play22:24

And I would say this has become the dominant paradigm

play22:26

in language over the last couple of years.

play22:28

So there's follow up objectives, like BERT and T5,

play22:32

which have done extremely good at pushing the SotA.

play22:35

But there's nothing really that says

play22:36

that these transformer models have to be applied to language.

play22:40

The transformer is a sequence model.

play22:41

And as such, it can just ingest any sequence of bytes

play22:45

and model them.

play22:46

And when you think about this, all of the data

play22:48

that we consume, like videos or audio,

play22:51

they're represented on our computers

play22:52

as sequences of bytes, right?

play22:53

And so we might think, oh, could this approach

play22:57

be used to just model whatever modality we want?

play23:01

And I think this kind of paradigm

play23:04

is very at least interesting when we don't really

play23:08

have good inductive biases.

play23:09

We don't necessarily update them.

play23:11

But one question to ask is, does it even

play23:13

work when you do have really strong inductive biases?

play23:16

So I'm going to present some work that

play23:20

suggests that the answer is yes, it still

play23:22

works fairly well in this case in the domain of images, where

play23:26

convolutions are already so popular and proven out.

play23:30

And I'm going to show a second result very briefly here,

play23:32

which is DALL-E, which shows that it's strong enough

play23:35

to even ingest two different modalities

play23:37

and be able to jointly model them.

play23:42

So the first question is, how would you apply GPT to images?

play23:46

And there's a few things you have to do.

play23:48

You have to modify this autoregressive next word

play23:50

prediction objective.

play23:53

So the natural analog is you can think of images

play23:55

as a very strange language, where the words are pixels

play23:58

instead.

play23:59

And instead, you need to predict the next pixel at each point.

play24:03

And so we can just change the objective for the next word

play24:05

prediction to next pixel prediction.
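Concretely, the image is unrolled into a 1-D pixel sequence in raster order and training pairs are (prefix, next pixel), just like next-word prediction. A minimal numpy sketch, assuming pixels have already been quantized to a small palette (the 512-entry palette here is only an example):

```python
import numpy as np

def image_to_sequence(image: np.ndarray) -> np.ndarray:
    """Flatten an H x W image of palette indices into a 1-D sequence (raster order)."""
    return image.reshape(-1)

def next_pixel_targets(seq: np.ndarray):
    """Each position's input is the prefix so far; its target is the next pixel."""
    return seq[:-1], seq[1:]

img = np.random.randint(0, 512, size=(32, 32))   # 32x32 image, 512-color palette
inputs, targets = next_pixel_targets(image_to_sequence(img))
```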

play24:08

And of course, we want this kind of large-- yeah?

play24:10

[INAUDIBLE]

play24:14

Oh, yeah.

play24:14

So you just unroll it as a sequence.

play24:16

It's the same way it's stored on a computer.

play24:19

You just have a sequence of bytes, yeah, yeah.

play24:21

Good question.

play24:23

So in the language setting, we pretrain

play24:25

on this large unlabeled data set on the internet,

play24:28

and we fine tuned on question answering

play24:31

or these other benchmarks.

play24:33

In images, one good analog of the situation

play24:35

is you can pretrain on ImageNet without the labels.

play24:38

If you have, let's say, a low resource-- a low data,

play24:41

sorry, setting like CIFAR.

play24:42

And you can try to attack CIFAR classification.

play24:46

And of course, in both settings, you can do fine tuning.

play24:48

In GPT, you can do Zero-Shot.

play24:49

And I would say the standard eval

play24:51

on images is you do linear probes,

play24:53

so you take features from your model.

play24:56

The model is frozen.

play24:57

You pass CIFAR through the model,

play24:59

get some features.

play25:00

And you see how predictive these features

play25:02

are of the CIFAR classes.

play25:06

Is it kind of PixelCNN which basically

play25:08

asks a model to predict the next pixel given the [INAUDIBLE]..

play25:12

Yeah.

play25:12

So PixelCNN is an instantiation of an autoregressive image

play25:16

generation model.

play25:17

So what we're asking here is, can we actually take

play25:19

the same transformer architecture that we

play25:21

use in language, don't make any modifications at all,

play25:24

and just throw--

play25:25

so there's no kind of 2D prior on this, yeah.

play25:32

So yeah, I'll call this model that we trained Image GPT,

play25:35

or IGPT for short.

play25:37

And here, you can see actually what some completions

play25:40

from the model look like.

play25:41

So on the left column what I'm feeding in

play25:44

is the pixels of the first half of the image.

play25:47

In the next four columns, what you're seeing

play25:49

is different model generated completions.

play25:53

In the right column here is the original reference image.

play25:56

And you can actually see that the model

play25:58

is kind of doing some interesting things, right?

play26:00

If you look at the last two rows,

play26:01

it's not coming up with kind of magically the same completion

play26:04

every single time.

play26:05

It's like putting these birds in different settings,

play26:07

sometimes adding reflections.

play26:10

It's putting this lighthouse in grassy areas

play26:12

and like watery areas for instance.

play26:15

So if you buy into this philosophy of analysis

play26:17

by synthesis, we definitely have some hint

play26:20

of the synthesis part.

play26:24

So I don't have time to go through all of the results with

play26:26

you, but I just want to say that it is fairly successful in this

play26:30

CIFAR setting where you don't have much labelled data.

play26:33

If you train a linear model on top of the features,

play26:36

you get better results than if you do the same approach

play26:40

with a ResNet trained on ImageNet with labels.

play26:43

So that's the typical approach in the paper.

play26:45

You train some ResNet on ImageNet,

play26:46

you get the features-- oh, yeah?

play26:48

[INAUDIBLE]

play26:49

Oh, yeah.

play26:49

And if you compare to this approach, a generative model

play26:53

on ImageNet without the labels, take the features.

play26:56

It's actually better predictive of [INAUDIBLE]..

play26:59

Yeah, [INAUDIBLE].

play27:00

What if the architecture for this is the same [INAUDIBLE]??

play27:02

Oh, yeah.

play27:03

[INAUDIBLE]

play27:03

Exactly, yeah.

play27:04

[INAUDIBLE]

play27:04

Yeah, yeah, yeah.

play27:05

It's the GPT architecture, yeah, yeah.

play27:09

So you can modify GPT to have like 2D bias.

play27:12

Like you can do 2D position embeddings.

play27:14

We'll be able to do that.

play27:15

We just want to see can you use the same exact approach.

play27:17

Yeah?

play27:18

So earlier, you said the data's just sequential.

play27:21

But also there's like metadata showing

play27:22

about how that sequential should be reconstructed at the end.

play27:25

So what's the width, for example.

play27:28

Oh, can you explain?

play27:29

Yeah.

play27:30

Sorry if I didn't say that well.

play27:31

So the data on this [INAUDIBLE]?

play27:33

Yes.

play27:34

OK.

play27:34

But when you want to transform this sequence into an image,

play27:37

you have metadata that will say something

play27:38

like-- just like in NumPy arrays, it'll say,

play27:41

here's the stride.

play27:42

So you're just going to rearrange it [INAUDIBLE]..

play27:44

I see.

play27:45

What I'm curious to know, is does

play27:46

GPT, before it's given an image, at least given

play27:48

this metadata [INAUDIBLE]?

play27:50

I see.

play27:51

Yeah, that's an extremely good question.

play27:52

Because I don't know how this problem is solved.

play27:55

Yeah.

play27:55

In this case, all the images have the same shape.

play27:59

Oh, OK.

play28:01

OK, cool.

play28:03

But we don't tell it like the concept of row

play28:05

within the model, yeah.

play28:07

But if all images are the same?

play28:08

Yeah, so it needs to learn it from the data.

play28:10

But yeah, the data looks the same.

play28:11

Got it.

play28:12

[INAUDIBLE] variable image shapes,

play28:15

then they can just submit [INAUDIBLE]..

play28:17

Yeah.

play28:17

Mhm.

play28:21

Aren't there a lot more pixels than there are

play28:24

[INAUDIBLE] sizes [INAUDIBLE]?

play28:26

Yes.

play28:26

This is pretty low resolution images.

play28:31

Yeah, so we can actually-- the models

play28:33

we're comparing against are trained on kind

play28:34

of high resolution images.

play28:35

So I think that makes it even more impressive.

play28:37

Yeah, we're just training on the 32 by 32 res images, yeah.

play28:43

Cool.

play28:44

So if we fine tune these models for CIFAR classification,

play28:46

we can get 99% accuracy, which matches GPipe.

play28:51

GPipe, for instance, is a system which

play28:53

is pretrained on ImageNet with labels and then also fine tuned

play28:56

with labels.

play28:57

So yeah, it just kind of shows you,

play29:00

even this approach which doesn't really know about convolutions

play29:03

can do well.

play29:04

I think you're going to hear more about that next week

play29:06

with Lucas' talk.

play29:10

So by now, it shouldn't be surprising at all

play29:13

that you can model a lot of different modalities

play29:15

with transformers.

play29:17

So in DALL-E, we just ask, what about throwing

play29:20

two different modalities at the model

play29:22

and seeing if it can learn how to condition on text

play29:25

to produce an image.

play29:27

And for instance, one thing you might want it to do

play29:30

is like you provide one of these text captions,

play29:32

and you want it to generate some image like the one below.

play29:35

And the easy way to do this is just

play29:37

train a transformer on the concatenation

play29:39

of a caption and an image.
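The "concatenate and model as one sequence" idea amounts to something like the sketch below: text tokens followed by image tokens, trained with the same next-token objective. The token IDs and lengths are placeholders, not DALL-E's actual tokenizer or image codebook.

```python
def caption_image_sequence(text_tokens, image_tokens,
                           max_text_len=256, text_pad_id=0):
    """Build one training sequence: padded/truncated caption tokens, then image tokens.
    The transformer is trained to predict every next token in this joint sequence."""
    text = list(text_tokens)[:max_text_len]
    text += [text_pad_id] * (max_text_len - len(text))
    return text + list(image_tokens)

# Placeholder IDs purely for illustration.
seq = caption_image_sequence(text_tokens=[17, 4, 92], image_tokens=[301, 302, 303])
```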

play29:41

And of course, in a lot of these situations,

play29:44

the idea is very simple, but the implementation and execution

play29:47

is where the difficulty is.

play29:49

And I'm not going to talk too much about that.

play29:51

I think the focus today is on language.

play29:53

But you can refer to the paper for a lot of those details.

play29:57

Could you describe if you have a caption that's-- a variable

play29:57

caption?

play30:02

Okay, yeah, so you have a max caption length,

play30:06

and you just kind of cut it off at that length.

play30:08

And you can pad up to that one.

play30:14

So you can see that it can generate fairly good samples.

play30:17

So if you want like a storefront with the word OpenAI on it,

play30:20

it's not perfect, but at least it's

play30:22

kind of like reverse OCR problem, where you

play30:25

take some text and render it.

play30:27

And it's kind of typically rendering it

play30:29

in like office-looking places.

play30:31

So that's one encouraging sign.

play30:34

But I do think my favorite results here are Zero-Shot

play30:38

in image transformation.

play30:39

So what's going on here, is, for instance,

play30:41

if your prompt is "the exact same cat

play30:43

on the top as a sketch on the bottom,"

play30:46

and you feed in the top half of this image, which is a cat,

play30:50

and you ask it to complete the rest of the image,

play30:52

then it'll render the top cat actually as like a sketch.

play30:57

And you can do the same thing with flipping over

play30:59

photos for instance.

play31:01

You can zoom in to a photo.

play31:03

Of course, they're not perfect, but it

play31:05

has some understanding of what the text is trying to do, yeah.

play31:08

In the caption originally like in the training set,

play31:13

do they have wording such as extreme closeup view?

play31:17

I think that-- there probably are some examples like that,

play31:21

and that's probably where it's picking up

play31:22

some of this knowledge from, though we

play31:24

don't seek out these examples.

play31:26

It's just--

play31:26

[INAUDIBLE]

play31:27

Yeah exactly.

play31:28

[INAUDIBLE]

play31:30

OK, perfect.

play31:33

This is just-- we just go and do a massive web scrape.

play31:37

We're not trying to find examples like this, right?

play31:40

And so you can also do things like colorization, right?

play31:42

You can take the cat and color it red.

play31:44

And this has to kind of recognize what

play31:47

the object is in the figure.

play31:49

And yeah, here, you can do stuff like semantic transformations,

play31:54

like adding sunglasses into the cat.

play31:57

And you can put it on postage for instance.

play31:59

So it's just remarkable that you can

play32:01

do a lot of these transformations Zero-Shot.

play32:04

It wasn't trained to do these things specifically.

play32:11

Cool, so moving on, the last section of my talk today

play32:14

is on Codex, which is our most recently released code writing

play32:17

models.

play32:19

And the first question you should rightly ask here

play32:22

is, why train a model on code anyway?

play32:26

At this point, isn't it just another modality?

play32:28

And what is the novelty that there is at this point, right?

play32:33

So let me give you a couple of reasons.

play32:36

So first is that GPT-3 had a rudimentary ability

play32:39

to write Python code already from a docstring

play32:42

or a descriptive method name.

play32:43

And we actually didn't train it on much code data.

play32:46

Actually, I think there might have been active filtering

play32:49

to get rid of code data.

play32:50

And so we were surprised that there

play32:51

is this capability anyway.

play32:52

So we thought if we actually purpose-built a model

play32:55

and trained it on the large amount of code

play32:57

that we can find, maybe something interesting

play32:59

will happen there.

play33:01

Next, what sets apart code from other modalities

play33:04

is that there is a kind of ground truth

play33:07

correctness of a sample.

play33:09

And functions can be tested with unit tests and an interpreter.

play33:13

So this is very different from language,

play33:14

where to get a ground truth eval,

play33:16

you might need a human to come in.

play33:18

And even then, sometimes humans won't agree.

play33:19

Like, this is the better example or this

play33:21

isn't the better sample.

play33:24

Last thing is I used to dabble in competitive programming

play33:27

myself, and I really wanted to create a model that could solve

play33:29

problems that I couldn't.

play33:35

Go ahead.

play33:36

[INAUDIBLE]

play33:36

Is this the same thing [INAUDIBLE] get up on this?

play33:40

[INAUDIBLE]

play33:41

Yeah, exactly.

play33:41

[INAUDIBLE]

play33:44

Yeah, we wrote a paper on it, too, so, yeah.

play33:47

So I recognize that you use kind of a high level programming

play33:50

language where it's basically similar to like

play33:53

our human language.

play33:55

Have you guys ever tried to predict some even lower

play33:58

level operations like CPP, or--

play34:01

Yeah, I think there's follow-up work where we just

play34:06

train on a bunch of different languages.

play34:08

And I don't know the metrics off the top of my head,

play34:10

but I have seen some assembly writing models, cool.

play34:14

So I guess, yeah, continue on the [INAUDIBLE]..

play34:21

So we have this setting where we have unit test and interpreter.

play34:25

So how do we actually evaluate these models

play34:28

in a way that's kind of aware of these two concepts?

play34:30

So the first thing we did was we have a data set, a new data

play34:34

set, which is 164 handwritten programming problems.

play34:38

And these kind of have the format shown here.

play34:41

There's a function name, a docstring, there's a solution,

play34:45

and there's an average of around eight unit tests per problem.

play34:48

And why is it important that we hand wrote these?

play34:50

Well, the thing is we're training

play34:52

on such a large part of GitHub.

play34:54

If you said, OK, I'm going to take like some LeetCode

play34:56

problems, and I'm going to turn them into an evaluation.

play34:58

That's not going to work, because there's

play35:00

just so many GitHub repos that are like, oh, here's

play35:02

the solution to this LeetCode problem.

play35:04

So while this doesn't kind of guarantee

play35:06

that this problem isn't duplicated,

play35:08

at least someone wrote it without copying it

play35:11

from another source.

play35:14

So here's some kind of examples of a unit test

play35:17

that you would evaluate the previous function on.
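As a rough illustration (the real tests ship with the released HumanEval dataset; the inputs and outputs below are invented for the sketch), each problem comes with a check function that asserts behavior on a handful of cases, and a generated completion counts as correct only if every assert holds:

```python
# Illustrative HumanEval-style check function; the actual problems provide
# their own hand-written asserts, these cases are made up for the sketch.
def check(candidate):
    assert candidate([1, 2, 3]) == [2, 3, 4]
    assert candidate([]) == []
    assert candidate([5, 0, -1]) == [6, 1, 0]

def passes(candidate) -> bool:
    """Run the unit tests and report functional correctness."""
    try:
        check(candidate)
        return True
    except AssertionError:
        return False
```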

play35:21

I think it should be fairly clear that we

play35:24

should be using this metric.

play35:25

This is the correct kind of ground truth metric to use.

play35:28

I mean, humans do use unit tests to evaluate code.

play35:31

And I would say if you're familiar with competitive

play35:33

programming, you can't manually judge

play35:35

all like tens of thousands of submissions that are coming in.

play35:38

You need the unit tests.

play35:39

And that is a fairly good replacement.

play35:42

So one interesting point here was

play35:44

we had to create a sandbox environment

play35:46

to run these kind of generated solutions in.

play35:49

Because when you train on GitHub,

play35:50

there's a bunch of malicious code,

play35:52

there's a bunch of kind of insecure code.

play35:54

You don't want your model to be sampling

play35:56

that and kind of running that on your environment.
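A minimal sketch of that idea, assuming a Unix environment (a real sandbox, as used for Codex evaluation, also blocks networking and filesystem access; this only shows the timeout-and-resource-limit part, and the helper names are illustrative):

```python
# Hedged sketch: run generated code in a separate process with a timeout and
# a memory cap, so malicious or runaway samples can't affect the host.
import resource
import subprocess

def limit_resources():
    one_gb = 1 << 30
    resource.setrlimit(resource.RLIMIT_AS, (one_gb, one_gb))  # cap address space (Unix only)

def run_untrusted(path_to_script: str, timeout_s: float = 3.0) -> bool:
    """Return True if the script exits cleanly within the time limit."""
    try:
        proc = subprocess.run(
            ["python", path_to_script],
            preexec_fn=limit_resources,   # applied in the child before exec
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```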

play36:00

Cool.

play36:00

So now that we have an evaluation

play36:02

data set, let's define a metric on them.

play36:05

And so the metric we're going to use is called pass @ K.

play36:08

And the definition is the average probability

play36:10

over all the problems that at least 1 out of K samples

play36:14

passes the unit tests.

play36:17

So if we evaluate this metric by just taking every problem

play36:20

and exactly generating k samples, it's actually not--

play36:25

there's high variance just kind of sampling it that way.

play36:28

Imagine the pass rate on a particular problem is around 1

play36:30

over k.

play36:32

This is kind of like an all or nothing metric.

play36:34

So what we do instead is we generate a much larger set

play36:39

of samples, n greater than k--

play36:40

most of the time, it's greater than 5k.

play36:43

And we count the number that are correct,

play36:45

and we compute this unbiased estimator.

play36:47

And it looks more complicated than it actually is.

play36:50

It's just complementary counting.

play36:51

You take the number of combos where all of them fail

play36:55

and subtract that out.
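Concretely, the estimator is 1 - C(n-c, k)/C(n, k) averaged over problems, and the Codex paper suggests computing it as a product rather than with raw binomial coefficients so it stays numerically stable:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples generated, c = number that pass, k = budget."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    # 1 - C(n-c, k) / C(n, k), expanded as a product to avoid huge factorials
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 20 correct, evaluated at k = 1 and k = 100.
print(pass_at_k(200, 20, 1), pass_at_k(200, 20, 100))
```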

play36:59

Cool.

play37:00

So then we train our model.

play37:02

And like I alluded to earlier, there's

play37:06

about 160 gigabytes of code which is collected

play37:09

from 54 million repositories.

play37:12

For efficient training, what we did

play37:14

was we fine tuned from GPT-3 models of various sizes.

play37:17

And this isn't actually strictly necessary.

play37:20

We find that we can get to roughly the same final loss

play37:23

in performance without this, but it is slower

play37:26

to do it without this pretraining step.

play37:28

And so we already have these models;

play37:30

why not just fine tune them?

play37:32

And one extra trick to make training much faster here is--

play37:36

in code, there's a lot of runs of spaces, right,

play37:38

and those don't get compressed efficiently in language

play37:41

because you just don't see them very often.

play37:43

So they typically get broken up into like many separate tokens.

play37:48

So we introduce additionally some tokens that

play37:51

compress runs of white space.

play37:53

And that makes training maybe like 30% or 40% more efficient.
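A toy sketch of the whitespace trick (the token names, the maximum run length, and doing this as a pre-tokenization pass are all assumptions for illustration; the real change lives inside the tokenizer's vocabulary):

```python
# Collapse runs of spaces into dedicated tokens so deep indentation costs one
# token instead of many.
import re

SPACE_TOKENS = {n: f"<space_{n}>" for n in range(2, 33)}  # assumed special tokens

def compress_whitespace(code: str) -> str:
    def repl(match):
        n = len(match.group(0))
        return SPACE_TOKENS.get(n, match.group(0))  # leave unusual runs untouched
    return re.sub(r" {2,32}", repl, code)

print(compress_whitespace("def f(x):\n        return x + 1"))
# -> "def f(x):\n<space_8>return x + 1"
```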

play37:57

So the token [INAUDIBLE]?

play38:00

Yeah, exactly, yeah.

play38:04

Great, so once we have these models,

play38:06

we can go and revisit the HumanEval data set.

play38:08

And I can share a couple of problems

play38:11

to give you a sense of where the models are at and also

play38:13

what kind of difficulty level the problems in the data set

play38:17

are at.

play38:18

So this is a 12 billion parameter model.

play38:20

The pass rate is 90%, which means that 90% of the samples

play38:24

will pass the unit test.

play38:25

This is something like anyone kind

play38:28

of doing a first day of Python would be able to do.

play38:30

So you increment all the elements of a list by 1.
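In HumanEval's format that easy problem looks roughly like the following, a signature plus a docstring, with the model asked to fill in the body (the wording here is paraphrased rather than copied from the dataset):

```python
def incr_list(l: list):
    """Return a list with all elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    """
    return [x + 1 for x in l]  # the kind of one-liner the 12B model passes ~90% of the time
```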

play38:35

Here's a problem where the pass rate is 17%.

play38:38

So this is the solution I gave-- that's the problem I

play38:41

gave earlier.

play38:41

So you are given a non-empty list of integers.

play38:44

You want to return the sum of all odd elements that

play38:46

are in even positions.

play38:48

And this might not sound that much harder to you,

play38:50

but models can often get confused about, oh,

play38:52

is odd referring to positions or elements?

play38:55

And so here, you can actually see that it's

play38:57

doing the right thing.
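For reference, a correct solution to that problem is only a couple of lines; this one is written for illustration rather than copied from the dataset's canonical solution:

```python
def solution(lst):
    """Given a non-empty list of integers, return the sum of all odd
    elements that are in even positions."""
    return sum(x for i, x in enumerate(lst) if i % 2 == 0 and x % 2 == 1)

assert solution([5, 8, 7, 1]) == 12   # 5 (index 0) + 7 (index 2)
```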

play39:00

And finally, this is an example of one

play39:03

of the harder problems in the data set.

play39:04

So the pass rate is under 1% here.

play39:07

And what's going on here is actually

play39:08

there's an encode function which takes a string.

play39:11

It kind of chunks it up into groups of three characters.

play39:14

And it does a cyclic shift on each character.

play39:16

And you have to write a decoder, something

play39:18

that reverses this operation.

play39:21

So you can see that the model-- this is a real model

play39:24

solution, so it chunks up the characters in the same way.

play39:27

You can see that the cyclic shift is the opposite way.

play39:30

So up there, it takes the first element of each group,

play39:33

moves it to the end, and now takes

play39:35

the last element of each group, moves it to the front.
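A hedged sketch of that hard problem, an encoder that chunks the string into groups of three and cycles each group, and the decoder the model has to write, which cycles each full group the opposite way (this is a paraphrase of the problem, not the dataset's exact code):

```python
def encode_cyclic(s: str) -> str:
    groups = [s[i:i + 3] for i in range(0, len(s), 3)]
    # move the first character of each full group to the end
    return "".join(g[1:] + g[0] if len(g) == 3 else g for g in groups)

def decode_cyclic(s: str) -> str:
    groups = [s[i:i + 3] for i in range(0, len(s), 3)]
    # move the last character of each full group back to the front
    return "".join(g[-1] + g[:-1] if len(g) == 3 else g for g in groups)

assert decode_cyclic(encode_cyclic("codex")) == "codex"
```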

play39:37

Yeah?

play39:39

OK, I'm wondering what's the effect

play39:42

of-- so you had a couple of examples [INAUDIBLE]

play39:45

in the comments.

play39:47

So I'm wondering if the model will

play39:49

be able to extrapolate what it's doing

play39:52

by the examples [INAUDIBLE] underlying [INAUDIBLE]..

play39:54

Right, yeah.

play39:55

So some of our tasks, there are some examples in the docstring.

play39:59

And some of them don't.

play40:00

I think it's just to kind of match

play40:01

the distribution of a real kind of task

play40:04

we find in the real world.

play40:07

In this case, it doesn't have it.

play40:08

But definitely for the unit tests, none of those

play40:10

appear within--

play40:11

I'm just curious-- if you just give it the examples

play40:15

and not give a description of the task [INAUDIBLE]..

play40:18

Oh, I see, I see.

play40:18

So can it do like pure induction, where you don't

play40:21

tell the task at all, yeah.

play40:23

I haven't tried it, to be honest.

play40:26

I think it's worth a shot.

play40:27

Yeah.

play40:28

Thanks.

play40:32

At this point, we've trained Codex models.

play40:35

We've evaluated on this metric.

play40:37

But the thing is, was it worth all this trouble, right?

play40:40

You already had these metrics like BLEU

play40:42

that are match-based in language.

play40:45

Couldn't we have just used this to [INAUDIBLE]??

play40:47

We don't need an interpreter.

play40:48

We don't need to generate so many samples.

play40:51

And it would be great if it kind of

play40:52

like separated out like this.

play40:55

But what we find is that this is--

play40:58

if you take four random problems from HumanEval

play41:01

and you plot the distribution of BLEU scores

play41:03

for correct and wrong solutions, you actually

play41:06

find a lot of distributional overlap, right?

play41:08

It's hard to distinguish the green

play41:12

from the blue distributions.

play41:14

And so this suggests that BLEU actually

play41:16

isn't a very good metric for gauging functional correctness

play41:18

and that we actually do need this new kind of metric

play41:22

and this new data set.
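To see why a match-based metric can mislead here, consider a quick illustration (requires nltk; the snippets and exact scores are invented for the sketch and depend on tokenization and smoothing choices): a wrong solution that shares most tokens with the reference can out-score a correct but differently written one.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "return sum ( x for i , x in enumerate ( lst ) if i % 2 == 0 and x % 2 == 1 )".split()
wrong     = "return sum ( x for i , x in enumerate ( lst ) if i % 2 == 1 and x % 2 == 0 )".split()  # swaps the two conditions
correct   = "total = 0 ; for i in range ( 0 , len ( lst ) , 2 ) : total += lst [ i ] % 2 * lst [ i ] ; return total".split()

smooth = SmoothingFunction().method1
print("wrong  :", sentence_bleu([reference], wrong, smoothing_function=smooth))
print("correct:", sentence_bleu([reference], correct, smoothing_function=smooth))
```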

play41:27

So now, let's explore the setting where in pass @ k,

play41:31

k is greater than 1.

play41:33

And so the first observation we have here

play41:34

is that the temperature that you sample at,

play41:37

it affects your pass @ k.

play41:40

And just for some intuition, if you do temperature zero

play41:43

sampling, you're going to get the same sample

play41:45

every single time; you're doing argmax sampling.

play41:47

So it doesn't matter how many samples you generate.

play41:50

You're just going to get the same pass rate.

play41:53

And if you want to generate 100 samples,

play41:55

right, you can afford to make some mistakes.

play41:58

You just want a very diverse set of samples.

play42:00

So you can up the temperature.

play42:02

And you can see kind of as you up the temperature, the slope

play42:05

of the kind of number of samples against pass rate,

play42:07

it becomes steeper.

play42:09

And so you can kind of take the upper hull of this

play42:12

and you can find the optimal temperature

play42:13

for each number of samples.
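For intuition, here is a minimal sketch of temperature sampling over next-token logits: temperature 0 degenerates to argmax (every sample identical), while higher temperatures flatten the distribution and buy the diversity that helps pass@100.

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=np.random.default_rng()):
    if temperature == 0.0:
        return int(np.argmax(logits))   # greedy: the same token every time
    z = logits / temperature
    probs = np.exp(z - z.max())         # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```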

play42:19

And so this brings me to personally my favorite result

play42:21

of the paper, which I call the unreasonable

play42:24

effectiveness of sampling.

play42:26

And so let me explain what's going on here.

play42:28

This is the number of parameters in the model.

play42:30

And here, you have pass rate @ 1 and pass rate @ 100.

play42:34

And the reason I use this term unreasonable effectiveness

play42:36

is that I think there's a world where,

play42:39

if the pass@1 line and the pass@100 line weren't that far apart,

play42:42

I might not be that surprised.

play42:44

At these scales, the model, it rarely makes syntactical errors

play42:48

anymore.

play42:49

If you run it, it'll run and produce some kind of output.

play42:52

So you could imagine a world where basically the model

play42:56

has some approach in mind.

play42:57

It's just repeatedly sampling that approach.

play42:59

And it's just either right or wrong.

play43:00

But instead what we find is that the model is actually

play43:03

composing different parts and producing

play43:06

functionally different things.

play43:08

And you get this huge boost from under 30% to over 70%

play43:12

just by sampling a lot of samples from the model.

play43:18

So unfortunately, knowing that one of your samples is correct

play43:22

isn't that useful if you don't have access to the unit tests.

play43:27

And now one practical setting where

play43:30

you would care about this is say you're

play43:31

creating an autocomplete tool, right,

play43:33

and you generate 100 samples.

play43:35

But you don't want to show your user 100 samples

play43:38

and have them pick one, right?

play43:39

You want to kind of try to prefilter,

play43:41

but you don't have unit tests.

play43:43

So can we kind of approximate this oracle sampling

play43:47

with some other ranking heuristic?

play43:50

So here, I'm showing a couple of different heuristics,

play43:53

like if you randomly pick one.

play43:55

But the one that seems most promising

play43:58

is to rank by mean log probability.

play44:00

And it's maybe not theoretically well-grounded,

play44:05

but in language, this kind of heuristic

play44:08

is fairly strong as well.
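A minimal sketch of that heuristic, assuming your decoder returns per-token log-probabilities alongside each sample (the variable names are illustrative): score each sample by the mean of its token log-probs and surface the highest-scoring one.

```python
import numpy as np

def rerank_by_mean_logprob(samples, token_logprobs):
    """samples: list[str]; token_logprobs: list of per-token log-prob arrays."""
    scores = [float(np.mean(lp)) for lp in token_logprobs]  # mean, not sum, so length doesn't dominate
    best = int(np.argmax(scores))
    return samples[best], scores[best]
```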

play44:13

So recall that what we're doing is

play44:15

we have this evaluation set where we have

play44:18

kind of standalone functions.

play44:19

We want to produce solutions to them.

play44:21

But when we're doing training, there's

play44:24

a lot of code that isn't relevant for this task.

play44:26

For instance, there's a lot of classes that we're seeing.

play44:29

There's actually data classes, too,

play44:30

which aren't relevant often.

play44:32

Actually, there's a lot of incorrect code on GitHub too.

play44:34

So we might be modeling incorrect solutions as well as

play44:38

correct ones.

play44:39

So one thing we thought was, let's fine-tune Codex

play44:44

further on a couple of data sets where

play44:46

they are standalone functions and you

play44:48

have kind of more guaranteed correct solutions to that.

play44:52

So what we did was we found these problems

play44:55

from a couple of sources.

play44:56

So one is competitive programming problems.

play44:58

You can go on these sites.

play45:00

Oftentimes, they'll just give you the unit tests.

play45:02

Sometimes, when they don't give you the unit tests,

play45:04

you can submit incorrect solutions

play45:05

and they'll tell you the first one you failed on.

play45:07

And you can kind of keep just doing that.

play45:09

[LAUGHTER]

play45:10

So you can get a lot of competitive programming

play45:12

problems.

play45:13

And another source is projects where continuous integration

play45:18

is enabled.

play45:19

So why are these useful?

play45:21

Because you can actually kind of do an execution tracing.

play45:25

So when you run the integration tests,

play45:27

you can get all the inputs to functions

play45:29

that are called and their outputs as well.

play45:31

And so you actually have the true function body.

play45:33

You know what the test output is supposed to be,

play45:35

so you know kind of the ground truth inputs and outputs.
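A hedged sketch of that execution-tracing idea: wrap project functions while the integration tests run and record (inputs, output) pairs, which become supervised examples whose bodies are known to be correct. Everything below is illustrative, not the actual pipeline.

```python
import functools

TRACE = []  # collected (function name, args, kwargs, return value) tuples

def trace(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACE.append((fn.__name__, args, kwargs, result))
        return result
    return wrapper

@trace
def slugify(title: str) -> str:          # hypothetical project function under test
    return title.lower().replace(" ", "-")

slugify("Hello World")                    # running the test suite populates TRACE
print(TRACE[-1])                          # ('slugify', ('Hello World',), {}, 'hello-world')
```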

play45:38

And these are kind of like two orthogonal data sets.

play45:41

One helps you with algorithmic kind of tasks.

play45:44

And one is more kind of like trying

play45:45

to manipulate command line utilities and [INAUDIBLE] that.

play45:52

So this brings us to the main figure of the Codex paper.

play45:55

So really what we're seeing is a progression of capabilities.

play45:58

So with GPT-3 on this HumanEval data set, the pass rate @ 1

play46:02

is 0 basically.

play46:05

You can generate one or two lines

play46:06

coherently but never really a whole program coherently.

play46:10

Now, when you fine tune on code, which

play46:13

is Codex, this orange line, you start

play46:14

to see some non-negligible performance on this data set.

play46:18

When you do this additional supervised fine-tuning--

play46:21

that's this green line--

play46:23

you get even better pass rates.

play46:24

And then if you kind of generate 100 samples from this model,

play46:29

rerank with mean logp, even better pass rates.

play46:32

And finally, of course, we have the unit-test oracle.

play46:35

It gives you the best pass rates.

play46:37

So one question here is, can you actually

play46:39

use a reranking tool, like put it in the model?

play46:41

Can you use it as a backprop signal?

play46:43

Yeah, yeah, so we can explore that.

play46:46

I don't know if I can say too much about those results.

play46:49

Yeah, got it, got it.

play46:50

But yeah.

play46:53

And finally, I don't want to suggest

play46:54

that these models are perfect.

play46:56

They have a lot of limitations that human programmers

play46:59

don't run into.

play46:59

So one is like--

play47:02

actually all generative models are--

play47:04

autoregressive generative models,

play47:05

we have some problems with variable binding.

play47:07

So when there's a lot of variables going on,

play47:09

like a lot of operations going on,

play47:10

sometimes it's hard to figure out which operation

play47:13

is binding to which variable.

play47:14

So you can kind of see some examples of that on the left.

play47:17

And one other kind of counterintuitive behavior

play47:19

is composition.

play47:20

So we can take a bunch of very simple building blocks,

play47:23

like take a string and reverse it,

play47:25

or delete every third character or something.

play47:28

And a human, if you can chain two of these operations,

play47:30

you could probably chain 10 of them.

play47:32

But our models aren't able to do that yet.
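To make the composition point concrete, here is an invented example: each building block is trivial, and a person who can chain two of them can chain ten, but asking a model to apply a long chain in order is where it tends to slip.

```python
def reverse(s: str) -> str:
    return s[::-1]

def drop_every_third(s: str) -> str:
    return "".join(ch for i, ch in enumerate(s) if (i + 1) % 3 != 0)

def uppercase(s: str) -> str:
    return s.upper()

# The kind of prompt that trips models up: "reverse the string, then delete
# every third character, then uppercase the result."
print(uppercase(drop_every_third(reverse("composition"))))
```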

play47:38

Cool.

play47:39

So moving on to the conclusion, we've

play47:42

had four main points in today's talk.

play47:44

So first, progress in neural language modeling

play47:46

has been fairly rapid.

play47:48

And second, GPT wasn't the result of a push on language modeling, more

play47:52

of a result of work on pushing unsupervised learning

play47:55

in language.

play47:57

The third point is that autoregressive modeling

play47:59

is universal.

play48:00

And it can yield strong results, even when there

play48:02

are strong inductive biases, like in images or in text

play48:06

to image.

play48:07

And finally, we can produce strong code generating models

play48:10

by fine-tuning GPT-3 on code.

play48:13

And sampling is an unreasonably effective way

play48:15

to improve model performance.

play48:18

Cool, and to end with some acknowledgments,

play48:20

I want to thank my Codex primary co-authors, some mentors

play48:24

at OpenAI, and the algorithms team, which

play48:27

I've worked very closely with.

play48:29

Great.

play48:30

Thank you guys for your attention.
