LLM Foundations (LLM Bootcamp)

The Full Stack
11 May 2023 · 47:47

Summary

TL;DR: The transcript discusses the Transformer architecture and its significance in machine learning, highlighting its use in models like GPT, T5, and BERT. It explains the concept of attention mechanisms, the role of pre-training and fine-tuning, and the evolution of large language models (LLMs). The talk also touches on the challenges of machine learning, such as the complexity of natural language processing and the importance of data sets in training. The presenter provides insights into the future of AI, emphasizing the potential of models that combine reasoning with information retrieval.

Takeaways

  • 🌟 The talk introduces the Transformer architecture and its significance in the field of machine learning, highlighting its adaptability across various tasks.
  • 🤖 The distinction between traditional programming (Software 1.0) and the machine learning mindset (Software 2.0) is explained, emphasizing the shift from algorithmic to data-driven approaches.
  • 📈 The three main types of machine learning are outlined: unsupervised learning, supervised learning, and reinforcement learning, each suited to different data structures and objectives.
  • 🧠 The inspiration behind neural networks and deep learning is drawn from the brain's structure and function, with the perceptron model being a key building block.
  • 🔢 Computers process inputs and outputs as numerical vectors or matrices, requiring text to be tokenized and converted into numerical representations.
  • 🏋️‍♂️ The training process of neural networks involves backpropagation, where the loss function guides the adjustment of weights to improve predictions.
  • 🔄 The importance of splitting data into training, validation, and test sets is emphasized for model evaluation and to prevent overfitting.
  • 📚 The concept of pre-training and fine-tuning is introduced, where a large model is trained on general data and then further trained on specific tasks.
  • 🌐 The rise of model hubs like Hugging Face demonstrates the growing accessibility and sharing of pre-trained models and datasets.
  • 🔄 The Transformer model's architecture, including its encoder and decoder components, is explained, along with the concept of attention mechanisms.
  • 🚀 The continuous growth and development of language models like GPT-3 and its successors are highlighted, showcasing the trend towards larger models with more parameters.

Q & A

  • What are the four key topics discussed in the transcript?

    -The four key topics discussed are the Transformer architecture, notable large language models (LLMs) such as GPT, details of running a Transformer, and the foundations of machine learning.

  • What is the difference between software 1.0 and software 2.0 in the context of programming?

    -Software 1.0 refers to traditional programming where a person writes code for a robot to take input and produce output based on algorithmic rules. Software 2.0, on the other hand, involves writing a robot that uses training data to produce another robot, which then takes input data and produces output. The second robot is not algorithmic but is driven by parameters learned from the training data.

  • What are the three main types of machine learning mentioned in the transcript?

    -The three main types of machine learning are unsupervised learning, supervised learning, and reinforcement learning.

  • How do neural networks and deep learning relate to the brain?

    -Neural networks and deep learning are inspired by the brain. They attempt to mimic the way neurons in the brain receive inputs, process them, and produce outputs. The concept of a perceptron, which is a model of a neuron, is used to create neural networks by stacking perceptrons in layers.

  • What is the significance of GPUs in the advancement of deep learning?

    -GPUs, which were originally developed for graphics and video games, are highly efficient at performing matrix multiplications. Since neural networks and deep learning involve large numbers of matrix multiplications, the application of GPUs has significantly accelerated the training and development of these models.

  • What is the role of the validation set in machine learning?

    -The validation set is used to prevent overfitting in machine learning models. It allows developers to evaluate the model's performance on unseen data during training and adjust the model as needed, ensuring that the model generalizes well to new data.

  • How does the Transformer architecture handle the issue of input and output sequence lengths?

    -The Transformer architecture uses a mechanism called attention, which allows the model to weigh the importance of different parts of the input sequence when predicting the next token in the output sequence. This mechanism enables the model to handle variable lengths of input and output sequences effectively.

  • What is the purpose of positional encoding in the Transformer model?

    -Positional encoding is added to the input embeddings to provide the model with information about the order of the tokens within the sequence. Since the Transformer architecture does not inherently consider the order of tokens, positional encoding helps the model understand the sequence's structure and the relative positions of the tokens.

  • What is the concept of pre-training and fine-tuning in machine learning models?

    -Pre-training involves training a large model on a vast amount of data to learn general representations. Fine-tuning then involves further training this pre-trained model on a smaller, more specific dataset to adapt the model for a particular task or domain.

  • How does the concept of instruction tuning enhance the capabilities of large language models?

    -Instruction tuning involves fine-tuning a pre-trained model on a dataset of instructions and desired outputs. This process improves the model's ability to follow instructions and perform tasks in a zero-shot or few-shot context, without the need for additional examples or prompts.

  • What is the significance of including code in the training data for large language models?

    -Including code in the training data has been found to improve the performance of large language models on non-code tasks. It enhances the model's understanding of logical structures and problem-solving, which can be beneficial for a wide range of applications.

Outlines

00:00

🌟 Introduction to Machine Learning and Transformers

The speaker begins by outlining the diverse audience and the topics to be covered, including the Transformer architecture and notable large language models (LLMs) like GPT. The foundational concepts of machine learning are briefly reviewed, highlighting the shift from software 1.0 to software 2.0, which involves training a model with data to produce outputs. The different types of machine learning are introduced: unsupervised, supervised, and reinforcement learning. The complexity of machine learning is discussed, emphasizing the challenge of interpreting varied inputs and the importance of understanding the structure of data. The limitations of traditional methods like logistic regression and support vector machines are noted, leading into a discussion on neural networks and deep learning as the dominant approach, inspired by the human brain's neural structure.

05:02

🧠 Deep Dive into Neural Networks and Training Methods

The paragraph delves into the concept of neural networks, drawing parallels with the human brain's neurons. The perceptron model and its evolution into the multi-layer perceptron are explained. The importance of weights in neural networks and the role of GPUs in accelerating matrix multiplications for deep learning are highlighted. The training process of neural networks is described, including the use of mini-batches, loss functions, backpropagation, and the division of data into training, validation, and test sets. The speaker also touches on pre-training and fine-tuning strategies, emphasizing the value of sharing pre-trained models and the rapid growth of model hubs like Hugging Face.
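
To make the "weights are just matrices" and cross-entropy points concrete, here is a minimal NumPy sketch; all sizes, data, and labels are invented for illustration, and the backpropagation/update step itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mini-batch: 4 examples, 8 input features, 3 output classes.
X = rng.normal(size=(4, 8))          # inputs
y = np.array([0, 2, 1, 2])           # ground-truth class labels

# The "parameters" of the network are just matrices of weights.
W1 = rng.normal(size=(8, 16)) * 0.1  # first layer
W2 = rng.normal(size=(16, 3)) * 0.1  # second layer

# Forward pass: matrix multiplications plus a non-linearity.
hidden = np.maximum(0, X @ W1)       # ReLU in place of the original step function
logits = hidden @ W2

# Softmax turns the logits into a probability distribution per example.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Cross-entropy loss: ground truth times log of the predictions, summed.
loss = -np.log(probs[np.arange(len(y)), y]).mean()
print("loss:", loss)
```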

10:02

🚀 Understanding the Transformer Architecture

The Transformer architecture is introduced as a revolutionary approach to machine learning tasks. Its origins from the 'Attention is All You Need' paper and its application beyond translation tasks are discussed. The speaker simplifies the complex structure of the Transformer by focusing on the decoder part and explaining the process of text completion. The concept of tokenization and the conversion of text into numerical vectors are introduced, setting the stage for understanding how Transformers handle inputs and produce outputs.

15:04

🧬 The Mechanics of Attention Mechanism

The attention mechanism within the Transformer model is explored, highlighting its ability to weigh the importance of different tokens for predicting the next token. The concept of query, key, and value vectors is introduced, along with the mathematical formulation of attention as a weighted sum of inputs. The speaker explains the rationale behind attention, emphasizing its efficiency in focusing on relevant tokens. The paragraph also introduces the idea of multi-head attention, allowing the model to learn multiple ways of transforming inputs simultaneously, and the concept of masking attention to limit the model's view to previously seen inputs during training.
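
As a rough illustration of the query/key/value computation and the causal mask described here, below is a minimal single-head NumPy sketch; the dimensions and random weights are placeholders, and real implementations add multiple heads, batching, and projections learned by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16

x = rng.normal(size=(seq_len, d_model))          # one token embedding per row

# Each input is projected three ways: as query, key, and value.
Wq = rng.normal(size=(d_model, d_model)) * 0.1
Wk = rng.normal(size=(d_model, d_model)) * 0.1
Wv = rng.normal(size=(d_model, d_model)) * 0.1
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Dot-product similarity between queries and keys, scaled by sqrt(d).
scores = Q @ K.T / np.sqrt(d_model)

# Causal mask: position i may only attend to positions <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Softmax makes each row of attention weights sum to one.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Each output is a weighted sum of the value vectors.
out = weights @ V
print(out.shape)   # (5, 16): same shape as the input
```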

20:04

๐ŸŒ Positional Encoding and the Transformer's Parallel Processing

The importance of positional encoding in Transformers is discussed, addressing the lack of inherent order in the attention mechanism. The speaker explains how positional encoding vectors are added to embeddings to provide a sense of sequence. The concept of skip or residual connections is introduced, explaining how they facilitate the backward propagation of loss through the model. The role of layer normalization in maintaining uniformity during training is also covered. The paragraph concludes with an overview of the Transformer's feed-forward layer and the repetition of the entire process across multiple layers.
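
One common concrete choice is the sinusoidal encoding from the original paper; the sketch below (sizes are arbitrary) just builds those vectors and adds them to stand-in embeddings, which is all the positional-encoding step amounts to. Simpler schemes, such as the incrementing vector mentioned in the talk, also work.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]        # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]          # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

embeddings = np.zeros((8, 16))                    # stand-in token embeddings
x = embeddings + positional_encoding(8, 16)       # simply added element-wise
print(x.shape)
```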

25:05

📈 Scaling Up: Parameters, Layers, and Attention Heads

The speaker discusses the scaling of Transformer models, focusing on the number of layers, embedding dimensions, and attention heads. The GPT-3 model is highlighted for its massive parameter count and layer structure. The distribution of parameters across different operations within the model is considered, and the efficiency of parallel processing in Transformers is praised. The speaker also mentions the work of Anthropic in trying to understand the inner workings of Transformers and the potential of these models as general-purpose differentiable computers.
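
As a rough sanity check on the quoted GPT-3 numbers, the back-of-the-envelope sketch below uses the common approximation that each Transformer layer holds about 12·d_model² weights (roughly 4·d² in the attention projections and 8·d² in the feed-forward block), ignoring biases and layer norms; the sizes are the published GPT-3 figures (the talk rounds the embedding dimension to 12,000).

```python
n_layers, d_model, vocab = 96, 12288, 50257   # published GPT-3 "davinci" sizes

per_layer = 12 * d_model ** 2                 # ~4*d^2 attention + ~8*d^2 feed-forward
embedding = vocab * d_model                   # token embedding matrix

total = n_layers * per_layer + embedding
print(f"{total / 1e9:.0f}B parameters")       # ~175B, matching the quoted figure
```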

30:05

📚 Notable Large Language Models and Their Impact

The speaker reviews notable large language models, starting with BERT, which introduced the concept of bidirectional encoding. The T5 model's text-to-text transfer approach and its training on a diverse corpus are discussed. GPT models are introduced as generative pre-trained Transformers, with a focus on GPT-2 and its training data. The speaker also touches on the encoding process used by GPT models and the release of GPT-3, which demonstrated abilities like few-shot and zero-shot learning. The training data for GPT-3 is described, including its sources and the unique aspects of its dataset.

35:07

๐Ÿ” Exploring the Chinchilla Model and Instruction Tuning

The speaker introduces the Chinchilla model, developed by DeepMind, which optimizes the distribution of compute resources between model size and data quantity. The model's performance is compared to larger models, demonstrating the effectiveness of training on more data. The open-source LLaMA model from Meta AI is mentioned, along with its training data. The inclusion of code in training data is discussed, along with the benefits it provides. The concept of instruction tuning is explored, explaining the shift from text completion to instruction following. The process of fine-tuning pre-trained models with human feedback is described, and the impact of fine-tuning on model capabilities is discussed.

40:07

๐Ÿ› ๏ธ Enhancing Models with Retrieval and Future Directions

The speaker discusses the RETRO model from DeepMind, which combines a smaller model with a large database of knowledge for fact retrieval. The potential of this approach to enhance reasoning and coding abilities is highlighted, although the current limitations are acknowledged. The speaker reflects on the future of LLMs and the potential for models that can both learn from data and retrieve information as needed.

Keywords

💡Machine Learning

Machine learning is a subset of artificial intelligence that provides systems the ability to learn from and make decisions based on data. In the context of the video, it forms the foundation of the discussed Transformer architecture and large language models (LLMs) like GPT, T5, and BERT, which are trained to recognize patterns and generate human-like text based on input data.

💡Transformer Architecture

The Transformer architecture is a deep learning model introduced in the paper 'Attention Is All You Need'. It revolutionized natural language processing by effectively handling long-range dependencies in data. The architecture is based on self-attention mechanisms, allowing the model to weigh the importance of different parts of the input data relative to each other. In the video, the Transformer is highlighted as the dominant approach in modern machine learning for tasks like translation, text understanding, and generation.

💡GPT (Generative Pre-trained Transformer)

GPT, or Generative Pre-trained Transformer, is a type of language prediction model that uses unsupervised learning to generate coherent and contextually relevant text. GPT models are trained on a diverse range of internet text and can perform a variety of language tasks, from translation to question-answering, based on the input they receive. The video mentions GPT as an example of a notable LLM and discusses its ability to learn from a large dataset and exhibit few-shot and zero-shot learning capabilities.
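
The generation loop described in the talk (sample the next token, append it, run the model again) can be sketched as follows; here next_token_probs is a stand-in for a real trained decoder, not an actual GPT call:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 100

def next_token_probs(tokens: list[int]) -> np.ndarray:
    """Stand-in for a trained decoder: a probability distribution
    over the next token given the tokens seen so far."""
    logits = rng.normal(size=VOCAB_SIZE)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

tokens = [1, 42, 7]                          # a prompt, already tokenized
for _ in range(5):                           # generate 5 more tokens
    probs = next_token_probs(tokens)
    nxt = rng.choice(VOCAB_SIZE, p=probs)    # sample from the distribution
    tokens.append(int(nxt))                  # append and repeat
print(tokens)
```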

💡BERT (Bidirectional Encoder Representations from Transformers)

BERT, or Bidirectional Encoder Representations from Transformers, is a language model that focuses on understanding the context of words in a sentence by considering the entire sequence of words rather than just the immediate surroundings of a word. Unlike GPT, BERT is designed for a wide range of natural language understanding tasks, such as sentiment analysis and question-answering. The model is pre-trained on a large corpus of text and then fine-tuned for specific tasks.
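
A minimal sketch of how the masked-language-modelling data described in the talk might be prepared; the token IDs and the [MASK] ID below are invented, and real BERT pre-training adds a few extra corruption rules on top of this:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103                                  # hypothetical [MASK] token id

tokens = np.array([71, 12, 905, 33, 6, 414, 88, 250, 9, 17])
labels = tokens.copy()                         # the model must predict the originals

mask = rng.random(len(tokens)) < 0.15          # mask roughly 15% of positions
inputs = np.where(mask, MASK_ID, tokens)       # replace those positions with [MASK]

# The loss is computed only at the masked positions (inputs -> labels).
print(inputs, labels, mask)
```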

💡T5 (Text-to-Text Transfer Transformer)

T5, or Text-to-Text Transfer Transformer, is a model that unifies various natural language processing tasks into a single text-to-text format. Instead of having separate models for different tasks like translation or summarization, T5 converts all tasks into an input-to-output format where the input and output are both text strings. This approach simplifies the model training and allows for easier adaptation to new tasks.
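
Concretely, the text-to-text framing just means every task is serialized as an input string and a target string; the examples below are illustrative rather than taken verbatim from the T5 training set:

```python
# Every task becomes (input text, target text); one model handles them all.
examples = [
    ("translate English to German: The house is blue.", "Das Haus ist blau."),
    ("summarize: The agreement on the European Economic Area was signed in 1992 ...",
     "The EEA agreement was signed in 1992."),
    ("cola sentence: The books is on the table.", "unacceptable"),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```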

💡Attention Mechanism

The attention mechanism is a crucial component of the Transformer architecture that allows the model to weigh the importance of different parts of the input data. It operates by calculating a weighted sum of the input vectors, with the weights determined by the relevance of each input to the output being produced. This mechanism enables the model to focus on certain parts of the input sequence while ignoring others, which is particularly useful in tasks like translation where not all parts of the input are equally important for predicting the next word.

💡Embedding

Embedding in machine learning and NLP refers to the process of representing categorical data, such as words, as dense numerical vectors in a continuous space. This technique allows the model to understand the semantic relationships between words by capturing their meanings through vector representations. Embeddings are learned from the data during the training process and are a fundamental part of how models like GPT and BERT process text.
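
A small sketch of the point made in the talk that an embedding layer is just a learned matrix: multiplying a one-hot vector by it is the same as selecting one row. The 30,000-token vocabulary and 512-dimensional embedding echo the sizes used as an example in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 30_000, 512

E = rng.normal(size=(vocab_size, d_model)) * 0.02   # learned embedding matrix

token_id = 1234
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ E        # the "one-hot times matrix" view
via_lookup = E[token_id]        # equivalent, and how it is done in practice

print(np.allclose(via_matmul, via_lookup))   # True
```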

💡Fine-Tuning

Fine-tuning is a machine learning technique where a pre-trained model is further trained on a smaller, more specific dataset to adapt it for a particular task. This process leverages the knowledge the model has gained during pre-training to quickly and effectively learn new tasks. In the context of the video, fine-tuning is used to specialize large language models for tasks like medical imaging or specific NLP applications, often resulting in improved performance compared to training a model from scratch.
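
A minimal sketch of the idea, assuming a frozen pre-trained feature extractor (a random stand-in here) and a small task-specific head trained on a handful of synthetic labelled examples; real fine-tuning often also updates the pre-trained weights at a small learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

W_pretrained = rng.normal(size=(8, 16)) * 0.1   # stand-in pre-trained weights (frozen)
w_head = np.zeros(16)                           # new task head, trained from scratch

X = rng.normal(size=(32, 8))                    # small task-specific dataset
y = (X[:, 0] > 0).astype(float)                 # synthetic binary labels

lr = 0.5
for _ in range(200):
    feats = np.tanh(X @ W_pretrained)           # frozen feature extractor
    p = 1 / (1 + np.exp(-(feats @ w_head)))     # logistic head prediction
    grad = feats.T @ (p - y) / len(y)           # cross-entropy gradient w.r.t. the head only
    w_head -= lr * grad                         # update only the head

print("train accuracy:", ((p > 0.5) == y).mean())
```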

💡Zero-Shot Learning

Zero-shot learning is a machine learning paradigm where a model is capable of recognizing and performing tasks without any prior training or experience with those tasks. In the context of the video, this refers to the ability of models like GPT-3 to perform well on tasks even when they are not explicitly trained on examples of those tasks. The model uses its general understanding of a wide range of topics to infer how to tackle new problems.

💡Instruction Tuning

Instruction tuning is a process in which a pre-trained language model is further trained on a dataset of instructions and corresponding desired outputs. This fine-tuning approach aims to improve the model's ability to follow instructions and perform tasks in a manner that closely aligns with human expectations. The goal is to enable the model to understand and execute tasks based on a single prompt or description, without the need for multiple examples.
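
A minimal sketch of what an instruction-tuning record might look like before it is fed to the ordinary next-token-prediction training loop; the field names and template below are invented for illustration, not taken from any specific dataset:

```python
records = [
    {"instruction": "Summarize the following paragraph in one sentence.",
     "input": "The Transformer architecture was introduced in 2017 ...",
     "output": "The Transformer, introduced in 2017, replaced recurrence with attention."},
    {"instruction": "Translate to French.",
     "input": "The agreement was signed in August 1992.",
     "output": "L'accord a été signé en août 1992."},
]

def to_training_text(r: dict) -> str:
    # The model is still just trained to continue text; the instruction and the
    # desired answer are simply laid out in a fixed template.
    return (f"### Instruction:\n{r['instruction']}\n\n"
            f"### Input:\n{r['input']}\n\n"
            f"### Response:\n{r['output']}")

print(to_training_text(records[0]))
```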

💡Retrieval Enhancing

Retrieval enhancing is a machine learning approach where a model is trained to retrieve and utilize relevant information from a large database to assist in its decision-making process. This method is used to improve the model's performance on tasks that require external knowledge by providing it with a source of information that it can query when needed. In the context of the video, retrieval enhancing is discussed as a potential future direction for LLMs, where a smaller, more efficient model could look up facts or data from a database to complement its reasoning abilities.
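
A minimal sketch of the retrieve-then-generate idea, assuming a toy in-memory "database" and a stand-in embedding function; a real system such as RETRO uses learned embeddings, a far larger store, and feeds the retrieved chunks into the model itself rather than only into the prompt:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a deterministic pseudo-random unit vector per text.
    A real system would use a learned embedding model here."""
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    v = local.normal(size=64)
    return v / np.linalg.norm(v)

database = [
    "The EEA agreement was signed in August 1992.",
    "Chinchilla showed smaller models trained on more data can win.",
    "GPT-3 has 96 layers and roughly 175 billion parameters.",
]
db_vecs = np.stack([embed(d) for d in database])

query = "How many parameters does GPT-3 have?"
scores = db_vecs @ embed(query)          # cosine similarity (all vectors are unit norm)
best = database[int(np.argmax(scores))]  # with a real embedder, the relevant passage wins

# The retrieved passage is then given to the language model alongside the question.
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```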

Highlights

Introduction to machine learning and the Transformer architecture, including its applications and variations.

Explanation of the differences between software 1.0 and software 2.0, highlighting the shift from traditional programming to machine learning mindset.

Overview of various types of machine learning: unsupervised learning, supervised learning, and reinforcement learning.

Discussion on the challenges of machine learning, such as the infinite variety of inputs and the complexity of the natural world.

Description of neural networks and deep learning, drawing inspiration from the brain's structure and function.

Explanation of how neural networks are stored and processed as vectors and matrices of numbers.

The role of GPUs in accelerating matrix multiplications, which is crucial for neural network computations.

Training neural networks using batch data, loss functions, and backpropagation to adjust parameters.

The importance of splitting data into training, validation, and test sets to avoid overfitting and ensure model robustness.

Pre-training and fine-tuning strategies for machine learning models, leveraging large datasets and model hubs like Hugging Face.

The Transformer architecture's dominance in machine learning tasks, originating from the 'Attention is All You Need' paper.

Detailed breakdown of the Transformer decoder, including tokenization, embedding, and the attention mechanism.

The concept of multi-head attention and its ability to learn multiple ways of transforming inputs simultaneously.

The use of positional encoding to incorporate the order of tokens within the Transformer model.

The addition of feed-forward layers and layer normalization in the Transformer architecture for further representation enhancement.

Scaling up the model size, embedding dimension, and number of attention heads in GPT-3 to achieve better performance.

Overview of notable large language models like BERT, T5, and GPT, and their contributions to NLP and AI.

Instruction tuning and its impact on improving model performance for zero-shot and few-shot learning tasks.

The potential future of AI with retrieval-enhancing models like RETRO, combining smaller models with external knowledge databases.

Transcripts

play00:01

[Music]

play00:03

all right I'm gonna get started I'm

play00:06

going to talk about

play00:07

four things

play00:09

lunch is in an hour so

play00:13

um

play00:13

I'm gonna speed run some key ideas and

play00:15

just machine learning we have a diverse

play00:16

audience we have machine learning

play00:19

experts and research scientists we also

play00:21

have

play00:22

Executives and investors that have never

play00:25

um you know trained a logistic

play00:27

regression or anything like that

play00:30

we're going to talk about the

play00:31

Transformer architecture

play00:32

we're going to talk about notable llms

play00:34

you might have heard of

play00:36

you know a little thing called GPT

play00:38

there's also other ones like T5 BERT

play00:40

and Chinchilla and so on that you should

play00:42

probably know

play00:43

and we'll talk about some just details

play00:45

of running a Transformer

play00:47

so the foundations of machine learning

play00:49

this I expect this to be

play00:52

not needed for most of you

play00:54

I still think it's worth sharing because

play00:56

for some of you it is needed and also

play00:58

just to get on the same page about what

play01:00

what's happening

play01:01

you know software 1.0 in Andrej Karpathy's

play01:05

terminology

play01:06

is traditional programming right where a

play01:08

person writes basically a robot that

play01:11

then takes input and produces output and

play01:14

the robot is entirely algorithmic so you

play01:16

you know the the person that has to

play01:18

specify all possible edge cases for the

play01:20

input

play01:20

and have robust tests with output and so

play01:24

on

play01:24

with software 2.0 mindset which is the

play01:27

machine learning mindset

play01:29

the person

play01:30

writes a robot that then takes a bunch

play01:33

of training data and produces another

play01:35

robot that that's going to take input

play01:37

data and produce an output

play01:39

and you can't really test that second

play01:41

one because it's not that second robot

play01:44

is not algorithmic it's it's now driven

play01:46

by a bunch of parameters that you don't

play01:48

really have much visibility into you

play01:50

only have visibility into your training

play01:51

system so that really changes the

play01:53

mindset of what what's actually

play01:55

happening what type of machine learnings

play01:57

are there

play01:58

there's unsupervised learning it's like

play02:00

generative AI used to find structure in

play02:03

the data generate more data

play02:05

their supervised learning which you get

play02:07

some data as input then you produce

play02:09

something that looks a little different

play02:10

as as output usually a label for that

play02:12

input data

play02:14

or maybe a prediction about what's going

play02:16

to come next

play02:17

and then you have reinforcement learning

play02:19

which you have agents that act in an

play02:22

environment they collect rewards learn

play02:24

to act and these have traditionally been

play02:26

pretty separate but they've mostly

play02:30

converged on just

play02:32

really is just supervised learning

play02:34

sometimes called self-supervised

play02:35

learning where you can formulate

play02:38

everything it's just a supervised

play02:39

problem so if you're doing generative

play02:41

problems you can formulate it as this

play02:44

first bit of data

play02:46

is labeled with the continuation of the

play02:49

data so that's kind of supervised

play02:50

formulation

play02:51

and then reinforcement learning you can

play02:53

formulate as given the state of the

play02:56

world what is the next move that that

play02:59

would collect the most rewards you don't

play03:01

have to do anything special you can just

play03:02

treat it as supervised learning

play03:04

to a computer inputs and outputs are

play03:07

always just numbers

play03:08

and that's kind of important to remember

play03:10

because we might see something like this

play03:12

I mean it really isn't a photograph but

play03:15

we definitely see it as a photograph of

play03:17

of Abe Lincoln

play03:19

and then we read output and it's really

play03:21

just a bunch of letters but we have

play03:23

meaning to the word Lincoln a machine

play03:25

doesn't have any of that right they just

play03:26

see a bunch of numbers and then they see

play03:29

some other numbers that are token

play03:31

vocabulary IDs

play03:33

so everything is just a vector or Matrix

play03:35

of numbers to a machine learning

play03:37

computer

play03:38

and we ask it to predict things about

play03:40

the natural world why is that hard well

play03:44

there's an infinite variety of inputs

play03:46

that all can mean the same thing so

play03:49

let's say people are talking about a

play03:50

movie they watched someone might say you

play03:52

know I love the movie that's pretty easy

play03:54

to interpret they use the word love and

play03:55

movie but then someone else might say

play03:57

you know as good as The Godfather

play03:59

now you need to know like is the

play04:00

Godfather another movie would this

play04:03

person consider it to be a good movie

play04:04

and so on or you know someone might just

play04:07

say something unintelligible to you

play04:10

um but it might mean the same thing and

play04:12

the computer has to learn all of that

play04:15

and uh meaningful differences can be

play04:17

very tiny so like you know these muffins

play04:20

really do look like Chihuahuas but

play04:22

very different in function

play04:25

and the structure of the world is

play04:27

complex even if you have a face

play04:29

recognizer

play04:30

and you trained it on well lit faces but

play04:33

now you're in this dramatic lighting

play04:35

setting and you only see half the face

play04:36

it might totally fail

play04:38

and it's it's because there's physical

play04:40

structure to to how the image is formed

play04:42

that um

play04:44

just makes everything way more complex

play04:46

than it seems like it should be

play04:48

but how is it done so there's many

play04:50

methods for machine learning through the

play04:52

gears the simplest one is called

play04:54

logistic regression probably learned

play04:56

that in some college course uh in the

play04:59

90s people used support vector machines

play05:01

there's decision trees that people do

play05:03

for tabular data a lot like the XGBoost

play05:06

Library

play05:07

but one approach to machine learning

play05:09

really is dominant nowadays and that's

play05:11

neural networks and it's also called

play05:13

Deep learning

play05:15

um

play05:16

and the inspiration for that comes from

play05:19

the the brain which is like the one

play05:21

thing that we know to be really

play05:22

intelligent in the world

play05:24

people started working on this in the

play05:26

40s and the 50s they took a look at what

play05:29

and the brain is doing it's composed of

play05:32

a bunch of neurons

play05:33

um and neuron receives electrical inputs

play05:36

on one end and if there's enough input

play05:39

to the neuron then it fires and sends

play05:43

its output down to other neurons

play05:46

and the Brain itself has inputs like

play05:49

through my vision to my senses and

play05:51

outputs like me speaking or moving

play05:54

and the formalization that people came

play05:56

up with is that they decided to

play05:58

represent the neuron as this model

play06:00

called the perceptron

play06:02

which has inputs coming in which is just

play06:05

numbers and the numbers get multiplied

play06:07

by some weights if the sum of the

play06:10

weighted inputs is passed some kind of

play06:13

threshold then the neuron you know the

play06:15

perceptron fires in a sense that's

play06:18

modeled with a step function

play06:19

mathematically

play06:21

and then to create a brain you take a

play06:24

bunch of perceptrons

play06:26

and you stock them in layers and you

play06:29

cannot you know every every perception

play06:31

on one layer to every perception on the

play06:33

next layer and that's called the

play06:34

multi-layer perceptron and it's kind of

play06:37

like a brain

play06:38

um now how is it stored to them in a

play06:41

machine it's just a vector of numbers

play06:43

right so like what's important about the

play06:44

perceptron are the weights and that's

play06:47

just what are the things you're

play06:48

multiplying inputs by

play06:50

and then a layer of perceptrons is then

play06:53

a matrix of numbers

play06:55

and the whole neural network is a set of

play06:57

matrices of numbers

play06:59

and we call these parameters so

play07:01

parameters of the neural network are

play07:03

just all the perceptron weights inside

play07:05

of that Network

play07:06

and or sometimes you call them weights

play07:08

as well

play07:09

and all the neural network operations

play07:11

are just Matrix multiplications for this

play07:13

reason

play07:14

and one thing people figured out is that

play07:17

gpus which were developed for graphics

play07:19

like video games are really fast at

play07:21

Matrix multiplications neural networks

play07:24

are just doing Matrix multiplications so

play07:26

that kicked off the deep learning

play07:28

Revolution when people applied gpus to

play07:31

actually running neural networks

play07:34

how do you train in neural network so

play07:36

let's say you have a big X data so maybe

play07:39

it's images maybe it's text and you have

play07:41

labels why so like labels like cat and

play07:44

not a dog

play07:46

so you take a little batch

play07:48

sometimes called a mini batch of data

play07:50

Little X

play07:51

you use your current model so you use

play07:54

your current neural network which starts

play07:56

out with just random weights to run the

play07:59

X kind of through the network and out

play08:02

comes a prediction so let's call that y

play08:04

Prime we'll compute what's known as a

play08:07

loss function on the ground truth label

play08:10

Y and the prediction y Prime

play08:12

the most uh you know prevalent loss is

play08:16

the cross entropy loss which sounds

play08:18

complicated but it really is just this

play08:20

function you just multiply the ground

play08:23

Truth by the log of the predictions and

play08:27

you sum it up

play08:28

and then that gives you

play08:31

um some numbers that you then what's

play08:34

known as backpropagate through all the

play08:36

layers of the model this is too

play08:37

complicated to go into

play08:38

but basically you just think of the

play08:42

network pushing a prediction if the

play08:44

prediction is correct it gets signal

play08:46

that the parameters are great if the

play08:49

prediction is not correct it gets

play08:51

signaled that the parameters need to be

play08:52

adjusted in the direction of making the

play08:54

predictions more correct

play08:56

and then you repeat that until your loss

play08:59

stops decreasing essentially

play09:02

and in order to actually do machine

play09:05

learning you always split your data into

play09:06

a training set a validation set and a

play09:09

test set the training set is the largest

play09:11

why do you need the validation set

play09:13

because you might overfit so if you

play09:16

train too much the parameter might you

play09:19

know the network might be really good on

play09:21

just your training data but actually it

play09:22

becomes

play09:23

worse than it used to be on the

play09:25

validation data so we kind of look at

play09:27

the validation loss and when that stops

play09:30

improving that's when we stop training

play09:32

or maybe we set the some frame some

play09:35

hyper parameters about the model such as

play09:38

how many layers does it have how many uh

play09:41

or like what activation function does it

play09:42

have

play09:43

and the test set should really be left

play09:45

alone as much as possible and it's

play09:48

really for measuring

play09:49

how your train model is going to work in

play09:52

production so it's like you shouldn't be

play09:54

looking at the test set you should only

play09:56

look at it basically once when you're

play09:57

done

play09:58

and all of this applies to your

play10:00

experimentation with prompts too so if

play10:02

you're not doing traditional machine

play10:03

learning it's not like you have to

play10:05

forget about the validation side uh you

play10:08

really should be having this mindset

play10:10

even if you're not doing like if you

play10:12

don't have a loss function but you're

play10:14

just kind of looking at some prompts

play10:16

and you're trying to figure out which

play10:17

one's better there's still that notion

play10:19

of a validation set and a potentially a

play10:21

test set

play10:23

some some more terminology uh

play10:25

pre-training you might hear so that just

play10:28

basically means training like training a

play10:29

large model on a lot of data

play10:31

and the reason it's called pre-training

play10:33

is because oftentimes you would take

play10:35

that large model

play10:36

and then train it a little bit more with

play10:38

less data and that's called fine tuning

play10:41

and the reason you do that is because

play10:43

maybe you have a lot of label data in

play10:46

just general internet imagery Like

play10:48

Liquor style images but you don't have a

play10:51

lot of data in um you know Medical

play10:54

Imaging x-rays or something but you

play10:57

might train a model on just Flickr

play10:59

images and then fine tune it on your

play11:01

medical images and it'll work better

play11:04

than if you only trained on the medical

play11:05

images

play11:07

people share pre-trained models

play11:10

thankfully there's a number of model

play11:12

hubs hugging face is the most popular

play11:16

it has 180 000 models and last time I

play11:19

gave this lecture if they only had like

play11:22

I think 90 000 models

play11:24

and they have 30 000 data sets and last

play11:27

time was like you know a year or two ago

play11:29

and they only had like five thousand

play11:31

data sets so growing very rapidly and

play11:33

they have models for anything you might

play11:35

want to do in machine learning

play11:38

and before

play11:40

um around you know 2020

play11:42

like that might mean that each type of

play11:45

model in the model Hub is its own neural

play11:48

network architecture like people would

play11:50

use convolutional neural networks for

play11:53

computer vision they would use recurrent

play11:55

neural networks for natural language

play11:56

processing they would have special

play11:58

thanks for reinforcement learning and so

play12:00

on

play12:02

but

play12:03

nowadays uh basically Transformer model

play12:07

is is all that's used for all kinds of

play12:09

machine learning tasks

play12:11

so the Transformer architecture came out

play12:14

of a paper called attention is all you

play12:16

need from 2017 and attention is all you

play12:18

need today you don't really need the

play12:21

Wi-Fi you know you should pay attention

play12:24

um

play12:25

and that they formulate an architecture

play12:28

that set state-of-the-art results on

play12:31

translation tasks that's kind of all

play12:33

they applied to but then other people

play12:35

quickly started applying the same

play12:36

architecture to like other NLP tasks be

play12:39

state-of-the-art on those and then

play12:41

vision and so on

play12:43

but it looks pretty complicated when you

play12:45

like see the whole diagram but it's

play12:47

actually just like two of the same thing

play12:49

like there's two halves to it that are

play12:51

basically the same so we're just going

play12:53

to look at one half of it

play12:55

um that's called the decoder

play12:57

so the overview of the Transformer

play12:59

decoder

play13:00

is

play13:01

in this x position let's say the task is

play13:04

to complete text just like GPT

play13:06

um GPT models are doing

play13:09

so if you see text like the ground truth

play13:11

text is like it's a blue sundress

play13:14

for whatever reason that's like text

play13:15

that the model is being trained on right

play13:17

now so you would see it's a blue and the

play13:19

task is to predict the word sundress

play13:23

the inputs down here on the bottom

play13:27

it's not going to be text it's going to

play13:29

be a sequence of tokens so like it's a

play13:33

blue

play13:34

and the output is going to be a

play13:36

probability distribution over the

play13:39

potential next token

play13:40

so the input is a sequence of vectors

play13:43

the output is a is a vector that's a

play13:45

probability distribution

play13:48

and to run inference which means like to

play13:51

get results out of this network

play13:54

what we're going to do is we're going to

play13:55

take the probability distribution sample

play13:58

an actual token from it then append it

play14:00

to the inputs so let's say we sampled

play14:02

the word you know it's a blue but we

play14:04

sampled the word house so now we have

play14:06

the input it's a blue house

play14:08

and then we're going to run that through

play14:10

the model again see the probability

play14:12

distribution over the next token sample

play14:14

it append it and so on

play14:17

and that's how that's how chat GPT is

play14:20

doing what it's doing that's it's seeing

play14:21

what you typed then it's sample to the

play14:23

next word samples appends it samples and

play14:25

that's word and so on

play14:27

so in more detail the inputs need to be

play14:30

vectors of numbers

play14:32

and so we have text how do we turn text

play14:35

into vectors of numbers

play14:37

so first we turn it into tokens

play14:40

this is the actual tokenization that GPT

play14:43

3 is doing

play14:46

so there's like a starter sequence token

play14:48

it apostrophe s a blue sundress and so

play14:52

on

play14:53

so each one of those tokens we'll talk

play14:56

about in a second or we'll talk about

play14:58

how this tokenization was found a little

play15:01

later but for now just like this is what

play15:03

it is and each one is actually an ID

play15:06

right in a vocabulary it's not a word

play15:08

it's it's just a number

play15:10

and furthermore it's not actually just a

play15:12

number it's actually a vector and you

play15:15

can represent a number as a vector with

play15:18

this thing called one hot encoding

play15:20

so like the number three you can

play15:22

represent by an all zero Vector that has

play15:25

one in the third position and zeros

play15:27

everywhere else

play15:29

and that could be the input to our

play15:31

Network that you know we we could just

play15:34

go with this

play15:35

but we're going to do something

play15:36

different a little bit different which

play15:38

is called embedding

play15:40

so the reason we're doing this is

play15:41

because one hot vectors are bad

play15:43

representations of words or tokens

play15:46

so like the word cat is going to have

play15:49

vocabulary ID you know 30 the word

play15:52

kitten is going to have vocabulary you

play15:53

know 32 or something but the distance

play15:56

between them is as large as the distance

play15:59

between you know the word cat and any

play16:02

other word in the vocabulary so there's

play16:03

no notion of similarity of any of any

play16:07

token

play16:08

and there's a simple solution to us

play16:10

which is we can learn an embedding

play16:12

Matrix which takes uh your one-hot

play16:15

vocabulary encoding and embeds it into

play16:19

something that is a dense vector of your

play16:21

choice or like of the dimensionalities

play16:24

of your choice

play16:26

so let's say if your vocabulary size is

play16:29

like 30 000 you can turn that into an

play16:32

embedding size of like 512 and all you

play16:35

have to do is just learn a matrix that's

play16:37

size 30 000 by 512 and this is like the

play16:41

simplest neural network layer type

play16:44

that's kind of all you need to

play16:45

understand is like we're turning words

play16:47

into dense uh embeddings

play16:51

we're going to send those embeddings

play16:52

into the model I'm going to skip

play16:53

positional encoding for now and go into

play16:56

this

play16:57

masked multi-head attention but we're going

play17:00

to ignore the words masked and

play17:02

multi-head for now we're just going to

play17:03

talk about attention

play17:05

so the key Insight of attention is as

play17:08

tokens come into the

play17:11

um the the the model remember the task

play17:15

is to predict the most likely next token

play17:18

we're seeing some previous tokens

play17:21

but they're not all equally important to

play17:23

what the next token should be right

play17:25

there's like

play17:27

there's some things that just very

play17:29

closely follow previous tokens and

play17:32

there's some things at the beginning of

play17:33

the sentence that don't even matter to

play17:34

like what you're going to predict next

play17:37

so this notion of attention was

play17:39

introduced in uh 2015 for translation

play17:42

tasks and in Translation let's say We're

play17:45

translating English to French

play17:47

and uh in the English you know it's the

play17:51

agreement on the European economic area

play17:53

was signed

play17:54

uh in August 1992. so the word sign like

play17:58

to to predict the word sign what do you

play18:01

actually need to know about the previous

play18:03

sequence which is in French

play18:05

you don't really care what was sign

play18:06

right but what you do care is like how

play18:09

do you say signs in French because We're

play18:11

translating French to English

play18:13

so in French there's um

play18:17

which just like it you know it's like

play18:20

the past tense of signed uh and in

play18:22

English it's just was signed but the

play18:25

word signé itself already is past

play18:27

tense so you don't actually need even

play18:29

the word was so you can kind of see why

play18:31

it was useful for translation but the

play18:34

idea is very general

play18:36

and the um formalization is like let's

play18:38

say you have a sequence of vectors X

play18:40

and you have an output sequence of

play18:43

vectors and each output Vector is going

play18:46

to be a weighted sum of the inputs and

play18:50

the weights

play18:51

are going to be just the dot products

play18:53

between the input vectors there's no

play18:55

learning at all right now we're just

play18:57

saying we have to produce some outputs

play18:59

all we have are inputs

play19:02

the each output is going to be a sum of

play19:04

the inputs but we're going to weight the

play19:06

sum by basically dot product which is

play19:09

kind of like similarity between the

play19:11

input vectors

play19:12

and to make it nice we're just going to

play19:14

make the the weight sum to one but it's

play19:17

not important

play19:19

so you know looking graphically this is

play19:21

a figure from Lucas Beyer's Transformers

play19:25

lecture

play19:26

uh

play19:28

you know we have input vectors and we're

play19:31

producing output let's say y sub I so

play19:36

what we're going to do is we're going to

play19:37

take the vector x sub i that you know the i-th

play19:40

part of the input and kind of like

play19:43

dot product it with all the other inputs

play19:46

and then the the value of the dot

play19:49

product is going to be our attention

play19:51

weight

play19:52

so now we have an attention weight a

play19:55

little vector and then we're going to

play19:56

apply that attention weight again to the

play19:58

inputs to this time sum them up and

play20:01

produce the output

play20:03

that's kind of all that's happening

play20:05

here's another view of this this is from

play20:07

Peter Bloem

play20:10

so we're producing output y sub 2. and

play20:13

what we're going to do is

play20:15

sum the weighted inputs so x sub 1 x sub

play20:20

2 x sub 3 x sub 4.

play20:22

and the weight for each one is going to

play20:24

be as as described and what we can

play20:27

notice is that every input

play20:29

is used in three different ways so it's

play20:31

used as a query so for like y sub 2 the

play20:35

vector x sub 2 is used as the query and

play20:38

it gets compared to the Keys which are

play20:40

all the other input vectors

play20:42

and then that produces the weight and

play20:45

then the weight is multiplied by the

play20:47

values and then summed up to produce the

play20:49

output so each input Vector plays three

play20:52

different roles in you know in the

play20:54

course of this attention mechanism as a

play20:56

query as a key to some other query and

play21:00

then as a value to be summed up to the

play21:02

output

play21:03

and that's fine and dandy but like why

play21:06

do we do this and also there's like no

play21:09

like it might help but it might not help

play21:10

there's no learning involved so far

play21:12

so what we're going to do is we're going

play21:15

to project the inputs into different

play21:17

roles project means you take a vector

play21:19

you multiply by a matrix now you have a

play21:21

different Vector that's like you can

play21:23

think of it as like being rotated or

play21:25

stretched or both in some space

play21:28

and so we're going to do is we take the

play21:30

input projected one way to be the query

play21:32

another way to be the key and a third

play21:35

way to be the value

play21:37

and uh graphically you know you have

play21:40

your inputs you might actually even

play21:41

change the the dimension of them right

play21:44

so it's like you might have

play21:46

four dimensional vectors coming in but

play21:48

the projection makes them eight

play21:49

dimensional in practice this isn't

play21:51

really done for uh like GPT style models

play21:54

but it could be done

play21:57

and the key thing here is like now we

play22:00

have three matrices that we can learn

play22:02

and once we learn them we've basically

play22:04

learned a good way to do attention

play22:06

what does it mean to be multi-head

play22:08

attention

play22:09

so we can learn

play22:11

simultaneously

play22:13

several different ways of transforming

play22:15

inputs into queries keys and values

play22:18

so here's like three-headed attention

play22:20

and we're showing the query Matrix so

play22:22

there's like three different ones that

play22:25

we can do simultaneously and

play22:29

when we actually implement it in the

play22:32

math it's just a single Matrix anyway so

play22:34

it's a

play22:36

it seems more scary than it is

play22:39

uh and then the last thing is masked why

play22:41

are we masking attention

play22:43

so I talked about inference but in

play22:46

training what we have is a sequence of

play22:48

tokens like it's a blue and then it's

play22:50

kind of blanked out

play22:52

and then we have the ground truth

play22:53

outputs which is like we know it's

play22:56

supposed to be at the blue sundress so

play22:58

we actually start a blue sundress and

play23:00

blanked out

play23:01

and

play23:03

um that's our ground truth outputs the

play23:04

actual outputs of the model are

play23:06

probability distributions over potential

play23:08

tokens

play23:10

so

play23:11

crucially to the thing to understand is

play23:13

all of the probability all of the

play23:15

outputs are computed at the same time so

play23:17

it's like I put in the sequence and I

play23:19

produce the potential outputs for every

play23:22

subsequence at the same time

play23:24

so like if I am predicting the word a I

play23:28

should only see the word it's if I'm

play23:30

predicting the word blue I should see

play23:31

it's a blue and then if I'm predicting

play23:34

where it's undress I should see it's a

play23:36

blue sundress

play23:38

and so that means that when I'm

play23:41

predicting the word sundress or when I'm

play23:42

predicting the word blue I shouldn't see

play23:44

future things I should only see the

play23:46

things that have already happened

play23:47

in the input so instead of the full self

play23:50

attention we have this mask

play23:51

self-attention which is limited to just

play23:54

the part of the input that's already

play23:55

been seen it's implemented by

play23:58

multiplying the attention weight matrix

play24:00

by just the mask Matrix

play24:02

and

play24:04

you know conceptually what's what's this

play24:06

doing

play24:06

so like a token comes in

play24:09

um it gets augmented in some way with

play24:11

like previously seen tokens that seem

play24:13

relevant

play24:14

so the previously seen means math is

play24:16

like the mask part that seem relevant

play24:18

that's the Learned attention part

play24:21

and then we do this in several ways

play24:23

simultaneously that's the multiple head

play24:25

part

play24:26

and the thing that's kind of

play24:28

counterintuitive is actually there's no

play24:30

notion of position so far there's a

play24:31

notion of what you've seen what you

play24:32

haven't seen but inside of what you have

play24:35

seen

play24:36

there's no ordering

play24:38

and so that's where the positional

play24:39

encoding comes in so if you look at

play24:41

these uh you know equations there's no

play24:43

order anywhere it's like you just have a

play24:46

bag of vectors and you're producing

play24:48

and you're just summing them up

play24:50

but if you see like something like this

play24:51

movie is great it's exactly the same as

play24:53

any other permutation of that

play24:57

so a trick to fix that is like

play25:00

we're gonna add special position

play25:02

encoding vectors to our embedding

play25:04

vectors

play25:06

and it seems like it shouldn't work but

play25:08

like it really is that simple there's

play25:10

some complication as to how you

play25:13

how do you like formulate these position

play25:15

coding factors but you can do it like

play25:17

very

play25:18

you don't have to do anything very

play25:20

complicated you could just like have a

play25:22

incrementing vector that you just add

play25:25

and the magic of attention figures out

play25:27

that it should pay attention to the

play25:29

position if it's relevant

play25:32

um then when stuff comes out of the

play25:35

attention we're gonna add it up and Norm

play25:37

it so the adding part is like you see

play25:39

all those arrows that go around the

play25:41

attention block

play25:42

so that is often called the skip

play25:45

connection or a residual connection and

play25:48

basically we want to like the output we

play25:51

want to not only go through the the

play25:52

module like the attention module but we

play25:54

also just want to add a little bit of

play25:55

the original input

play25:57

and the reason we do this is because

play25:59

when we backprop we're going to go

play26:02

through all of the arrows backwards and

play26:05

the fact that we can go around a layer

play26:07

is quite nice because we can propagate

play26:10

the loss all the way from the end of the

play26:12

model back to the first layer of the

play26:14

model

play26:15

and by the way this is possible because

play26:17

we're not changing the dimension of the

play26:19

output it's always it's all the same

play26:20

shape so it's like the input embedding

play26:22

determines the dimension of this whole

play26:24

Transformer model

play26:26

and then the layer Norm is like

play26:30

basically the motivation is neural Nets

play26:33

learn the best when everything is

play26:35

uniform has uniform mean and standard

play26:37

deviation

play26:38

but as you actually apply these matrices

play26:41

to to your inputs the means and standard

play26:43

deviations get blown out

play26:45

and so layer normalization is like I you

play26:48

know it's a hack where you basically

play26:50

take things and you just reset them back

play26:52

to a uniform mean and standard deviation

play26:55

and you do that between every operation

play26:58

it seems inelegant which is why I think

play27:01

it took people a while to start doing it

play27:03

but once you start doing it it's very

play27:04

effective

play27:05

then the feed forward layer is like that

play27:07

standard multi-layer perceptron that I

play27:09

showed you in the beginning with just

play27:10

one hidden layer

play27:12

and the conceptual view is like the

play27:13

token that's been augmented with other

play27:16

relevant tokens

play27:18

comes into the feed forward layer and

play27:20

it like upgrades its representation so

play27:23

that's like the best intuition I have

play27:24

about it like

play27:26

you know if you start out at word level

play27:28

then okay we're going to mix with other

play27:31

words we've seen now we're going to go

play27:32

into the feed forward layer and like

play27:35

upgrade to something more like thoughts

play27:38

or something like more semantic meaning

play27:39

than the nominal meaning of the words

play27:43

and then this whole thing gets repeated

play27:45

a number of times

play27:46

[Music]

play27:47

like in the gpt3 model for example it

play27:50

ranges from 12 layers to 96 layers of

play27:53

this Transformer layer there's also the

play27:56

embedding Dimension that you can change

play27:58

and then there's the number of attention

play28:00

heads in practice I think people scale

play28:02

I'm sorry people scale

play28:05

all these hyper parameters together so

play28:08

it's like if you're increasing the

play28:09

number layers you're also going to

play28:10

increase the dimension of the number of

play28:11

attention heads

play28:13

for gpt3 being famously 175 billion

play28:17

parameters 96 layers 12 000

play28:20

embedding Dimension 96 attention heads

play28:25

and another thing to to think about is

play28:28

like

play28:28

those 175 billion parameters how do how

play28:32

are they distributed between the types

play28:34

of of operations

play28:36

and if it's that large it's mostly the

play28:39

feed forward layer that takes up the

play28:41

weights but for a small Network like the

play28:44

gpt3 small a large part of the weights

play28:47

is also the embedding and the attention

play28:49

itself

play28:51

so why does this work so well so Andre

play28:54

has a great tweet that says the

play28:56

Transformers magnificent neural network

play28:58

architecture

play28:59

because it's a general purpose

play29:01

differentiable computer it is expressive

play29:04

in the forward pass it's optimizable via

play29:06

backprop and it's efficient because

play29:09

everything is happening in parallel

play29:13

there's some line of you know some lines

play29:15

of work try to figure out exactly how

play29:17

expressive the Transformer is a cool

play29:20

result is this rasp rasp paper which is

play29:23

basically a programming language that

play29:26

should be implementable inside of a

play29:27

transformer

play29:29

so they see like the the example on the

play29:32

on the right here

play29:33

is like a two layer Transformer Network

play29:36

that reverses strings

play29:38

and they wrote it as this programming

play29:41

language but it can actually compile

play29:43

down to like Transformer weights that'll

play29:45

execute that every time and there's the

play29:49

inverse problems like well given the

play29:51

weights can we decompile it to a program

play29:53

and the answer is no we don't know how

play29:55

to do that yet

play29:56

and we actually mostly just don't

play29:58

understand what the Transformer is doing

play30:01

some people are trying most notably

play30:02

anthropic and you should check out their

play30:05

grade blog posts if if you're interested

play30:07

so um like induction heads is an

play30:10

interesting result

play30:11

where like one thing they observed is as

play30:14

you add multiple layers of attention or

play30:16

sorry multiple heads of attention

play30:19

you can notice this thing where like you

play30:22

know like the the the model basically

play30:24

figures out how to use the second head

play30:26

and uh

play30:28

and

play30:29

um

play30:30

there's other interesting blog posts

So you might have a question: should I be able to code this up myself? I don't think it's necessary, especially if you're just building AI-powered products. But it's fun, it's not that difficult, and it's probably worth doing. The reason I say it's not that difficult is because of this beautiful man, who recorded a bunch of YouTube videos that really walk you through it. The final GPT-2 re-implementation is less than 400 lines of code, including his own attention block, his own MLP block, and so on. And there are more resources.

Now I want to get through some notable large language models.

Start with three easy pieces: there's BERT, there's T5, and there's GPT, and these cover the gamut of large Transformer models.

BERT was the first one to be popularized. It stands for Bidirectional Encoder Representations from Transformers, so it takes just the encoder part of the Transformer, which is the same as what we covered except that the attention is not masked. That means that in order to produce the output, the Transformer is allowed to look at the entire sequence, not just the sequence that precedes the output. It's large for the time, though not large by current standards, at around a hundred million parameters. What they did is take some corpus of text, mask out about 15 percent of all words at random, and then train it on the task of predicting the masked words correctly. This was great, and while it's dated now, at the time it was very useful and became a building block you could build other NLP applications on top of, as a first step.
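As an illustration of that masked-language-modeling objective, here is a minimal sketch (my own, not BERT's actual preprocessing, which works on subword IDs and has a few extra rules) of randomly masking roughly 15 percent of tokens so the model can be trained to predict them:

```python
import random

MASK_TOKEN = "[MASK]"  # illustrative; the real pipeline masks subword token IDs

def mask_tokens(tokens: list[str], mask_prob: float = 0.15, seed: int = 0):
    """Randomly replace ~15% of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            targets.append(tok)     # the model is trained to predict this token
        else:
            inputs.append(tok)
            targets.append(None)    # no loss on unmasked positions
    return inputs, targets

print(mask_tokens("the quick brown fox jumps over the lazy dog".split()))
```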

T5 took the Transformer architecture from the original 2017 paper and applied it to a somewhat new task: text-to-text transfer. That means both the input and the output are text strings, and the input string actually encodes the task to be done. If you look at the bottom here, it says "translate EN to DE", so English to German, "This is good", and the output would be "Das ist gut". Or the task might be "summarize", followed by a paragraph to summarize, and so on. The innovation is encoding the task in the actual input string and then essentially thinking of everything as translation, except you're not limited to translating between languages: you can translate input strings to output strings in all kinds of ways. They tested a bunch of architectures and found that the encoder-decoder was actually the best for them. It was large, at 11 billion parameters, and it's actually still a contender: there are more updated T5s released, and it's potentially a great choice for fine-tuning.
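To get a feel for the text-to-text format, here is a small usage sketch with the Hugging Face `transformers` library, assuming the publicly released `t5-small` checkpoint (requires `transformers`, `sentencepiece`, and `torch`; the exact generated text may vary):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is encoded directly in the input string.
prompts = [
    "translate English to German: This is good.",
    "summarize: The Transformer is a neural network architecture built around attention ...",
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```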

What it was trained on is something called the Colossal Clean Crawled Corpus, C4. They started with Common Crawl, a non-profit that just crawls the internet and makes it available, around 10 billion web pages, and filtered it down to roughly 160 billion tokens by discarding short pages, removing offensive words and pages, and, interestingly, removing anything that had code: if there was any code on a page, they would remove the whole page. Then they de-duplicated it, because they don't want to see the same data more than once. And then they fine-tuned it later on some academic supervised datasets for a bunch of NLP tasks.

GPT is the third easy piece: the Generative Pre-trained Transformer. This one is decoder-only, so where BERT was encoder-only, this one is decoder-only. It uses masked attention, and because it's predicting the next token, it's exactly what we covered. The largest GPT-2 model was 1.5 billion parameters, and it was trained not on Common Crawl, which they thought was just too noisy, but on their own dataset called WebText, where they scraped links from Reddit that had at least three karma, the idea being that those were probably useful links. Then they de-duplicated it, did some heuristic filtering, and ended up with around eight million documents.
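To show what "masked attention" means for a decoder-only model, here is a minimal causal-mask sketch (my own illustration, in PyTorch): each position may only attend to itself and earlier positions, which is what allows the model to be trained on next-token prediction.

```python
import torch

def causal_attention_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (sequence, d). Attention weights where position i only sees positions <= i."""
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5                        # (seq, seq) similarity scores
    seq_len = scores.shape[0]
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))  # hide future positions
    return torch.softmax(scores, dim=-1)

weights = causal_attention_weights(torch.randn(5, 16), torch.randn(5, 16))
print(weights)  # the upper triangle (future tokens) is all zeros
```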

I want to talk about the encoding: how does GPT tokenize? This is how it actually does it; there's a tokenizer demo on the OpenAI website. One thing you might notice is that some words map to a single token but some don't. A Unicode character is representable, but for some characters it takes a surprising number of tokens. And numbers are tokenized interestingly: at the bottom, "123" is its own token and then "45" is its own token.

This is a middle ground called byte pair encoding. Old-school tokenization would tokenize each word and throw out words that weren't frequent enough, replacing them with a special out-of-vocabulary token. The gold-standard tokenization would be to just use raw UTF-8 bytes, but empirically that wasn't found to work well. The middle ground is that you merge frequently occurring sequences and assign tokens to them, but you're able to fall back to bytes if you need to.
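You can poke at this byte-pair-encoding behavior yourself with OpenAI's `tiktoken` library (a small sketch; the comments describe typical behavior rather than exact token IDs):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE vocabulary used by GPT-2/GPT-3

for text in ["hello world", "antidisestablishmentarianism", "12345"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")
# Common words map to a single token, rare words split into several pieces,
# and numbers split into chunks such as '123' + '45'.
```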

GPT-3 came out in 2020, and it was just like GPT-2, the architecture is exactly the same, but it was 100 times larger. Because it was so much larger, it started exhibiting abilities like few-shot learning, which was not that surprising, but also zero-shot learning, where you could just describe the task and it would do a really good job at it. It seemed like it was just getting better and better the more parameters you added. It definitely is better the more examples you give it, but it's also pretty good with zero examples.
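The difference between the two settings is just in how the prompt string is formatted; here is a sketch in the style of the GPT-3 paper's translation example:

```python
# Zero-shot: just describe the task.
zero_shot = "Translate English to French: cheese ->"

# Few-shot (in-context learning): show a few input/output examples first,
# then leave the last one for the model to complete.
few_shot = (
    "Translate English to French:\n"
    "sea otter -> loutre de mer\n"
    "peppermint -> menthe poivrée\n"
    "cheese ->"
)
# Both strings are sent to the model as plain text-completion prompts;
# the larger the model, the better it tends to do on the zero-shot version.
```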

It was trained on the original WebText corpus, but also the raw Common Crawl filtered down, a selection of books from some sketchy sources, and all of Wikipedia. It's interesting to look at what the top pages in the WebText and Common Crawl datasets are: for WebText it's a bunch of news sites, like Huffington Post, New York Times, BBC, Twitter, The Guardian; for Common Crawl it's a lot of patents for some reason, then a bunch of news, but also some science papers.

In total that's about 500 billion tokens, but they only trained on 300 billion tokens, so the model didn't even see the whole corpus during training. That's another counter-intuitive thing about LLM training: the model basically sees each data point only once. That's not quite true because the data is sampled, but the mindset is that you see something once and get one shot to predict on it.

For GPT-4 we don't really know what's going on, because given both the competitive landscape and the safety implications, no further details were published about the architecture, the dataset construction, the training method, or anything like that. But it's safe to assume it's pretty large, because that's the trend: the more computation you use to train these AI systems, the better they get, and people keep training larger and larger ones.

That points to the bitter lesson of Rich Sutton, a reinforcement learning professor, which is basically that no matter how hard you try to come up with cool math and algorithms, you're going to get beaten by someone just stacking more layers. And that is bitter.

But we can still do some science: what exactly is the relationship between increasing the model size, the amount of compute, and the dataset size? Scientists at DeepMind set out to answer this with a paper called "Training Compute-Optimal Large Language Models", commonly known as Chinchilla, because that's the name of the model they eventually trained. They came up with formulas to answer the question: if I have a fixed compute budget, how should I distribute it? Should I add more parameters to my model, or should I train a smaller model on more data, or just go through the data more times?

What they found is that most LLMs in the literature had too many parameters for the amount of data they saw. To capitalize on this they trained the Chinchilla model, which is only 70 billion parameters, and showed that it actually beat the performance of a model four times its size called Gopher. It was four times smaller, but it saw about four times more data: it was trained on around 1.4 trillion tokens, whereas the other models were only trained on 300 billion. Why 300 billion? Because that was the GPT-3 paper, and everyone else just wanted to replicate GPT-3, so they also did 300 billion. I'm not sure.

But note that this is still not even going through all the data we have. You could keep training, having the model see the data over and over again. It might help; I think that's kind of an open question right now.
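A rough version of the Chinchilla recipe that people commonly quote is: train on about 20 tokens per parameter, with training compute approximated as C ≈ 6·N·D FLOPs for N parameters and D tokens. Here is a small sketch of that heuristic (the paper's fitted constants differ slightly, so treat this as an approximation):

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Given a budget C ~= 6 * N * D and the rule of thumb D ~= 20 * N,
    return roughly compute-optimal parameter and token counts."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a budget of ~5.8e23 FLOPs lands near Chinchilla's 70B params / 1.4T tokens.
n, d = chinchilla_optimal(5.8e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```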

LLaMA came out recently as an open-source, Chinchilla-optimal LLM from Meta research. They released several sizes, from 7 billion to 65 billion parameters, and all of them saw at least one trillion tokens, and it benchmarked competitively against GPT-3 and other state-of-the-art LLMs. It is open source, but the pre-trained weights are under a non-commercial license. What was it trained on? A custom Common Crawl filtering, C4, GitHub, Wikipedia, some books, and some scientific papers. This dataset was recently replicated by an effort called RedPajama, which is also training models to replicate LLaMA.

What's interesting here is GitHub: why is GitHub in there? Why would we include code in the training data? Remember, the T5 paper actually removed code from the training data, but now we're adding it back, around five percent of the total training data. I think the answer is just that empirically, people found that including code actually improves performance on non-code tasks. I think OpenAI found this with their Codex model, which is the first model where they trained on code. They actually started with GPT-3: they trained GPT-3, then fine-tuned it on code, and saw that it was good at code but also actually better at reasoning tasks than GPT-3 was. Since then people have been adding code. There's an open-source dataset called The Stack; it's basically all from GitHub, but they try to respect licenses. Check it out if you're interested.

Then there's another important part of this LLM story, which is instruction tuning. When GPT-3 was published, people's minds were blown just by few-shot learning: the fact that you can provide some examples of what you want and the model just kind of gets it and starts doing it. That is cool; it's also sometimes called in-context learning. But the mindset is really text completion: you're completing what I've already started. By now, and by the time of the ChatGPT release, the mindset is that things should be zero-shot: I shouldn't have to provide examples, I should just say what I want the model to do, and it should figure out how to do it. That's the instruction-following mindset.

The way we got from text completion to instruction following is supervised fine-tuning. If we want the model to do a good job on zero-shot tasks like this, then we need that kind of data in the dataset, but there's very little text on the internet in this form. So what we can do is gather our own dataset of zero-shot inputs paired with great outputs, fine-tune the pre-trained model on that dataset, and profit. That's exactly what OpenAI started doing: they hired thousands of contractors to gather this data.
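Concretely, an instruction-tuning dataset is just a collection of (instruction, ideal response) pairs, often stored as JSONL, used to continue training the pre-trained model with the ordinary next-token loss on the response. A minimal sketch (the file name, field names, and prompt template here are illustrative, not OpenAI's actual schema):

```python
import json

# Hand-written (instruction, response) pairs, the kind that contractors produce.
examples = [
    {"instruction": "Write a haiku about the ocean.",
     "response": "Waves fold into foam\nsalt wind carries the gull's cry\nthe tide keeps its time"},
    {"instruction": "Summarize: The Transformer uses attention instead of recurrence ...",
     "response": "The Transformer replaces recurrence with attention, enabling parallel training."},
]

with open("sft_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# During supervised fine-tuning, each example is rendered into one training string
# and the loss is computed on the response tokens.
def render(ex: dict) -> str:
    return f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['response']}"

print(render(examples[0]))
```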

They published a paper about doing that, and also about taking it further with reinforcement learning. I don't think we need the details, they're not very important, but basically once you train the model with this reinforcement learning from human feedback, it becomes much better at following instructions than the base GPT model, and they released that as text-davinci-002.

ChatGPT was further reinforcement-learning trained, not just on zero-shot tasks but on whole conversations, and it introduces the ChatML format, where you have user and assistant messages plus a special system message.
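In practice you usually pass a list of role-tagged messages to the chat API rather than writing the raw format yourself; under the hood the conversation is rendered into a single string. A sketch of that rendering (the special tokens follow OpenAI's published ChatML description; treat the exact formatting as an approximation):

```python
# The chat paradigm: a conversation is a list of role-tagged messages.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain masked attention in one sentence."},
    {"role": "assistant", "content": "Each token may only attend to earlier tokens."},
]

def to_chatml(messages: list[dict]) -> str:
    """Roughly how ChatML lays out a conversation as one string for the model."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"  # cue the next assistant turn

print(to_chatml(messages))
```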

It's interesting to think about the GPT lineage. GPT-3 came out in 2020; it's called davinci. Then OpenAI experimented with training it on code, which became the Codex models, like code-davinci-001. They also experimented with instruction tuning, which became instruct-davinci-beta and text-davinci-001. Then they kind of realized they really just need to see a lot of code even in pre-training, so they trained code-davinci-002 (this is all conjecture, by the way, it's not known for sure), which has both language modeling abilities and code generation abilities. Then you instruction-tune it, and then you can further fine-tune it either for standard GPT applications or specifically for chat, which is ChatGPT.

But fine-tuning is not free. It's really great, but it imposes what's called an alignment tax: the zero-shot ability increases, but the few-shot learning ability probably decreases, and the model's confidence in its answers also becomes less well calibrated. You can think of it like this: the base model, before fine-tuning, kind of knows what it knows and will complete text for you in the ways it knows how. Then you teach it to complete text in different ways, and because you're teaching it this different thing, it gets somewhat confused about what it actually knows.

Interestingly, it's possible to "steal" this fine-tuning. The LLaMA model that we saw was quickly fine-tuned by a Stanford team on a set of instructions, but they didn't pay contractors to collect them: they gave GPT-3 instructions, GPT-3 would carry them out, and they took those outputs as training examples for LLaMA. So it only cost them about $600 to reproduce. It's not as good at instruction following as GPT-3, but it's pretty good.

There's also a dataset for instruction tuning specifically in the chat paradigm, called Open Assistant.

There's one last idea I want to share, which is retrieval enhancement. This is a model called RETRO from DeepMind. The idea is that we have these large models because they have to learn a lot of facts about the world, and they also have to be good at reasoning and writing code and things like that. But can we train a smaller model that's only good at reasoning and writing code, and if it needs facts about the world, it just looks them up in some database? What they did is BERT-encode a bunch of sentences and store them in a roughly trillion-token database, and then train a small model that can fetch things from this database and keep them in its context. They haven't been able to get it to work as well as just large language models, but I think that's a matter of time; I think this approach points to the future of LLMs.
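A bare-bones sketch of the retrieval idea (not RETRO's actual architecture, which retrieves per-chunk inside the network with a trained encoder and an approximate nearest-neighbor index): embed a small corpus, find the nearest neighbors of the query embedding, and prepend them to the prompt.

```python
import numpy as np

# Toy "database" of facts; a real system stores billions of encoded chunks.
facts = [
    "The Transformer architecture was introduced in 2017.",
    "Chinchilla was trained on about 1.4 trillion tokens.",
    "GPT-3 has 175 billion parameters.",
]

vocab = sorted({w.lower().strip(".?,") for f in facts for w in f.split()})

def embed(text: str) -> np.ndarray:
    """Stand-in encoder: a normalized bag-of-words vector over the toy vocab."""
    words = [w.lower().strip(".?,") for w in text.split()]
    v = np.array([words.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

db = np.stack([embed(f) for f in facts])

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = db @ embed(query)                  # cosine similarity (unit-norm vectors)
    return [facts[i] for i in np.argsort(-scores)[:k]]

query = "How many parameters does GPT-3 have?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # a small language model would now answer using the retrieved fact
```

RETRO does this inside the network at every chunk of the sequence, but the prompt-level version above captures the basic idea of trading memorized facts for a lookup.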
