LLM Foundations (LLM Bootcamp)
Summary
TLDR: The transcript discusses the Transformer architecture and its significance in machine learning, highlighting its use in models like GPT, T5, and BERT. It explains the concept of attention mechanisms, the role of pre-training and fine-tuning, and the evolution of large language models (LLMs). The talk also touches on the challenges of machine learning, such as the complexity of natural language processing and the importance of data sets in training. The presenter provides insights into the future of AI, emphasizing the potential of models that combine reasoning with information retrieval.
Takeaways
- 🌟 The talk introduces the Transformer architecture and its significance in the field of machine learning, highlighting its adaptability across various tasks.
- 🤖 The distinction between traditional programming (Software 1.0) and the machine learning mindset (Software 2.0) is explained, emphasizing the shift from algorithmic to data-driven approaches.
- 📈 The three main types of machine learning are outlined: unsupervised learning, supervised learning, and reinforcement learning, each suited to different data structures and objectives.
- 🧠 The inspiration behind neural networks and deep learning is drawn from the brain's structure and function, with the perceptron model being a key building block.
- 🔢 Computers process inputs and outputs as numerical vectors or matrices, requiring text to be tokenized and converted into numerical representations.
- 🏋️♂️ The training process of neural networks involves backpropagation, where the loss function guides the adjustment of weights to improve predictions.
- 🔄 The importance of splitting data into training, validation, and test sets is emphasized for model evaluation and to prevent overfitting.
- 📚 The concept of pre-training and fine-tuning is introduced, where a large model is trained on general data and then further trained on specific tasks.
- 🌐 The rise of model hubs like Hugging Face demonstrates the growing accessibility and sharing of pre-trained models and datasets.
- 🔄 The Transformer model's architecture, including its encoder and decoder components, is explained, along with the concept of attention mechanisms.
- 🚀 The continuous growth and development of language models like GPT-3 and its successors are highlighted, showcasing the trend towards larger models with more parameters.
Q & A
What are the four key topics discussed in the transcript?
-The four key topics discussed are the Transformer architecture, notable large language models (LLMs) such as GPT, details of running a Transformer, and the foundations of machine learning.
What is the difference between software 1.0 and software 2.0 in the context of programming?
-Software 1.0 refers to traditional programming where a person writes code for a robot to take input and produce output based on algorithmic rules. Software 2.0, on the other hand, involves writing a robot that uses training data to produce another robot, which then takes input data and produces output. The second robot is not algorithmic but is driven by parameters learned from the training data.
What are the three main types of machine learning mentioned in the transcript?
-The three main types of machine learning are unsupervised learning, supervised learning, and reinforcement learning.
How do neural networks and deep learning relate to the brain?
-Neural networks and deep learning are inspired by the brain. They attempt to mimic the way neurons in the brain receive inputs, process them, and produce outputs. The concept of a perceptron, which is a model of a neuron, is used to create neural networks by stacking perceptrons in layers.
What is the significance of GPUs in the advancement of deep learning?
-GPUs, which were originally developed for graphics and video games, are highly efficient at performing matrix multiplications. Since neural networks and deep learning involve a large number of matrix multiplications, the application of GPUs has significantly accelerated the training and development of these models.
What is the role of the validation set in machine learning?
-The validation set is used to prevent overfitting in machine learning models. It allows developers to evaluate the model's performance on unseen data during training and adjust the model as needed, ensuring that the model generalizes well to new data.
How does the Transformer architecture handle the issue of input and output sequence lengths?
-The Transformer architecture uses a mechanism called attention, which allows the model to weigh the importance of different parts of the input sequence when predicting the next token in the output sequence. This mechanism enables the model to handle variable lengths of input and output sequences effectively.
What is the purpose of positional encoding in the Transformer model?
-Positional encoding is added to the input embeddings to provide the model with information about the order of the tokens within the sequence. Since the Transformer architecture does not inherently consider the order of tokens, positional encoding helps the model understand the sequence's structure and the relative positions of the tokens.
What is the concept of pre-training and fine-tuning in machine learning models?
-Pre-training involves training a large model on a vast amount of data to learn general representations. Fine-tuning then involves further training this pre-trained model on a smaller, more specific dataset to adapt the model for a particular task or domain.
How does the concept of instruction tuning enhance the capabilities of large language models?
-Instruction tuning involves fine-tuning a pre-trained model on a dataset of instructions and desired outputs. This process improves the model's ability to follow instructions and perform tasks in a zero-shot or few-shot context, without the need for additional examples or prompts.
What is the significance of including code in the training data for large language models?
-Including code in the training data has been found to improve the performance of large language models on non-code tasks. It enhances the model's understanding of logical structures and problem-solving, which can be beneficial for a wide range of applications.
Outlines
🌟 Introduction to Machine Learning and Transformers
The speaker begins by outlining the diverse audience and the topics to be covered, including the Transformer architecture and notable large language models (LLMs) like GPT. The foundational concepts of machine learning are briefly reviewed, highlighting the shift from software 1.0 to software 2.0, which involves training a model with data to produce outputs. The different types of machine learning are introduced: unsupervised, supervised, and reinforcement learning. The complexity of machine learning is discussed, emphasizing the challenge of interpreting varied inputs and the importance of understanding the structure of data. The limitations of traditional methods like logistic regression and support vector machines are noted, leading into a discussion on neural networks and deep learning as the dominant approach, inspired by the human brain's neural structure.
🧠 Deep Dive into Neural Networks and Training Methods
The paragraph delves into the concept of neural networks, drawing parallels with the human brain's neurons. The perceptron model and its evolution into the multi-layer perceptron are explained. The importance of weights in neural networks and the role of GPUs in accelerating matrix multiplications for deep learning are highlighted. The training process of neural networks is described, including the use of mini-batches, loss functions, backpropagation, and the division of data into training, validation, and test sets. The speaker also touches on pre-training and fine-tuning strategies, emphasizing the value of sharing pre-trained models and the rapid growth of model hubs like Hugging Face.
🚀 Understanding the Transformer Architecture
The Transformer architecture is introduced as a revolutionary approach to machine learning tasks. Its origins from the 'Attention is All You Need' paper and its application beyond translation tasks are discussed. The speaker simplifies the complex structure of the Transformer by focusing on the decoder part and explaining the process of text completion. The concept of tokenization and the conversion of text into numerical vectors are introduced, setting the stage for understanding how Transformers handle inputs and produce outputs.
🧬 The Mechanics of Attention Mechanism
The attention mechanism within the Transformer model is explored, highlighting its ability to weigh the importance of different tokens for predicting the next token. The concept of query, key, and value vectors is introduced, along with the mathematical formulation of attention as a weighted sum of inputs. The speaker explains the rationale behind attention, emphasizing its efficiency in focusing on relevant tokens. The paragraph also introduces the idea of multi-head attention, allowing the model to learn multiple ways of transforming inputs simultaneously, and the concept of masking attention to limit the model's view to previously seen inputs during training.
🌐 Positional Encoding and the Transformer's Parallel Processing
The importance of positional encoding in Transformers is discussed, addressing the lack of inherent order in the attention mechanism. The speaker explains how positional encoding vectors are added to embeddings to provide a sense of sequence. The concept of skip or residual connections is introduced, explaining how they facilitate the backward propagation of loss through the model. The role of layer normalization in maintaining uniformity during training is also covered. The paragraph concludes with an overview of the Transformer's feed-forward layer and the repetition of the entire process across multiple layers.
📈 Scaling Up: Parameters, Layers, and Attention Heads
The speaker discusses the scaling of Transformer models, focusing on the number of layers, embedding dimensions, and attention heads. The GPT-3 model is highlighted for its massive parameter count and layer structure. The distribution of parameters across different operations within the model is considered, and the efficiency of parallel processing in Transformers is praised. The speaker also mentions the work of Anthropic in trying to understand the inner workings of Transformers and the potential of these models as general-purpose differentiable computers.
📚 Notable Large Language Models and Their Impact
The speaker reviews notable large language models, starting with BERT, which introduced the concept of bidirectional encoding. The T5 model's text-to-text transfer approach and its training on a diverse corpus are discussed. GPT models are introduced as generative pre-trained Transformers, with a focus on GPT-2 and its training data. The speaker also touches on the encoding process used by GPT models and the release of GPT-3, which demonstrated abilities like few-shot and zero-shot learning. The training data for GPT-3 is described, including its sources and the unique aspects of its dataset.
🔍 Exploring the Chinchilla Model and Instruction Tuning
The speaker introduces the Chinchilla model, developed by DeepMind, which optimizes the distribution of compute resources between model size and data quantity. The model's performance is compared to larger models, demonstrating the effectiveness of training on more data. The open-source LLaMA model from Meta AI is mentioned, along with its training data. The inclusion of code in training data is discussed, along with the benefits it provides. The concept of instruction tuning is explored, explaining the shift from text completion to instruction following. The process of fine-tuning pre-trained models with human feedback is described, and the impact of fine-tuning on model capabilities is discussed.
🛠️ Enhancing Models with Retrieval and Future Directions
The speaker discusses the RETRO model from DeepMind, which combines a smaller model with a large database of knowledge for fact retrieval. The potential of this approach to enhance reasoning and coding abilities is highlighted, although the current limitations are acknowledged. The speaker reflects on the future of LLMs and the potential for models that can both learn from data and retrieve information as needed.
Keywords
💡Machine Learning
💡Transformer Architecture
💡GPT (Generative Pre-trained Transformer)
💡BERT (Bidirectional Encoder Representations from Transformers)
💡T5 (Text-to-Text Transfer Transformer)
💡Attention Mechanism
💡Embedding
💡Fine-Tuning
💡Zero-Shot Learning
💡Instruction Tuning
💡Retrieval Enhancing
Highlights
Introduction to machine learning and the Transformer architecture, including its applications and variations.
Explanation of the differences between software 1.0 and software 2.0, highlighting the shift from traditional programming to machine learning mindset.
Overview of various types of machine learning: unsupervised learning, supervised learning, and reinforcement learning.
Discussion on the challenges of machine learning, such as the infinite variety of inputs and the complexity of the natural world.
Description of neural networks and deep learning, drawing inspiration from the brain's structure and function.
Explanation of how neural networks are stored and processed as vectors and matrices of numbers.
The role of GPUs in accelerating matrix multiplications, which is crucial for neural network computations.
Training neural networks using batch data, loss functions, and backpropagation to adjust parameters.
The importance of splitting data into training, validation, and test sets to avoid overfitting and ensure model robustness.
Pre-training and fine-tuning strategies for machine learning models, leveraging large datasets and model hubs like Hugging Face.
The Transformer architecture's dominance in machine learning tasks, originating from the 'Attention is All You Need' paper.
Detailed breakdown of the Transformer decoder, including tokenization, embedding, and the attention mechanism.
The concept of multi-head attention and its ability to learn multiple ways of transforming inputs simultaneously.
The use of positional encoding to incorporate the order of tokens within the Transformer model.
The addition of feed-forward layers and layer normalization in the Transformer architecture for further representation enhancement.
Scaling up the model size, embedding dimension, and number of attention heads in GPT-3 to achieve better performance.
Overview of notable large language models like BERT, T5, and GPT, and their contributions to NLP and AI.
Instruction tuning and its impact on improving model performance for zero-shot and few-shot learning tasks.
The potential future of AI with retrieval-enhancing models like RETRO, combining smaller models with external knowledge databases.
Transcripts
all right I'm gonna get started I'm
going to talk about
four things
lunch is in an hour so
um
I'm gonna speed run some key ideas and
just machine learning we have a diverse
audience we have machine learning
experts and research scientists we also
have
Executives and investors that have never
um you know trained a logistic
regression or anything like that
we're going to talk about the
Transformer architecture
we're going to talk about notable llms
you might have heard of
you know a little thing called GPT
there's also other ones like T5 BERT
Chinchilla and so on that you should
probably know
and we'll talk about some just details
of running a Transformer
so the foundations of machine learning
this I expect this to be
not needed for most of you
I still think it's worth sharing because
for some of you it is needed and also
just to get on the same page about what
what's happening
you know software 1.0 in Andrej Karpathy's
terminology
is traditional programming right where a
person writes basically a robot that
then takes input and produces output and
the robot is entirely algorithmic so you
you know the the person that has to
specify all possible edge cases for the
input
and have robust tests with output and so
on
with software 2.0 mindset which is the
machine learning mindset
the person
writes a robot that then takes a bunch
of training data and produces another
robot that that's going to take input
data and produce an output
and you can't really test that second
one because it's not that second robot
is not algorithmic it's it's now driven
by a bunch of parameters that you don't
really have much visibility into you
only have visibility into your training
system so that really changes the
mindset of what what's actually
happening what type of machine learnings
are there
there's unsupervised learning it's like
generative AI used to find structure in
the data generate more data
there's supervised learning where you get
some data as input then you produce
something that looks a little different
as as output usually a label for that
input data
or maybe a prediction about what's going
to come next
and then you have reinforcement learning
which you have agents that act in an
environment they collect rewards learn
to act and these have traditionally been
pretty separate but they've mostly
converged on just
really is just supervised learning
sometimes called self-supervised
learning where you can formulate
everything it's just a supervised
problem so if you're doing generative
problems you can formulate it as this
first bit of data
is labeled with the continuation of the
data so that's kind of supervised
formulation
and then reinforcement learning you can
formulate as given the state of the
world what is the next move that that
would collect the most rewards you don't
have to do anything special you can just
treat it as supervised learning
to a computer inputs and outputs are
always just numbers
and that's kind of important to remember
because we might see something like this
I mean it really isn't a photograph but
we definitely see it as a photograph of
of Abe Lincoln
and then we read output and it's really
just a bunch of letters but we have
meaning to the word Lincoln a machine
doesn't have any of that right they just
see a bunch of numbers and then they see
some other numbers that are token
vocabulary IDs
so everything is just a vector or Matrix
of numbers to a machine learning
computer
and we ask it to predict things about
the natural world why is that hard well
there's an infinite variety of inputs
that all can mean the same thing so
let's say people are talking about a
movie they watched someone might say you
know I love the movie that's pretty easy
to interpret they use the word love and
movie but then someone else might say
you know as good as The Godfather
now you need to know like is the
Godfather another movie would this
person consider it to be a good movie
and so on or you know someone might just
say something unintelligible to you
um but it might mean the same thing and
the computer has to learn all of that
and uh meaningful differences can be
very tiny so like you know these muffins
really do look like Chihuahuas but
very different in function
and the structure of the world is
complex even if you have a face
recognizer
and you trained it on well lit faces but
now you're in this dramatic lighting
setting and you only see half the face
it might totally fail
and it's it's because there's physical
structure to to how the image is formed
that um
just makes everything way more complex
than it seems like it should be
but how is it done so there's many
methods for machine learning through the
years the simplest one is called
logistic regression probably learned
that in some college course uh in the
90s people used support vector machines
there's decision trees that people do
for tabular data a lot like the XGBoost
Library
but one approach to machine learning
really is dominant nowadays and that's
neural networks and it's also called
Deep learning
um
and the inspiration for that comes from
the the brain which is like the one
thing that we know to be really
intelligent in the world
people started working on this in the
40s and the 50s they took a look at what
the brain is doing it's composed of
a bunch of neurons
um and neuron receives electrical inputs
on one end and if there's enough input
to the neuron then it fires and sends
its output down to other neurons
and the Brain itself has inputs like
through my vision to my senses and
outputs like me speaking or moving
and the formalization that people came
up with is that they decided to
represent the neuron as this model
called the perceptron
which has inputs coming in which is just
numbers and the numbers get multiplied
by some weights if the sum of the
weighted inputs is passed some kind of
threshold then the neuron you know the
perceptron fires in a sense that's
modeled with a step function
mathematically
and then to create a brain you take a
bunch of perceptrons
and you stack them in layers and you
connect you know every every perceptron
on one layer to every perceptron on the
next layer and that's called the
multi-layer perceptron and it's kind of
like a brain
um now how is it stored then in a
machine it's just a vector of numbers
right so like what's important about the
perceptron are the weights and that's
just what are the things you're
multiplying inputs by
and then a layer of perceptrons is then
a matrix of numbers
and the whole neural network is a set of
matrices of numbers
and we call these parameters so
parameters of the neural network are
just all the perceptron weights inside
of that Network
and or sometimes you call them weights
as well
and all the neural network operations
are just Matrix multiplications for this
reason
and one thing people figured out is that
gpus which were developed for graphics
like video games are really fast at
Matrix multiplications neural networks
are just doing Matrix multiplications so
that kicked off the deep learning
Revolution when people applied gpus to
actually running neural networks
how do you train in neural network so
let's say you have a big X data so maybe
it's images maybe it's text and you have
labels why so like labels like cat and
not a dog
so you take a little batch
sometimes called a mini batch of data
Little X
you use your current model so you use
your current neural network which starts
out with just random weights to run the
X kind of through the network and out
comes a prediction so let's call that y
Prime we'll compute what's known as a
loss function on the ground truth label
Y and the prediction y Prime
the most uh you know prevalent loss is
the cross entropy loss which sounds
complicated but it really is just this
function you just multiply the ground
Truth by the log of the predictions and
you sum it up
and then that gives you
um some numbers that you then what's
known as backpropagate through all the
layers of the model this is too
complicated to go into
but basically you just think of the
network pushing a prediction if the
prediction is correct it gets signal
that the parameters are great if the
prediction is not correct it gets
signaled that the parameters need to be
adjusted in the direction of making the
predictions more correct
and then you repeat that until your loss
stops decreasing essentially
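for concreteness here is a minimal sketch of that loop in PyTorch (mini-batch, forward pass, cross-entropy loss, backpropagation, weight update); the tiny model, random data, and hyperparameters are made up purely for illustration

```python
import torch
import torch.nn as nn

# Toy data: 256 examples of 20-dimensional inputs X with binary labels y (cat / not-cat).
X = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))

# A tiny multi-layer perceptron standing in for "the current model".
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()                 # ground truth times log of prediction, summed
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    # Take a mini-batch of data.
    idx = torch.randint(0, 256, (32,))
    x_batch, y_batch = X[idx], y[idx]

    y_prime = model(x_batch)                    # forward pass: prediction y'
    loss = loss_fn(y_prime, y_batch)            # compare prediction to ground truth y

    optimizer.zero_grad()
    loss.backward()                             # backpropagate the loss through every layer
    optimizer.step()                            # nudge the weights toward more correct predictions
```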
and in order to actually do machine
learning you always split your data into
a training set a validation set and a
test set the training set is the largest
why do you need the validation set
because you might overfit so if you
train too much the parameter might you
know the network might be really good on
just your training data but actually it
becomes
worse than it used to be on the
validation data so we kind of look at
the validation loss and when that stops
improving that's when we stop training
or maybe we set some
hyper parameters about the model such as
how many layers does it have how many uh
or like what activation function does it
have
and the test set should really be left
alone as much as possible and it's
really for measuring
how your train model is going to work in
production so it's like you shouldn't be
looking at the test set you should only
look at it basically once when you're
done
and all of this applies to your
experimentation with prompts too so if
you're not doing traditional machine
learning it's not like you have to
forget about the validation set uh you
really should be having this mindset
even if you're not doing like if you
don't have a loss function but you're
just kind of looking at some prompts
and you're trying to figure out which
one's better there's still that notion
of a validation set and a potentially a
test set
some some more terminology uh
pre-training you might hear so that just
basically means training like training a
large model on a lot of data
and the reason it's called pre-training
is because oftentimes you would take
that large model
and then train it a little bit more with
less data and that's called fine tuning
and the reason you do that is because
maybe you have a lot of label data in
just general internet imagery like
Flickr-style images but you don't have a
lot of data in um you know medical
imaging x-rays or something but you
might train a model on just Flickr
images and then fine tune it on your
medical images and it'll work better
than if you only trained on the medical
images
people share pre-trained models
thankfully there's a number of model
hubs hugging face is the most popular
it has 180 000 models and last time I
gave this lecture they only had like
I think 90 000 models
and they have 30 000 data sets and last
time was like you know a year or two ago
and they only had like five thousand
data sets so growing very rapidly and
they have models for anything you might
want to do in machine learning
and before
um around you know 2020
like that might mean that each type of
model in the model Hub is its own neural
network architecture like people would
use convolutional neural networks for
computer vision they would use recurrent
neural networks for natural language
processing they would have special
things for reinforcement learning and so
on
but
nowadays uh basically Transformer model
is is all that's used for all kinds of
machine learning tasks
so the Transformer architecture came out
of a paper called attention is all you
need from 2017 and attention is all you
need today you don't really need the
Wi-Fi you know you should pay attention
um
and that they formulate an architecture
that set state-of-the-art results on
translation tasks that's kind of all
they applied to but then other people
quickly started applying the same
architecture to like other NLP tasks be
state-of-the-art on those and then
vision and so on
but it looks pretty complicated when you
like see the whole diagram but it's
actually just like two of the same thing
like there's two halves to it that are
basically the same so we're just going
to look at one half of it
um that's called the decoder
so the overview of the Transformer
decoder
is
in this exposition let's say the task is
to complete text just like
um GPT models are doing
so if you see text like the ground truth
text is like it's a blue sundress
for whatever reason that's like text
that the model is being trained on right
now so you would see it's a blue and the
task is to predict the word sundress
the inputs down here on the bottom
it's not going to be text it's going to
be a sequence of tokens so like it's a
blue
and the output is going to be a
probability distribution over the
potential next token
so the input is a sequence of vectors
the output is a is a vector that's a
probability distribution
and to run inference which means like to
get results out of this network
what we're going to do is we're going to
take the probability distribution sample
an actual token from it then append it
to the inputs so let's say we sampled
the word you know it's a blue but we
sampled the word house so now we have
the input it's a blue house
and then we're going to run that through
the model again see the probability
distribution over the next token sample
it append it and so on
and that's how that's how chat GPT is
doing what it's doing that's it's seeing
what you typed then it samples the
next word appends it samples the
next word and so on
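a rough sketch of that sampling loop, assuming a hypothetical `model` that returns next-token logits for the token sequence so far

```python
import torch

def generate(model, tokens, n_new_tokens, temperature=1.0):
    """Autoregressive sampling: predict a distribution over the next token,
    sample one token, append it, and repeat. `model` is a hypothetical stand-in
    that maps a 1-D tensor of token ids to logits of shape [seq_len, vocab_size]."""
    for _ in range(n_new_tokens):
        logits = model(tokens)                                 # one forward pass over the sequence so far
        probs = torch.softmax(logits[-1] / temperature, dim=-1)  # distribution over the next token
        next_token = torch.multinomial(probs, num_samples=1)   # sample an actual token from it
        tokens = torch.cat([tokens, next_token])               # append it and go around again
    return tokens
```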
so in more detail the inputs need to be
vectors of numbers
and so we have text how do we turn text
into vectors of numbers
so first we turn it into tokens
this is the actual tokenization that GPT
3 is doing
so there's like a start of sequence token
it apostrophe s a blue sun dress and so
on
so each one of those tokens we'll talk
about in a second or we'll talk about
how this tokenization was found a little
later but for now just like this is what
it is and each one is actually an ID
right in a vocabulary it's not a word
it's it's just a number
and furthermore it's not actually just a
number it's actually a vector and you
can represent a number as a vector with
this thing called one hot encoding
so like the number three you can
represent by an all zero Vector that has
one in the third position and zeros
everywhere else
and that could be the input to our
Network that you know we we could just
go with this
but we're going to do something
different a little bit different which
is called embedding
so the reason we're doing this is
because one hot vectors are bad
representations of words or tokens
so like the word cat is going to have
vocabulary ID you know 30 the word
kitten is going to have vocabulary you
know 32 or something but the distance
between them is as large as the distance
between you know the word cat and any
other word in the vocabulary so there's
no notion of similarity of any of any
token
and there's a simple solution to us
which is we can learn an embedding
matrix which takes uh your one-hot
vocabulary encoding and embeds it into
something that is a dense vector of your
choice or like of the dimensionalities
of your choice
so let's say if your vocabulary size is
like 30 000 you can turn that into an
embedding size of like 512 and all you
have to do is just learn a matrix that's
size 30 000 by 512 and this is like the
simplest neural network layer type
that's kind of all you need to
understand is like we're turning words
into dense embeddings
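roughly, in PyTorch terms (using the example sizes above), an embedding lookup is the same thing as multiplying a one-hot vector by a learned matrix, just cheaper

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 30_000, 512

# One-hot: token id 3 becomes a 30,000-dimensional vector with a single 1 in position 3.
token_id = torch.tensor([3])
one_hot = torch.nn.functional.one_hot(token_id, num_classes=vocab_size).float()

# Learned embedding: a 30,000 x 512 matrix; looking up a row gives a dense 512-dim vector.
embedding = nn.Embedding(vocab_size, embed_dim)
dense = embedding(token_id)                  # shape: [1, 512]

# Same result written explicitly as a matrix multiply against the one-hot vector.
same = one_hot @ embedding.weight            # shape: [1, 512]
print(torch.allclose(dense, same))           # True
```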
we're going to send those embeddings
into the model I'm going to skip
positional encoding for now and go into
this
masked multi-head attention but we're going
to ignore the words masked and
multi-head for now we're just going to
talk about attention
so the key Insight of attention is as
tokens come into the
um the the the model remember the task
is to predict the most likely next token
we're seeing some previous tokens
but they're not all equally important to
what the next token should be right
there's like
there's some things that just very
closely follow previous tokens and
there's some things at the beginning of
the sentence that don't even matter to
like what you're going to predict next
so this notion of attention was
introduced in uh 2015 for translation
tasks and in Translation let's say We're
translating English to French
and uh in the English you know it's the
agreement on the European economic area
was signed
uh in August 1992. so the word signed like
to predict the word signed what do you
actually need to know about the previous
sequence which is in English
you don't really care what was signed
right but what you do care is like how
do you say signed in French because we're
translating English to French
so in French there's a word
which you know it's like
the past tense of signed uh and in
English it's just was signed but the
French word itself already is past
tense so you don't actually need even
the word was so you can kind of see why
it was useful for translation but the
idea is very general
and the um formalization is like let's
say you have a sequence of vectors X
and you have an output sequence of
vectors and each output vector is going
to be a weighted sum of the inputs and
the weights
are going to be just the dot products
between the input vectors there's no
learning at all right now we're just
saying we have to produce some outputs
all we have are inputs
the each output is going to be a sum of
the inputs but we're going to weight the
sum by basically dot product which is
kind of like similarity between the
input vectors
and to make it nice we're just going to
make the weights sum to one but it's
not important
so you know looking graphically this is
a figure from Lucas Beyer's Transformers
lecture
uh
you know we have input vectors and we're
producing output let's say y sub I so
what we're going to do is we're going to
take the vector x sub i that you know the i-th
part of the input and kind of like
dot product it with all the other inputs
and then the the value of the dot
product is going to be our attention
weight
so now we have an attention weight a
little vector and then we're going to
apply that attention weight again to the
inputs to this time sum them up and
produce the output
that's kind of all that's happening
here's another view of this this is from
Peter Bloem
so we're producing output y sub 2 and
what we're going to do is
sum the weighted inputs so x sub 1 x sub
2 x sub 3 x sub 4
and the weight for each one is going to
be as as described and what we can
notice is that every input
is used in three different ways so it's
used as a query so for like y sub 2 the
vector x sub 2 is used as the query and
it gets compared to the Keys which are
all the other input vectors
and then that produces the weight and
then the weight is multiplied by the
values and then summed up to produce the
output so each input Vector plays three
different roles in you know in the
course of this attention mechanism as a
query as a key to some other query and
then as a value to be summed up to the
output
and that's fine and dandy but like why
do we do this and also there's like no
like it might help but it might not help
there's no learning involved so far
so what we're going to do is we're going
to project the inputs into different
roles project means you take a vector
you multiply by a matrix now you have a
different Vector that's like you can
think of it as like being rotated or
stretched or both in some space
and so we're going to do is we take the
input projected one way to be the query
another way to be the key and a third
way to be the value
and uh graphically you know you have
your inputs you might actually even
change the the dimension of them right
so it's like you might have
four dimensional vectors coming in but
the projection makes them eight
dimensional in practice this isn't
really done for uh like GPT style models
but it could be done
and the key thing here is like now we
have three matrices that we can learn
and once we learn them we've basically
learned a good way to do attention
what does it mean to be multi-head
attention
so we can learn
simultaneously
several different ways of transforming
inputs into queries keys and values
so here's like three-headed attention
and we're showing the query Matrix so
there's like three different ones that
we can do simultaneously and
when we actually implement it in the
math it's just a single Matrix anyway so
it's a
it seems more scary than it is
uh and then the last thing is masked why
are we masking attention
so I talked about inference but in
training what we have is a sequence of
tokens like it's a blue and then it's
kind of blanked out
and then we have the ground truth
outputs which is like we know it's
supposed to be a blue sundress so
we actually start a blue sundress and
blanked out
and
um that's our ground truth outputs the
actual outputs of the model are
probability distributions over potential
tokens
so
crucially to the thing to understand is
all of the probability all of the
outputs are computed at the same time so
it's like I put in the sequence and I
produce the potential outputs for every
subsequence at the same time
so like if I am predicting the word a I
should only see the word it's if I'm
predicting the word blue I should see
it's a blue and then if I'm predicting
where it's undress I should see it's a
blue sundress
and so that means that when I'm
predicting the word sundress or when I'm
predicting the word blue I shouldn't see
future things I should only see the
things that have already happened
in the input so instead of the full self
attention we have this masked
self-attention which is limited to just
the part of the input that's already
been seen it's implemented by
multiplying the attention weight matrix
by just the mask matrix
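a sketch of that masking step: future positions get zero attention weight by setting their scores to negative infinity before the softmax

```python
import torch

T = 4                                         # sequence length
scores = torch.randn(T, T)                    # raw attention scores (query position x key position)

# Causal mask: position i is only allowed to look at positions <= i.
mask = torch.tril(torch.ones(T, T)).bool()
scores = scores.masked_fill(~mask, float("-inf"))   # blocked entries become zero weight after softmax
weights = torch.softmax(scores, dim=-1)
```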
and
you know conceptually what's what's this
doing
so like a token comes in
um it gets augmented in some way with
like previously seen tokens that seem
relevant
so the previously seen means masked that's
like the mask part and seem relevant
that's the learned attention part
and then we do this in several ways
simultaneously that's the multiple head
part
and the thing that's kind of
counterintuitive is actually there's no
notion of position so far there's a
notion of what you've seen what you
haven't seen but inside of what you have
seen
there's no ordering
and so that's where the positional
encoding comes in so if you look at
these uh you know equations there's no
order anywhere it's like you just have a
bag of vectors and you're producing
and you're just summing them up
but if you see like something like this
movie is great it's exactly the same as
any other permutation of that
so a trick to fix that is like
we're gonna add special position
encoding vectors to our embedding
vectors
and it seems like it shouldn't work but
like it really is that simple there's
some complication as to how you
how do you like formulate these position
encoding vectors but you can do it like
very
you don't have to do anything very
complicated you could just like have a
incrementing vector that you just add
and the magic of attention figures out
that it should pay attention to the
position if it's relevant
um then when stuff comes out of the
attention we're gonna add it up and Norm
it so the adding part is like you see
all those arrows that go around the
attention block
so that is often called the skip
connection or a residual connection and
basically we want to like the output we
want to not only go through the the
module like the attention module but we
also just want to add a little bit of
the original input
and the reason we do this is because
when we backprop we're going to go
through all of the arrows backwards and
the fact that we can go around a layer
is quite nice because we can propagate
the loss all the way from the end of the
model back to the first layer of the
model
and by the way this is possible because
we're not changing the dimension of the
output it's always it's all the same
shape so it's like the input embedding
determines the dimension of this whole
Transformer model
and then the layer Norm is like
basically the motivation is neural Nets
learn the best when everything is
uniform has uniform mean and standard
deviation
but as you actually apply these matrices
to to your inputs the means and standard
deviations get blown out
and so layer normalization is like you
know it's a hack where you basically
take things and you just reset them back
to a uniform mean and standard deviation
and you do that between every operation
it seems inelegant which is why I think
it took people a while to start doing it
but once you start doing it it's very
effective
then the feed forward layer is like that
standard multi-layer perceptron that I
showed you in the beginning with just
one hidden layer
and the conceptual view is like the
token that's been augmented with other
relevant tokens
comes into the feed forward layer and
it like upgrades its representation so
that's like the best intuition I have
about it like
you know if you start out at word level
then okay we're going to mix with other
words we've seen now we're going to go
into the feed forward layer and like
upgrade to something more like thoughts
or something like more semantic meaning
than the nominal meaning of the words
and then this whole thing gets repeated
a number of times
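putting the pieces together, a rough sketch of one decoder layer and the stack of repeated layers, using PyTorch's built-in multi-head attention rather than a from-scratch implementation

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Transformer decoder layer: masked multi-head self-attention, then a
    feed-forward network, each wrapped in a residual connection and layer norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # mask out future positions
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + attn_out)          # add & norm around attention
        x = self.norm2(x + self.ff(x))        # add & norm around the feed-forward layer
        return x

# The whole block gets repeated a number of times (12 to 96 layers in the GPT-3 family).
layers = nn.Sequential(*[DecoderBlock() for _ in range(12)])
y = layers(torch.randn(2, 16, 512))           # batch of 2 sequences of 16 token embeddings
```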
like in the gpt3 model for example it
ranges from 12 layers to 96 layers of
this Transformer layer there's also the
embedding Dimension that you can change
and then there's the number of attention
heads in practice I think people scale
I'm sorry people scale
all these hyper parameters together so
it's like if you're increasing the
number layers you're also going to
increase the dimension of the number of
attention heads
for gpt3 being famously 175 billion
parameters 96 layers 12 000
embedding Dimension 96 attention heads
and another thing to to think about is
like
those 175 billion parameters how do how
are they distributed between the types
of of operations
and if it's that large it's mostly the
feed forward layer that takes up the
weights but for a small Network like the
gpt3 small a large part of the weights
is also the embedding and the attention
itself
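a back-of-the-envelope check of that claim, using GPT-3's commonly reported shape (96 layers, embedding dimension 12,288, roughly 50k vocabulary) and ignoring biases and layer-norm parameters

```python
# Approximate parameter count for GPT-3; all numbers are the published shape, not exact weights.
d_model, n_layers, vocab = 12288, 96, 50257

attn_per_layer = 4 * d_model**2               # Q, K, V and output projections
ffn_per_layer = 2 * d_model * (4 * d_model)   # two linear layers with a 4x hidden width
embedding = vocab * d_model

total = n_layers * (attn_per_layer + ffn_per_layer) + embedding
print(f"{total / 1e9:.0f}B parameters")       # ~175B; the feed-forward layers dominate at this scale
```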
so why does this work so well so Andrej
has a great tweet that says the
Transformers magnificent neural network
architecture
because it's a general purpose
differentiable computer it is expressive
in the forward pass it's optimizable via
backprop and it's efficient because
everything is happening in parallel
there's some line of you know some lines
of work try to figure out exactly how
expressive the Transformer is a cool
result is this RASP paper which is
basically a programming language that
should be implementable inside of a
transformer
so they see like the the example on the
on the right here
is like a two layer Transformer Network
that reverses strings
and they wrote it as this programming
language but it can actually compile
down to like Transformer weights that'll
execute that every time and there's the
inverse problems like well given the
weights can we decompile it to a program
and the answer is no we don't know how
to do that yet
and we actually mostly just don't
understand what the Transformer is doing
some people are trying most notably
anthropic and you should check out their
grade blog posts if if you're interested
so um like induction heads is an
interesting result
where like one thing they observed is as
you add multiple layers of attention or
sorry multiple heads of attention
you can notice this thing where like you
know like the the the model basically
figures out how to use the second head
and uh
and
um
there's other interesting blog posts
so you know you might have a question
like okay should I be able to code this
up
I don't think it's necessary right
especially if you're just building kind
of AI Power Products
but it's fun it's not that difficult
it's probably worth doing the reason I
say it's not that difficult is because
of um this beautiful man who uh recorded
a bunch of YouTube videos that really
walk you through it and the final you
know gpt2 re-implementation is like less
than 400 lines of code including
his own attention block his own MLP
block and so on
and there's more resources I want to get
through some notable large language
models
so start with three easy pieces
there's Bert
there's T5
and there's GPT
and these kind of cover the gamut of
large Transformer models Bert was the
first one to be popularized
it stands for bidirectional
encoder representations from Transformers
so this is actually taking just the
encoder part of the Transformer which is
the same as what we covered except the
attention is not masked
um which means that in order to produce
the output
the Transformer is actually allowed to
look at the entire sequence not just the
sequence that precedes the the output
uh it's you know large for the times but
not large by current standards at about a
hundred million parameters
and what they did is they took some
corpus of text they masked out about 15%
of all words just randomly
and then they trained it on this uh task
of like
you know predict the masked words
correctly
and this was great and you know now it's
dated but at the time it was very useful
and kind of became a building block that
you could build other NLP applications
on top of
as the first step
T5 is the first or uh you know T5
took that Transformer architecture from
the original 2017 paper
and applied it to a somewhat new task
which is
uh the text to text transfer
so that means that both the input and
the output are text strings and the text
string actually encodes the task to be
done in in the string so if you look at
the bottom here it says like translate
EN to DE so English to German
that is good and then the output would
be like das ist gut or the task might
be summarize and then a paragraph to
summarize
and so on so this innovation of just
like encoding the task in the actual
input string and then just thinking of
everything as just translation
essentially but you don't have to be
limited to translating languages you can
just translate input strings to Output
strings in all kinds of ways and they
tested a bunch of architectures they
found that the encoder decoder
actually was the best for them it was
large 11 billion parameters and it's
actually still a contender like there's
uh you know more updated t5s released
and it's a great choice for fine tuning
potentially
what it was trained on is something
called the Colossal Clean Crawled Corpus
C4
they started with common crawl which is
a like a non-profit that just crawls the
internet makes it available
a 10 billion web pages
but they filtered it down to like around
160 billion tokens which is done
by uh you know discarding short Pages
removing offensive words and Pages uh
interestingly removing things that had
code
so if it had any code on the page they
would remove the whole page
and then they de-duplicated it because
they don't want the same data more than
once
and then they fine-tuned it or like
trained it later on some academic
supervised tasks
for a bunch of NLP tasks
GPT is the third easy piece which is the
generative pre-trained Transformer and
this one is decoder only so BERT was
encoder only this one's decoder only so
it uses masked attention and uh because
it's predicting the next token it's what
we covered
the largest gpt2 model was 1.5 billion
and it was trained on not common crawl
because they thought it was just too too
noisy so they formed their own data set
called Web text where they scraped links
from Reddit that had at least three
Karma which was like you know probably a
useful link
and then de-duplicated it some
heuristic filtering but it was uh eight
million documents or so
and I want to talk about the encoding so
how does the GPT tokenize so this is how
it actually does it there's like a
tokenizer on the
open AI website
so one thing you might notice is like
some words mapped to one token but some
don't
and then a Unicode character is
representable but it's like a lot of
tokens for some reason
and the numbers are interestingly
tokenized at the bottom where it's like
one two three is it's actually its own
token and then 45 is its own token
so
uh this is a middle ground called byte
pair encoding it's a middle ground
between old school tokenization where
you would take each word and tokenize it
and throw out words that like weren't
frequent enough replace them with a
special like out-of-vocabulary token
and the goal like the gold standard tokenization
would be just to use UTF-8 bytes it just
doesn't work empirically uh you know
wasn't found to work so the middle
ground is you merge some frequently
occurring things and you and you set
tokens for them but you're able to fall
back to like bytes if you need to
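if you want to poke at this behavior yourself, OpenAI's tiktoken library exposes the GPT-2 byte-pair encoder; the exact splits below are illustrative, not guaranteed

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")

print(enc.encode("hello world"))   # common words usually map to a single token each
print(enc.encode("12345"))         # numbers often split into chunks rather than digits
print(enc.encode("猫"))            # rare Unicode characters fall back toward byte-level tokens
```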
um gpt3 came out in 2020 and it was just
like gpt2 like the architecture is
exactly the same but it was 100 times
larger
and because it was so much larger it
started exhibiting these abilities of
like few shot learning which was
not that surprising but also zero shot
learning where you could just describe
the task and then it would be able to do
a really good job doing it
and it seemed like it was just getting
better and better the more parameters
you added uh and um
and it definitely is better the more
examples you give it but it's also
pretty good with just zero shot examples
it was trained on
the original webtext Corpus but also
just the raw common crawl filter down
and also a selection of books from some
sketchy sources and then all of
Wikipedia
and it's interesting to look at what the
top pages
in the web text and the common crawl
data sets are
so for the web text it's a bunch of news
sites it's like Huffington Post New York
Times BBC Twitter the guardian and then
for common crawl it's a lot of patents
for some reason then a bunch of news but
also some papers like science papers
okay in total 500 billion tokens they
only trained on 300 billion tokens so
they actually didn't even see this whole
Corpus in training so that's another
counter-intuitive thing about this llm
training is it actually only sees each data
point once
that's not quite true if it's kind of
sampled but like
the mindset is like you're only
seeing something once you get one shot
to predict on it
and for gpt4 we don't really know what's
going on because given both the
competitive landscape and the safety
implications they shared no further details
about the architecture
or the data set construction or the
Training Method or anything like that
but it's safe to assume that it's pretty
large because that's the trend like the
more computation you use to train these
AI systems the better they get
and people keep training larger and
larger ones
and that points to the bitter lesson of
Rich Sutton who's a reinforcement
learning professor
which is basically like no matter how
hard you try to come up with like cool
math and algorithms and stuff you're
going to get beat by someone just
stacking more layers
and uh
and that you know that that is bitter
but we can still do some science so like
what exactly is the relationship between
increasing the model size and the amount
of compute and increasing the data set
size
so scientists at deepmind set out to
answer this uh with a paper called
training compute optimal LLMs which is
commonly known as chinchilla because
that's the name of the model that they
eventually trained but what they did is
they came up with formulas for like
answering the question if I had a fixed
compute budget
how should I distribute it should I add
more parameters to my model or should I
train a smaller model on more data or
just go through the data more times
and what they found is that like most
llms in literature had too many
parameters for the amount of data that
they saw
and so to capitalize on this they
trained this chinchilla model which is
only 70 billion and they showed that it
actually beat the performance of a model
four times the size called gopher but
what happened is that it was four times
smaller but it saw four times more
data so it was trained on like 1.4
trillion tokens whereas all the other
models were only trained on 300 billion
why 300 billion because that was the
gpt3 paper and everyone else just just
started you know they just wanted to
replicate gpt3 so they only did 300
billion I'm not sure
but note that this is still actually not
even going through all the data that we
have you could keep training it and
having the model see the data over and
over again
it might help I think that's kind of an
open question right now
so llama came out recently as an open
source chinchilla optimal llm from meta
research
they released several sizes from 7
billion to 65 billion and all of them
saw at least one trillion tokens
and they benchmarked it competitively against
gpt3 and other state-of-the-art llms
it's it is open source but it's
non-commercial license for the weights
for the pre-trained weights
what was it trained on it was trained on
you know cut like a custom common crawl
filtering C4
GitHub
Wikipedia some books um some scientific
papers and recently it was replicated
this data set was replicated by an
effort called red pajama which is also
training models to replicate llama
but what's interesting here is like
GitHub why is GitHub in here
so why would we include code in the
training data because remember like T5
paper actually removed code from the
training data but now we're adding it
again like five percent of the total
training data so I think the answer is
just empirically people found that when
you include code it actually improves
performance on non-code tasks I think
OpenAI found this with their Codex
model which is the first model where
they trained it on some code but then
they they actually started with gpt3 so
they trained gpt3 then they fine-tuned
it on code and they saw that it was good
on code but it was actually better on
like reasoning tasks than gpt3 was
and so since then people have been
adding code there's an open source data
set called the stack and it's all from
GitHub basically but they try to respect
licenses
so check it out if you're interested and
then there's another important part to
this llm story which is instruction
tuning
so
when gpt3 was published people were
kind of, their minds were blown just
by few shot like the fact that you can
produce or provide some examples of what
you want and then the model just kind of
gets it and starts doing it that is that
is cool it's it's also sometimes called
in context learning uh but the mindset
is really like text completion right
it's like you're completing what I've
already started
but by now and by the time of the chat
GPT release the mindset is that things
should be zero shot so it's like I
shouldn't have to provide examples I
should just say what I want the model to
do and then it should figure out how to
do it and so that's the instruction
following mindset
and the way we got from text completion
to instruction following is with
supervised fine tuning so if we want the
model to do a good job on stuff like
this like zero shot tasks then we need
that in the data set but there's very
little text on the internet that's like
of this form
so what we can do is we can gather our
own data set of zero shot inputs and
like great outputs
and um
and then fine tune the pre-trained model
on the data set and and profit and
that's exactly what openai started doing
they hired thousands of contractors to
gather this data
they published the paper about
doing that and also doing it even more
advancedly with reinforcement learning
I don't think we need the details not
very important but basically like once
you train the model with this
reinforcement learning from Human
feedback it becomes much better at
following instructions than the base GPT
model
and so they released that as text
davinci-002
chat GPT was like further reinforcement
learning trained on not even just zero
shot tasks but like whole conversations
and uh introduces the ChatML format
where you have like user and assistant
messages and Special Assistant message
and kind of interesting to think about
the GPT lineage so like gpt3 came out in
2020 it's called DaVinci
and then open AI experimented trained it
on code that became the Codex models
like code-davinci-001
and they also experiment with
instruction tuning so that became
instruct-davinci-beta and text-davinci-
001
and then they kind of realized that like
they really just need to see a lot of
code even in the pre-training and so
they trained code-davinci-002
this is all conjecture by the way it's
not like for sure but um it has both
language model abilities and code
generation abilities then you
instruction tune it and then you can
further fine tune it for kind of
like standard GPT applications or
specifically for chat GPT applications
but fine tuning is not free it's really
great but it imposes what's called an
alignment tax
which is that the zero shot ability
increases but the few shot learning
ability probably decreases and the
model's confidence and its answers also
becomes less well calibrated
um so you can kind of think of it as
like
the base model
before fine-tuning
is
kind of knows what it knows and it'll
complete text for you in like the way
that it knows how to do them but then
you teach it to complete text in
different ways and because you're like
teaching it this different thing it kind
of gets confused about what it actually
knows
uh interestingly it's possible to steal
this like fine-tuning
the llama model that we
saw was quickly fine-tuned by a Stanford
team on
um some instruction
instructions but the instructions they
actually didn't pay contractors to get
they just asked gpt3 they gave
gpt3 instructions and then gpt3 would
like do it and then they would take that
as an example for llama so it only
cost them 600 dollars to reproduce uh you know
I mean it's not as good as gpt3
instruction following but it's it's
pretty good
there is a data set for instruction
tuning uh specifically in
the chat
um you know Paradigm called open
assistant
and there's one last idea I want to
share which is retrieval enhancing
so this is a model called retro from
deepmind and basically it's like we have
these large models because they have to
learn a lot of facts uh about the world
and also they have to be good at
reasoning and like writing code and
stuff like that
but can we train a smaller model that's
like only good at reasoning and writing
code but then if it needs to like say
facts about the world it just kind of
looks them up from some database
so what they did is they like BERT
encoded a bunch of sentences and stored them
in like a trillion token database
and then had a small model uh train
where it would be able to fetch things
from this database and kind of have them
in the context
they haven't been able to get it to work
as well as just like large language
models but I think that's a matter of
time like I think the this approach kind
of points to the future of LLMs