How to Build an LLM from Scratch | An Overview
Summary
TLDR: The video provides an overview of key considerations when building a large language model from scratch in 2024, a now more feasible endeavor thanks to advances in AI. It steps through the process, from curating high-quality, diverse training data, to designing an efficient Transformer architecture, to leveraging techniques like mixed precision to train at scale, to evaluating model performance on benchmarks. While still resource-intensive, building an LLM may make sense for certain applications. The video concludes by noting that base models are usually then customized via prompt engineering or fine-tuning.
Takeaways
- 😊 Building LLMs is gaining popularity due to increased interest after ChatGPT's release
- 📈 Costs to train LLMs range from $100K (10B parameters) to $1.5M (100B parameters)
- 🗃️ High quality and diverse training data is critical for LLM performance
- ⚙️ Transformers with causal decoding are the most popular LLM architecture
- 👩💻 Many design choices exist when constructing LLM architectures
- 🚦 Parallelism, mixed precision, and optimizers boost LLM training efficiency
- 📊 Hyperparameters like batch size, learning rate, and dropout affect stability
- 📈 LLMs should balance model size, compute, and training data to prevent over/underfitting
- ✅ Benchmark datasets help evaluate capabilities on tasks like QA and common sense
- 🔄 Fine-tuning and prompt engineering can adapt pretrained LLMs for downstream uses
Q & A
What are the four main steps involved in building a large language model from scratch?
-The four main steps are: 1) Data curation 2) Model architecture 3) Training the model at scale 4) Evaluating the model.
What type of model architecture is commonly used for large language models?
-Transformers have emerged as the state-of-the-art architecture for large language models.
Why is data curation considered the most important step when building a large language model?
-Data curation is critical because the quality of the model is driven by the quality of the data. Large language models require large, high-quality training data sets.
What are some key considerations when preparing the training data?
-Some key data preparation steps include: quality filtering, deduplication, privacy redaction, and tokenization.
What are some common training techniques used to make it feasible to train large language models?
-Popular training techniques include mixed precision training, 3D parallelism, zero redundancy optimizers, checkpointing, weight decay, and gradient clipping.
How can you evaluate a text generation model on multiple choice benchmark tasks?
-You can create prompt templates with a few shot examples to guide the model to return one of the multiple choice tokens as its response.
What are some pros and cons of prompt engineering versus model fine-tuning?
-Prompt engineering avoids changing the original model but requires more effort to create effective prompts. Fine-tuning adapts the model for a specific use case but risks degrading performance on other tasks.
What are some examples of quality filtering approaches for training data?
-Classifier-based filtering using a text classification model, heuristic-based rules of thumb to filter text, or a combination of both approaches.
What considerations go into determining model size and training time?
-You generally want around 20 tokens per model parameter in the training data. And a 10x increase in model parameters requires around a 100x increase in computational operations.
Why might building a large language model from scratch not be necessary?
-Using an existing model with prompt engineering or fine-tuning is better suited for most use cases. Building from scratch has high costs and only makes sense in certain specialized cases.
Outlines
😄 Intro to building language models from scratch
The paragraph introduces the topic of building large language models from scratch. It notes the increasing interest in this from businesses and organizations post-ChatGPT. It highlights considerations like when it might make sense to build vs using existing models, and breaks down the process into 4 key steps: data curation, model architecture, training at scale, and evaluation.
📚 Data curation for language models
The paragraph discusses data curation, noting this is the most time consuming but important step. It covers sourcing training data, highlighting common sources like the internet and public/private datasets. It also covers data diversity, showing how different models use different compositions of data, and data preparation like quality filtering, de-duplication, privacy redaction and tokenization.
🚧 Model architecture decisions
The paragraph provides an overview of model architecture decisions when building a language model. It focuses on Transformer models, explaining the encoder-decoder structure and detailing specific considerations like residual connections, normalization strategies, activation functions, positional encodings, and model size in relation to training data.
⚙️ Training language models at scale
The paragraph discusses training large language models, which requires leveraging computational tricks and techniques. It covers mixed precision training, 3D parallelism, zero redundancy optimization, and training stability techniques like checkpointing, weight decay, and gradient clipping. It also notes common hyperparameter choices.
📈 Evaluating language models
The paragraph focuses on evaluating trained language models using benchmarks like the open LLM leaderboard. It details strategies for multiple choice tasks using prompt templating. It also covers evaluation options for open-ended tasks like human evaluation, NLP metrics, and using auxiliary classifiers.
😃 What's next after training a model
The closing paragraph notes that a trained model is often just the starting point for building something practical. It highlights two directions - prompt engineering to use it as-is, or fine-tuning the model for a specific use case. It concludes by noting the pros and cons of these approaches.
Keywords
💡large language model
💡training data
💡Transformer
💡residual connections
💡mixed precision training
💡position embeddings
💡prompt engineering
💡checkpointing
💡model parallelism
💡model evaluation
Highlights
Building large language models was an esoteric and specialized activity reserved mainly for cutting-edge AI research, but today many businesses and enterprises have an interest in building them
Bloomberg GPT is a large language model specifically built to handle tasks in the space of finance
Building an LLM from scratch is often not necessary; prompt engineering or fine-tuning an existing model is better suited for most use cases
Back-of-the-napkin math shows training a 10 billion parameter model costs around $100,000 in compute and a 100 billion parameter model costs around $1.5 million
Data curation is the most important and time consuming part of building an LLM
LLMs require large training sets, like half a trillion to 3.5 trillion tokens, roughly a million novels to a billion news articles
Common Crawl, C4, Falcon Refined Web, and the Pile are popular publicly available training data sets for LLMs
Private data sources can provide strategic advantage for business applications of LLMs
The Alpaca model used structured text generated by GPT-3 as its training data
Data diversity in training sets leads to general purpose models good at wide variety of tasks
Transformers have emerged as state-of-the-art model architecture for LLMs due to their use of attention
Causal language modeling with decoder-only architecture most popular for LLMs
Mixed precision, model/pipeline/data parallelism, zero redundancy optimization are tricks to reduce LLM training time
Checkpointing, weight decay, gradient clipping important for training stability at scale
Open LLM Leaderboard provides model evaluation on ARC, HellaSwag, MMLU, and TruthfulQA benchmarks
Transcripts
hey everyone I'm Shaw and this is the
sixth video in the larger series on how
to use large language models in practice
in this video I'm going to review key
aspects and considerations for building
a large language model from scratch if
you Googled this topic even just one
year ago you'd probably see something
very different than we see today
building large language models was a
very esoteric and specialized activity
reserved mainly for Cutting Edge AI
research but today if you Google how to
build an llm from scratch or should I
build a large language model you'll see
a much different story with all the
excitement surrounding large language
models post-ChatGPT we now have an
environment where a lot of businesses
and Enterprises and other organizations
have an interest in building these
models perhaps one of the most notable
examples comes from Bloomberg with
Bloomberg GPT which is a large language
model that was specifically built to
handle tasks in the space of Finance
however the way I see it building a
large language model from scratch is
often not necessary for the vast
majority of llm use cases using
something like prompt engineering or
fine-tuning an existing model is going
to be much better suited than building a
large language model from scratch with
that being said it is valuable to better
understand what it takes to build one of
these models from scratch and when it
might make sense to do it before diving
into the technical aspects of building a
large language model let's do some
back-of-the-napkin math to get a sense of the
financial costs that we're talking about
here taking as a baseline llama 2 the
relatively recent large language model
put out by meta these were the
computational costs associated with the
7 billion parameter version and 70
billion parameter versions of the model
so you can see for Llama 2 7B it took
about 180,000 GPU hours to train
that model while for 70b a model 10
times as large it required 10 times as
much compute so this required 1.7
million GPU hours so if we just do what
physicists love to do we can just take
orders of magnitude and based on the
Llama 2 numbers we'll say a 10 billion
parameter model takes on the order of
100,000 GPU hours to train while 100
billion parameter model takes about a
million GPU hours to train so how can we
translate this into a dollar amount here
we have two options option one is we can
rent the gpus and compute that we need
to train our model via any of the big
cloud providers out there an Nvidia A100
which was used to train Llama 2 is going
to be on the order of $1 to $2 per GPU
per hour so just doing some simple
multiplication here that means the 10
billion parameter model is going to be
on the order of $150,000 just to
train and the 100 billion parameter
model will be on the order of $1.5
million to train alternatively instead
of renting the compute you can always
buy the hardware in that case we just
have to take into consideration the
price of these gpus so let's say an a100
is about $10,000 and you want to form a
GPU cluster which is about 1,000 gpus
the hardware costs alone are going to be
on the order of like $10 million but
that's not the only cost when you're
running a cluster like this for weeks it
consumes a tremendous amount of energy
and so you also have to take into
account the energy cost so let's say
training a 100 billion parameter model
consumes about 1,000 megawatt hours of
energy and let's just say the price of
energy is about $100 per megawatt hour
then that means the marginal cost of
training a 100 billion parameter model
is going to be on the order of $100,000
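To make this back-of-the-napkin math concrete, here is a small sketch that just multiplies out the rough order-of-magnitude assumptions stated above (GPU-hour counts, hourly rate, hardware price, and energy figures are the video's estimates, not measured values):

```python
# Back-of-the-napkin LLM training cost estimates (order-of-magnitude only).
# All inputs are the rough assumptions stated above, not measured values.

def rental_cost(gpu_hours: float, usd_per_gpu_hour: float = 1.5) -> float:
    """Cost of renting cloud GPUs for the full training run."""
    return gpu_hours * usd_per_gpu_hour

def energy_cost(megawatt_hours: float, usd_per_mwh: float = 100.0) -> float:
    """Marginal electricity cost if you own the hardware."""
    return megawatt_hours * usd_per_mwh

# ~10B parameter model: on the order of 100,000 GPU hours
print(f"10B model, rented:  ~${rental_cost(100_000):,.0f}")     # ~$150,000

# ~100B parameter model: on the order of 1,000,000 GPU hours
print(f"100B model, rented: ~${rental_cost(1_000_000):,.0f}")   # ~$1,500,000

# Owning ~1,000 A100s at ~$10,000 each
print(f"Hardware: ~${1_000 * 10_000:,.0f}")                     # ~$10,000,000

# ~1,000 MWh at ~$100/MWh to train a 100B model
print(f"Energy:   ~${energy_cost(1_000):,.0f}")                 # ~$100,000
```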
okay so now that you've realized you
probably won't be training a large
language model anytime soon or maybe you
are I don't know let's dive into the
technical aspects of building one of
these models I'm going to break the
process down into four steps one is data
curation two is the model architecture
three is training the model at scale and
four is evaluating the model okay so
starting with data curation I would
assert that this is the most important
and perhaps most time consuming part of
the process and this comes from the
basic principle of machine learning of
garbage in garbage out put another way
the quality of your model is driven by
the quality of your data so it's super
important that you get the training data
right especially if you're going to be
investing millions of dollars in this
model but this presents a problem large
language models require large training
data sets and so just to get a sense of
this gpt3 was trained on half a trillion
tokens llama 2 was trained on two
trillion tokens and the more recent
Falcon 180b was trained on 3.5 trillion
tokens and if you're not familiar with
tokens you can check out the previous
video in the series where I talk more
about what tokens are and why they're
important but here we can say that as
far as training data go we're talking
about a trillion words of text or in
other words about a million novels or a
billion news articles so we're talking
about a tremendous amount of data going
through a trillion words of text and
ensuring data quality is a tremendous
effort and undertaking and so a natural
question is where do we even get all
this text the most common place is the
internet the internet consists of web
pages Wikipedia forums books scientific
articles code bases you name it
post-ChatGPT there's a lot more controversy
around this and copyright laws the risk
with web scraping yourself is that you
might grab data that you're not supposed
to grab or you don't have the rights to
grab and then using it in a model for
potentially commercial use could come
back and cause some trouble down the
line alternatively there are many public
data sets out there one of the most
popular is common crawl which is a huge
Corpus of text from the internet and
then there are some more refined
versions such as colossal clean crawled
Corpus also called C4 there's also
Falcon refined web which was used to
train Falcon 180b mentioned on the
previous slide another popular data set
is the pile which tries to bring
together a wide variety of diverse data
sources into the training data set which
we'll talk a bit more about in the next
slide and then we have hugging face
which has really emerged as a big player
in the generative Ai and large language
model space who houses a ton of Open
Access Data sources on their platform
another place are private data sources
so a great example of this is FinPile
which was used to train Bloomberg GPT
and the key upside of private data
sources is you own the rights to it
and it's data that no one else has which
can give you a strategic Advantage if
you're trying to build a model for some
business application or for some other
application where there's some
competition or environment of other
players that are also making their own
large language models finally and
perhaps the most interesting is using an
llm to generate the training data a
notable example of this comes from the
alpaca model put out by researchers at
Stanford and what they did was they
trained an llm alpaca using structured
text generated by gpt3 this is my
cartoon version of it you pass on the
prompt make me training data into your
large language model and it spits out
the training data for you turning to the
point of data set diversity that I
mentioned briefly with the pile one
aspect of a good training data set seems
to be data set diversity and the idea
here is that a diverse data set
translates to a model that can
perform well in a wide variety of tasks
essentially it translates into a good
general purpose model here I've listed
out a few different models and the
composition of their training data sets
so you can see gpt3 is mainly web pages
but also some books you see gopher is
also mainly web pages but they got more
books and then they also have some code
in there llama is mainly web pages but
they also have books code and scientific
articles and then Palm is mainly built
on conversational data but then you see
it's trained on web pages books and code
how you curate your training data set is
going to drive the types of tasks the
large language model will be good at and
while we're far away from an exact
science or theory of this particular
data set composition translates to this
type of model or like adding an
additional 3% code in your training data
set will have this quantifiable outcome
in the downstream model while we're far
away from that diversity does seem to be
an important consideration when making
your training data sets another thing
that's important to ask ourselves is how
do we prepare the data again the quality
of our model is driven by the quality of
our data so one needs to be thoughtful
with the text that they use to generate
a large language model and here I'm
going to talk about four key data
preparation steps the first is quality
filtering this is removing text which is
not helpful to the large language model
this could be just a bunch of random
gibberish from some corner of the
internet this could be toxic language or
hate speech found on some Forum this
could be things that are objectively
false like 2 + 2 = 5 which you'll see in
the book 1984 while that text exists out
there it is not a true statement there's
a really nice paper it's called survey
of large language models I think and in
that paper they distinguish two types of
quality filtering the first is
classifier based and this is where
you take a small high-quality data set
and use it to train a text
classification model that allows you to
automatically score text as either good
or bad low quality or high quality so
that precludes the need for a human to
read a trillion words of text to assess
its quality it can kind of be offloaded
to this classifier the other type of
approach they Define is heuristic based
this is using various rules of thumb to
filter the text this could be
removing specific words like explicit
text this could be if a word repeats
more than two times in a sentence you
remove it or using various statistical
properties of the text to do the
filtering and of course you can do a
combination of the two you can use the
classifier based method to distill down
your data set and then on top of that
you can do some heuristics or vice versa
you can use heuristics to distill down
the data set and then apply your
classifier there's no one-size-fits-all
recipe for doing quality filtering
rather there's a menu of many different
options and approaches that one can take
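As a toy illustration of the heuristic-based filtering just described, here is a minimal sketch; the rules, thresholds, and blocklist below are made-up examples, not a recommended recipe:

```python
import re

# Hypothetical rule-of-thumb filters; real pipelines tune these per corpus.
BLOCKLIST = {"badword1", "badword2"}  # placeholder for an explicit-text list

def passes_heuristics(doc: str) -> bool:
    words = doc.lower().split()
    if len(words) < 50:                      # too short to be useful
        return False
    if any(w in BLOCKLIST for w in words):   # explicit/toxic terms
        return False
    # drop docs where any word repeats excessively within one sentence
    for sentence in re.split(r"[.!?]", doc):
        toks = sentence.lower().split()
        if toks and max(toks.count(t) for t in set(toks)) > 2:
            return False
    # statistical property: ratio of alphabetic characters
    alpha = sum(c.isalpha() for c in doc)
    return alpha / max(len(doc), 1) > 0.6

corpus = ["some long document ...", "buy buy buy now!!!"]
clean = [d for d in corpus if passes_heuristics(d)]
```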
next is deduplication this is removing
several instances of the same or very
similar text and the reason this is
important is that duplicate texts can
bias the model and disrupt training
namely if you have some web page that
exists on two different domains one ends
up in the training data set one ends up
in the testing data set this causes some
trouble trying to get a fair assessment
of model performance during training
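A minimal sketch of exact deduplication via hashing normalized text (real pipelines typically also use fuzzy methods such as MinHash to catch near-duplicates):

```python
import hashlib

def dedupe(docs):
    """Keep the first copy of each document, comparing on normalized text."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())  # collapse case/whitespace
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# The same page scraped from two domains collapses to one copy.
docs = ["Same page, two domains.", "same page,  two domains."]
print(len(dedupe(docs)))  # 1
```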
another key step is privacy redaction
especially for text grabbed from the
internet it might include sensitive or
confidential information it's important
to remove this text because if sensitive
information makes its way into the
training data set it could be
inadvertently learned by the language
model and be exposed in unexpected ways
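As an illustrative sketch of privacy redaction, here is a regex-based pass over a few common PII patterns; these patterns are simplistic examples, and a production system would use far more robust detection:

```python
import re

# Simplistic example patterns; real redaction uses much stronger PII detection.
PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(redact("Contact jane.doe@example.com or 555-867-5309."))
# Contact [EMAIL] or [PHONE].
```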
finally we have the tokenization step
which is essentially translating text
into numbers and the reason this is
important is because neural networks do
not understand text directly they
understand numbers so anytime you feed
something into a neural network it needs
to come in numerical form while there
are many ways to do this mapping one of
the most popular ways is via the byte
pair encoding algorithm which
essentially takes a corpus of text and
derives from it an efficient subword
vocabulary it figures out the best
choice of subwords or character
sequences to define a vocabulary from
which the entire Corpus can be
represented for example maybe the word
efficient gets mapped to an integer and
exists in the vocabulary maybe sub with
a dash gets mapped to its own integer
word gets mapped to its own integer
vocab gets mapped to its own integer and
ulary gets mapped to its own integer so
this string of text here efficient
subword vocabulary might be translated
into five tokens each with their own
numerical representation so one two
three four five there are python
libraries out there that implement this
algorithm so you don't have to do it
from scratch namely there's the
SentencePiece Python library there's also the
tokenizers library coming from Hugging
Face here are the citation numbers and I
provide the link in the description and
comment section below.
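For instance, here is a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library mentioned above (corpus.txt and the vocabulary size are placeholders):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a byte-pair-encoding vocabulary on a (placeholder) text corpus.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Text in, integer token IDs out.
encoding = tokenizer.encode("efficient subword vocabulary")
print(encoding.tokens)  # e.g. ['efficient', 'sub', 'word', 'vocab', 'ulary']
print(encoding.ids)     # the corresponding integers
```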
Moving on to step two model architecture so in this step
we need to define the architecture of
the language model and as far as large
language models go Transformers have
emerged as the state-of-the-art
architecture and a Transformer is a
neural network architecture that
strictly uses attention mechanisms to
map inputs to outputs so you might ask
what is an attention mechanism and here
I Define it as something that learns
dependencies between different elements
of a sequence based on position and
content this is based on the intuition
that when you're talking about language
the context matters and so let's look at
a couple examples so if we see the
sentence I hit the baseball with a
bat the appearance of baseball implies
that bat is probably a baseball bat and
not a nocturnal mammal this is the
picture that we have in our minds this
is an example of the content of the
context of the word bat so bat exists in
this larger context of this sentence and
the content is the words making up this
context the content of the context
drives what word is going to come next
and the meaning of this word here but
content isn't enough the positioning of
these words is also important so to see
that consider another example I hit the
bat with a baseball now there's a bit
more ambiguity of what bat means it
could still mean a baseball bat but
people don't really hit baseball bats
with baseballs they hit baseballs with
baseball bats one might reasonably think
bat here means the nocturnal mammal and
so an attention mechanism captures both
these aspects of language more
specifically it will use both the
content of the sequence and the
positions of each element in the
sequence to help infer what the next
word should be while at first it might
seem that Transformers are a constrained
and particular architecture we actually
have an incredible amount of freedom and
choices we can make as developers making
a Transformer model so at a high level
there are actually three types of
Transformers which follows from the two
modules that exist in the Transformer
architecture namely we have the encoder
and decoder so we can have an encoder by
itself that can be the architecture we
can have a decoder by itself that's
another architecture and then we can
have the encoder and decoder working
together and that's the third type of
Transformer so let's take a look at
these One By One The encoder only
Transformer translates tokens into a
semantically meaningful
representation and these are typically
good for text classification tasks or if
you're just trying to generate an
embedding for some text next we have the
decoder only Transformer which is
similar to an encoder because it
translates text into a semantically
meaningful internal representation but
decoders are trying to predict the next
word they're trying to predict future
tokens and for this decoders do not
allow self attention with future
elements which makes it great for text
generation tasks and so just to get a
bit more intuition of the difference
between the encoder self attention
mechanism and the decoder self attention
mechanism the encoder any part of the
sequence can interact with any other
part of the sequence if we were to zoom
in on the weight matrices that are
generating these internal
representations in the encoder you'll
see that none of the weights are zero on
the other hand for a decoder it uses
so-called masked self attention so any
weights that would connect a token to a
token in the future is going to be set
to zero it doesn't make sense for the
decoder to see into the future if it's
trying to predict the future that would
kind of be like cheating.
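To make the masked self-attention idea concrete, here is a small PyTorch sketch of applying a causal mask to a toy-sized matrix of attention scores:

```python
import torch

T = 5                                    # sequence length (toy size)
scores = torch.randn(T, T)               # raw attention scores (query x key)

# Lower-triangular mask: position i may only attend to positions <= i.
mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float("-inf"))

weights = torch.softmax(scores, dim=-1)  # rows sum to 1; future weights are 0
print(weights)
```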
And then finally we can combine the encoder and
decoder together to create another
choice of model architecture this was
actually the original design of the
Transformer model kind of what's
depicted here and so what you can do
with the encoder decoder model that you
can't do with the others is the
so-called cross attention so instead of
just being restricted to self attention
with the encoder or mask self attention
with the decoder the encoder decoder
model allows for cross attention where
the embeddings from the encoder so this
will generate a sequence and the
internal embeddings of the decoder which
will be another sequence will have this
attention weight Matrix so that the
encoders representations can communicate
with the decoder representations and
this tends to be good for tasks such as
translation which was the original
application of this Transformers model
while we do have three options to choose
from when it comes to making a
Transformer the most popular by far is
this decoder only architecture where
you're only using this part of the
Transformer to do the language modeling
and this is also called causal language
modeling which basically means given a
sequence of text you want to predict
future text beyond just this high-level
choice of model architecture there are
actually a lot of other design choices
and details that one needs to take into
consideration first is the use of
residual connections which are just
Connections in your model architecture
that allow intermediate training values
to bypass various hidden layers and so
to make this more concrete this is from
reference number 18 linked in the
description and comment section below
what this looks like is you have some
input and instead of strictly feeding
the input into your hidden layer which
is this stack of things here you allow
it to go to both the hidden layer and to
bypass the hidden layer then you can
aggregate the original input and the
output of the Hidden layer in some way
to generate the input for the next layer
and of course there are many different
ways one can do this with all the
different details that can go into a
hidden layer you can have the input and
the output of the Hidden layer be added
together and then have an activation
applied to the addition you can have the
input and the output of the Hidden layer
be added and then you can do some kind
of normalization and then you can add
the activation or you can have the
original input and the output of the
Hidden layer just be added together you
really have a tremendous amount of
flexibility and design Choice when it
comes to these residual Connections in
the original Transformers architecture
the way they did it was something
similar to this where the input bypasses
this multiheaded attention layer and is
added and normalized with the output of
this multiheaded attention layer and then the
same thing happens for this layer same
thing happens for this layer same thing
happens for this layer and same thing
happens for this layer.
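As a sketch of the add-then-normalize style of residual connection used in the original Transformer (the hidden layer here is a stand-in for the attention or feed-forward block):

```python
import torch
import torch.nn as nn

d_model = 64
hidden = nn.Linear(d_model, d_model)  # stand-in for attention/feed-forward
norm = nn.LayerNorm(d_model)

x = torch.randn(8, d_model)           # a batch of inputs

# Residual connection: the input bypasses the hidden layer, then is
# added to the hidden layer's output and normalized ("Add & Norm").
out = norm(x + hidden(x))
```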
Next is layer normalization which is rescaling values
between layers based on their mean and
standard deviation and so when it comes
to layer normalization there are two
considerations that we can make one is
where you normalize so there are
generally two options here you can
normalize before the layer also called
pre-layer normalization or you can
normalize after the layer also called
post layer normalization another
consideration is how you normalize one
of the most common ways is via layer
norm and this is the equation here this
is your input X you subtract the mean of
the input and then you divide it by the
variance plus some noise term then you
multiply it by some gain factor and then
you can have some bias term as well an
alternative to this is the root mean
Square Norm or RMS Norm which is very
similar it just doesn't have the mean
term in the numerator and then it
replaces this denominator with just the
RMS while you have a few different
options on how you do layer
normalization the most common based on
that survey of large language models I
mentioned earlier reference number eight
pre-layer normalization seems to be most
common combined with this vanilla layer
Norm approach.
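Here is a minimal sketch of the two normalization equations just described, written out directly (the gain and bias would be learnable parameters in practice):

```python
import torch

def layer_norm(x, gain, bias, eps=1e-5):
    # (x - mean) / sqrt(variance + eps), then scale and shift
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gain * (x - mu) / torch.sqrt(var + eps) + bias

def rms_norm(x, gain, eps=1e-5):
    # no mean subtraction; divide by the root mean square instead
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gain * x / rms

x = torch.randn(4, 64)
g, b = torch.ones(64), torch.zeros(64)
y1 = layer_norm(x, g, b)
y2 = rms_norm(x, g)
```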
Next we have activation functions and these are non-linear
functions that we can include in the
model which in principle allow it to
capture complex mappings between inputs
and outputs here there are several
common choices for large language models
namely GELU ReLU Swish SwiGLU GeGLU
and I'm sure there are more but GLUs
seem to be the most common for large
language models another design Choice Is
How We Do position embeddings position
embeddings capture information about
token positions the way that this was
done in the original Transformers paper
was using these sine and cosine basis
functions which added a unique value to
each token position to represent its
position and you can see in the original
Transformers architecture you had your
tokenized input and the positional
encodings were just added to the
tokenized input for both the encoder
input and the decoder input more
recently there's this idea of relative
positional encodings so instead of just
adding some fixed positional encoding
before the input is passed into the
model the idea with relative positional
encodings is to bake positional
encodings into the attention mechanism
and so I won't dive into the details of
that here but I will provide this
reference self attention with relative
position representations also citation
number 20 the last consideration that
I'll talk about when it comes to model
architecture is how big do I make it and
the reason this is important is because
if a model is too big or trained too long
it can overfit on the other hand if a
model is too small or not trained long
enough it can underperform and these are
both in the context of the training data
and so there's this relationship between
the number of parameters the number of
computations or training time and the
size of the training data set there's a
nice paper by Hoffmann et al. where they
do an analysis of optimal compute
considerations when it comes to large
language models I've just grabbed a
table from that paper that summarizes
their key findings what this is saying
is that a 400 million parameter model
should undergo on the order of let's say
like 2 × 10^19 floating point
operations and have a training data
consisting of 8 billion tokens and then
a 1 billion parameter model should
have 10 times as many floating point
operations and be trained on 20 billion
tokens and so on and so forth my
kind of summarization takeaway from this
is that you should have about 20 tokens
per model parameter it's not going
to be very precise but might be a good
rule of thumb and then we have for every
10x increase in model parameters there's
about a 100x increase in floating Point
operations so if you're curious about
this check out the paper linked in the
description below even if this isn't an
optimal approach in all cases it may be
a good starting place and rule of thumb
for training these models.
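Putting those two rules of thumb into a tiny sketch (very rough orders of magnitude, per the caveats above; the 6 × parameters × tokens FLOPs approximation is a common outside assumption, not something stated in the video):

```python
# Rough rules of thumb from the discussion above:
#   ~20 training tokens per model parameter
#   ~100x more floating point operations per 10x more parameters

def tokens_needed(n_params: float) -> float:
    return 20 * n_params

# Assumed approximation: training FLOPs ~ 6 * parameters * tokens. With
# tokens = 20 * parameters, compute grows quadratically in model size,
# so 10x the parameters means ~100x the FLOPs.
def train_flops(n_params: float) -> float:
    return 6 * n_params * tokens_needed(n_params)

print(f"10B model:  {tokens_needed(10e9):.1e} tokens")   # 2.0e+11 (~200B)
print(f"100B model: {tokens_needed(100e9):.1e} tokens")  # 2.0e+12 (~2T)
print(f"compute ratio: {train_flops(100e9) / train_flops(10e9):.0f}x")  # 100x
```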
So now we come to step three which is training these
models at scale so again the central
challenge of these large language models
is their scale when you're training
on trillions of tokens and you're
talking about billions tens of billions
hundreds of billions of parameters
there's a lot of computational cost
associated with these things and it is
basically impossible to train one of
these models without employing some
computational tricks and techniques to
speed up the training process here I'm
going to talk about three popular
training techniques the first is mixed
Precision training which is essentially
when you use both 32bit and 16 bit
floating Point numbers during model
training such that you use the 16bit
floating Point numbers whenever possible
and 32bit numbers only when you have to
more on mixed Precision training in that
survey of large language models and then
there's also a nice documentation by
Nvidia linked below.
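As one concrete PyTorch sketch of mixed precision training (the model, data, and hyperparameters here are toy placeholders; running it as written requires a CUDA GPU):

```python
import torch
import torch.nn as nn

device = "cuda"                                   # mixed precision needs a GPU
model = nn.Linear(128, 1).to(device)              # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                               # toy training loop
    x = torch.randn(32, 128, device=device)
    y = torch.randn(32, 1, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # fp16 where safe, fp32 otherwise
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                 # scale loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```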
Next is this approach of 3D parallelism which is
actually the combination of three
different parallelization strategies
which are all listed here and I'll just
go through them one by one first is
pipeline parallelism which is
Distributing the Transformer layers
across multiple gpus and it actually
does an additional optimization where it
puts adjacent layers on the same GPU to
reduce the amount of cross GPU
communication that has to take place the
next is model parallelism which
basically decomposes The Matrix
multiplications that make up the model
into smaller Matrix multiplies and then
distributes those Matrix multiplies
across multiple gpus and then
finally there's data parallelism which
distributes training data across
multiple gpus but one of the challenges
with parallelization is that
redundancies start to emerge because
model parameters and Optimizer States
need to be copied across multiple gpus
so you're having some portion of the
gpu's precious memory devoted to storing
information that's copied in multiple
places this is where the Zero Redundancy
Optimizer or ZeRO is helpful which
essentially reduces data redundancy
via optimizer state gradient and
parameter partitioning and
so this was just like a surface level
survey of these three training
techniques these techniques and many
more are implemented by the DeepSpeed
Python library and of course DeepSpeed
isn't the only Library out there there
are a few other ones such as Colossal-AI
Alpa and some more which I talk about in
the blog associated with this video
another consideration when training
these massive models is training
stability and it turns out there are a
few things that we can do to help ensure
that the training process goes smoothly
the first is checkpointing which takes a
snapshot of model artifacts so training
can resume from that point this is
helpful because let's say your
training loss is going down it's great
but then you just have this spike in
loss after training for a week and it
just blows up training and you don't
know what happened checkpointing allows
you to go back to when everything was
okay and debug what could have gone
wrong and maybe make some adjustments to
the learning rate or other
hyperparameters so that you can try to
avoid that spike in the loss function
that came up later another strategy is
weight Decay which is essentially a
regularization strategy that penalizes
large parameter values I've seen two
ways of doing this one is either by
adding a term to the objective function
which is like regular regularization
or changing the
parameter update Rule and then finally
we have gradient clipping which rescales
the gradient of the objective function
if it exceeds a pre-specified value so
this helps avoid the exploding gradient
problem which may blow up your training
process.
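A sketch of how these three stability techniques commonly look in a PyTorch training loop (toy model and data; the weight decay, clipping norm, and checkpoint interval are illustrative values):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 1)                            # toy stand-in model
# Weight decay applied via the parameter update rule (AdamW-style)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

for step in range(1000):
    x, y = torch.randn(32, 128), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Gradient clipping: rescale gradients whose norm exceeds 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    if step % 100 == 0:
        # Checkpointing: snapshot model + optimizer so training can resume
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, f"ckpt_{step}.pt")
```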
And then the last thing I want to talk about when it comes to training
are hyperparameters while these aren't
specific to large language models my
goal here is to just lay out some common
choices when it comes to these values so
first we have batch size which can be
either static or dynamic and if it's
static batch sizes are usually pretty
big so on the order of like 16 million
tokens but it can also be dynamic for
example in GPT 3 what they did is they
gradually increased the batch size from
32,000 tokens to 3.2 million tokens next
we have the learning rate and so this
can also be static or dynamic but it
seems that Dynamic learning rates are
much more common for these models a
common strategy seems to go as follows
you have a learning rate that increases
linearly until reaching some specified
maximum value and then it'll reduce via
a cosine Decay until the learning rate
is about 10% of its max value next we
have the optimizer Adam or Adam-based
optimizers are most commonly used for
large language models and then finally
we have Dropout typical values for
Dropout are between 0.2 and 0.5 from the
original Dropout paper by Hinton et al.
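Here is a sketch of the learning rate schedule described above, linear warmup followed by cosine decay down to 10% of the maximum (the max rate and step counts are placeholder values):

```python
import math

def lr_at(step, max_lr=3e-4, warmup=2_000, total=100_000, min_frac=0.1):
    """Linear warmup to max_lr, then cosine decay to min_frac * max_lr."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)      # goes 0 -> 1
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # goes 1 -> 0
    return max_lr * (min_frac + (1 - min_frac) * cosine)

for s in (0, 2_000, 50_000, 100_000):
    print(s, f"{lr_at(s):.2e}")  # 0, peak, mid-decay, ~10% of peak
```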
finally step four is model evaluation so
just because you've trained your model and
you've spent millions of dollars and
weeks of your time if not more it's
still not over typically when you have a
model in hand that's really just the
starting place in many ways next you got
to see what this thing actually does how
it works in the context of the desired
use case the desired application of it
this is where model evaluation becomes
important for this there are many
Benchmark data sets out there here I'm
going to restrict the discussion to the
open llm leaderboard which is a public
llm Benchmark that is continually
updated with new models on Hugging Face's
models platform and the four benchmarks
used in the Open LLM
Leaderboard are ARC HellaSwag MMLU and
TruthfulQA while these are only four of
many possible Benchmark data sets the
evaluation strategies that we can use
for these Benchmark data sets can easily
port to other benchmarks so first I want
to start with just ARC HellaSwag and
MMLU which are multiple choice tasks so a
bit more about these ARC and MMLU are
essentially grade school questions on
subjects like math history common
knowledge you know whatever and it'll be
like a question with a multiple choice
response A B C or D so an example is
which technology was developed most
recently A a cell phone B a microwave C
a refrigerator and D an airplane
HellaSwag is a little bit different these are
specifically questions that computers
tend to struggle with so an example of
this is in the blog associated with this
video which goes like this a woman is
outside with a bucket ET and a dog the
dog is running around trying to avoid a
bath she dot dot dot a rinses the bucket
off with soap and blow dries the dog's
head B uses a hose to keep it from
getting soapy C gets the dog wet then it
runs away again D gets into a bathtub
with a dog and so this is a very strange
question but intuitively humans tend to
do very well on these tasks and
computers do not so while these are
multiple choice tasks and we might think
it should be pretty straightforward to
evaluate model performance on them there
is one hiccup namely these large
language models are typically text
generation models so they'll take some
input text and they'll output more text
they're not classifiers they don't
generate responses like A B C or D or
class one class 2 class 3 class 4 they
just generate text completions and so
you have to do a little trick to get
these large language models to perform
multiple choice tasks and this is
essentially through prompt templates for
example if you have the question which
technology was developed most recently
instead of just passing in this question
and the choices to the large language
model and hopefully it figures out to do
A B C or D you can use a prompt template
like this and additionally prepend the
prompt template with a few shot examples
so the language model will pick up that
I should return just a single token that
is one of these four tokens here so if
you pass this into the model you'll
get a distribution of probabilities for
each possible token and what you can do
then is just evaluate of all the tens of
thousands of tokens that are possible
you just pick the four tokens associated
with A B C or D and see which one is
most likely and you take that to be the
predicted answer from the large language
model while there is this like extra
step of creating a prompt template you
can still evaluate a large language
model on these multiple choice tasks and
in a relatively straightforward way
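Here is a sketch of that evaluation trick using the Hugging Face transformers library; the model name is a small placeholder, and a real evaluation would also prepend few-shot examples to the prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small placeholder model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = (
    "Question: Which technology was developed most recently?\n"
    "A. cell phone\nB. microwave\nC. refrigerator\nD. airplane\n"
    "Answer:"
)
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Of all possible tokens, compare only the four answer tokens.
choice_ids = [tok.encode(" " + c)[0] for c in "ABCD"]
pred = "ABCD"[int(torch.argmax(logits[choice_ids]))]
print("Predicted answer:", pred)
```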
however this is a bit more tricky when
you have open-ended tasks such as for
truthful QA. For truthful QA or other
open-ended tasks where there isn't a
specific one right answer but rather a
wide range of possible right answers
there are a few different evaluation
strategies we can take the first is
human evaluation so a person scores the
completion based on some ground truth
some guidelines or both while this is
the most labor-intensive this may
provide the highest quality assessment
of model completions another strategy is
we could use NLP metrics so this is
trying to quantify the completion
quality using metrics such as perplexity
BLEU score ROUGE score etc so just using
the statistical properties of the
completion as a way to quantify its
quality while this is a lot less labor
intensive it's not always clear what the
mapping between a completion's
statistical properties and the quality
of that completion and then the
third approach which might capture The
Best of Both Worlds is to use an
auxiliary fine-tuned model to rate the
quality of the completions and this was
actually used in the truthful QA paper
should be reference 30 where they
created an auxiliary model called GPT
judge which would take model completions
and classify them as either truthful or
not truthful and then that would help
reduce the burden of human evaluation
when evaluating model outputs okay so
what's next so you've created your large
language model from scratch what do you
do next often this isn't the end of the
story as the name base models might
suggest base models are typically a
starting point not the final solution
they are really just a starting place
for you to build something more
practical on top of and there are
generally two directions here one is via
prompt engineering and prompt
engineering is just feeding things into
the language model and harvesting their
completions for some particular use case
another Direction one can go is via
model fine-tuning which is where you
take the pre-trained model and you adapt
it for a particular use case prompt
engineering and model fine tuning both
have their pros and cons to them if you
want to learn more check out the
previous two videos of this series where
I do a deep dive into each of these
approaches if you enjoyed this content
please consider liking subscribing and
sharing it with others if you have any
questions or suggestions for future
content please drop those in the comment
section below and as always thank you so
much for your time and thanks for
watching