How to Build an LLM from Scratch | An Overview

Shaw Talebi
5 Oct 2023 | 35:44

Summary

TLDR: The video provides an overview of key considerations when building a large language model from scratch, a now more feasible endeavor thanks to advances in AI. It steps through the process, from curating high-quality, diverse training data, to designing an efficient Transformer architecture, to leveraging techniques like mixed precision to train at scale, to evaluating model performance on benchmarks. While still resource-intensive, building an LLM may make sense for certain applications. The video concludes by noting that base models are usually then customized via prompt engineering or fine-tuning.

Takeaways

  • 😊 Building LLMs is gaining popularity due to increased interest after ChatGPT release
  • 📈 Rental compute costs to train LLMs range from roughly $150K (10B parameters) to $1.5M (100B parameters)
  • 🗃️ High quality and diverse training data is critical for LLM performance
  • ⚙️ Transformers with causal decoding are the most popular LLM architecture
  • 👩‍💻 Many design choices exist when constructing LLM architectures
  • 🚦 Parallelism, mixed precision, and optimizers boost LLM training efficiency
  • 📊 Hyperparameters like batch size, learning rate, and dropout affect stability
  • 📈 LLMs should balance model size, compute, and training data to prevent over/underfitting
  • ✅ Benchmark datasets help evaluate capabilities on tasks like QA and common sense
  • 🔄 Fine-tuning and prompt engineering can adapt pretrained LLMs for downstream uses

Q & A

  • What are the four main steps involved in building a large language model from scratch?

    -The four main steps are: 1) Data curation 2) Model architecture 3) Training the model at scale 4) Evaluating the model.

  • What type of model architecture is commonly used for large language models?

    -Transformers have emerged as the state-of-the-art architecture for large language models.

  • Why is data curation considered the most important step when building a large language model?

    -Data curation is critical because the quality of the model is driven by the quality of the data. Large language models require large, high-quality training data sets.

  • What are some key considerations when preparing the training data?

    -Some key data preparation steps include: quality filtering, deduplication, privacy redaction, and tokenization.

  • What are some common training techniques used to make it feasible to train large language models?

    -Popular training techniques include mixed precision training, 3D parallelism, zero redundancy optimizers, checkpointing, weight decay, and gradient clipping.

  • How can you evaluate a text generation model on multiple choice benchmark tasks?

    -You can create prompt templates with a few shot examples to guide the model to return one of the multiple choice tokens as its response.

  • What are some pros and cons of prompt engineering versus model fine-tuning?

    -Prompt engineering avoids changing the original model but requires more effort to create effective prompts. Fine-tuning adapts the model for a specific use case but risks degrading performance on other tasks.

  • What are some examples of quality filtering approaches for training data?

    -Classifier-based filtering using a text classification model, heuristic-based rules of thumb to filter text, or a combination of both approaches.

  • What considerations go into determining model size and training time?

    -You generally want around 20 tokens per model parameter in the training data. And a 10x increase in model parameters requires around a 100x increase in computational operations.

  • Why might building a large language model from scratch not be necessary?

    -Using an existing model with prompt engineering or fine-tuning is better suited for most use cases. Building from scratch has high costs and only makes sense in certain specialized cases.
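The sizing answer above (about 20 tokens per parameter, and roughly 100x the compute for 10x the parameters) can be checked with a short calculation. This is a minimal sketch, not the video's own method: the `6 * parameters * tokens` estimate of training operations is a commonly used approximation that is assumed here, and the function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted in the Q&A above


def tokens_needed(n_params: int) -> int:
    """Training tokens suggested by the ~20 tokens-per-parameter rule."""
    return TOKENS_PER_PARAM * n_params


def approx_train_ops(n_params: int) -> int:
    """Rough training compute.

    Assumption (not from the video): operations ~ 6 * parameters * tokens.
    """
    return 6 * n_params * tokens_needed(n_params)


small = 10_000_000_000    # 10B parameters
large = 100_000_000_000   # 100B parameters

for n in (small, large):
    print(f"{n:,} params -> {tokens_needed(n):,} tokens, "
          f"~{approx_train_ops(n):.2e} operations")

# Because the token count grows with the parameter count, compute grows
# with the square of the parameter count: 10x parameters -> 100x compute.
print(approx_train_ops(large) // approx_train_ops(small))
```

Under these assumptions the 100x figure falls out directly, since both the parameter count and the token count grow by 10x.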

Outlines

00:00

😄 Intro to building language models from scratch

The paragraph introduces the topic of building large language models from scratch. It notes the increasing interest in this from businesses and organizations post-ChatGPT. It highlights considerations like when it might make sense to build vs using existing models, and breaks down the process into 4 key steps: data curation, model architecture, training at scale, and evaluation.

05:01

📚 Data curation for language models

The paragraph discusses data curation, noting this is the most time consuming but important step. It covers sourcing training data, highlighting common sources like the internet and public/private datasets. It also covers data diversity, showing how different models use different compositions of data, and data preparation like quality filtering, de-duplication, privacy redaction and tokenization.

10:03

🚧 Model architecture decisions

The paragraph provides an overview of model architecture decisions when building a language model. It focuses on Transformer models, explaining the encoder-decoder structure and detailing specific considerations like residual connections, normalization strategies, activation functions, positional encodings, and model size in relation to training data.

15:04

⚙️ Training language models at scale

The paragraph discusses training large language models, which requires leveraging computational tricks and techniques. It covers mixed precision training, 3D parallelism, zero redundancy optimization, and training stability techniques like checkpointing, weight decay, and gradient clipping. It also notes common hyperparameter choices.

20:05

📈 Evaluating language models

The paragraph focuses on evaluating trained language models using benchmarks like the open LLM leaderboard. It details strategies for multiple choice tasks using prompt templating. It also covers evaluation options for open-ended tasks like human evaluation, NLP metrics, and using auxiliary classifiers.

25:06

😃 What's next after training a model

The closing paragraph notes that a trained model is often just the starting point for building something practical. It highlights two directions - prompt engineering to use it as-is, or fine-tuning the model for a specific use case. It concludes by noting the pros and cons of these approaches.

Keywords

💡large language model

A large language model (LLM) is a machine learning model trained on vast amounts of text data to generate realistic human language. The video focuses on best practices for building an LLM from scratch. Examples of LLMs include GPT-3, Llama 2, and Falcon 180B.

💡training data

The quality of an LLM is highly dependent on its training data. LLMs require massive datasets of high-quality, diverse text data spanning billions of words. Important considerations around training data include web scraping, public datasets like Common Crawl, private datasets, and even using an LLM to generate data.

💡Transformer

Transformers are the state-of-the-art neural network architecture used for LLMs. They rely solely on attention mechanisms to translate text input to text output. The video discusses tradeoffs with encoder-only, decoder-only, and encoder-decoder Transformers when building an LLM architecture.

💡residual connections

Residual connections allow intermediate training values to bypass layers in a neural network, helping address degradation issues. The video illustrates how residual connections are incorporated across Transformer layers.

💡mixed precision training

A technique to use both 32-bit and 16-bit floating point numbers during LLM training, reducing memory usage. This and techniques like 3D parallelism and zero redundancy optimization enable training LLMs at their massive scale.

💡position embeddings

Position embeddings capture information about token positions within text sequences. The video contrasts fixed positional encodings used originally in Transformers versus more advanced relative positional encodings.
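As an illustration of the fixed positional encodings mentioned above, here is a minimal sketch of the sinusoidal scheme from the original Transformer paper, where even dimensions use a sine and odd dimensions use a cosine of the position scaled by a dimension-dependent wavelength. The function name is hypothetical, and real implementations compute this for all positions at once with array libraries.

```python
import math


def sinusoidal_position_encoding(position: int, d_model: int) -> list:
    """Return the fixed encoding vector for one token position."""
    encoding = []
    for i in range(d_model):
        pair_index = i // 2  # dimensions (2k, 2k+1) share one wavelength
        angle = position / (10000 ** (2 * pair_index / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


for pos in range(3):
    print(pos, [round(v, 3) for v in sinusoidal_position_encoding(pos, 4)])
```

Each position gets a distinct vector that is added to the token embedding, so the model can tell identical tokens at different positions apart.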

💡prompt engineering

One approach after building an LLM is prompt engineering: carefully designing input prompts to yield useful completions from the LLM for a particular application. It can be more lightweight than fine-tuning.

💡checkpointing

Taking periodic snapshots of model parameters during lengthy LLM training to enable resuming from that point if training becomes unstable. Helps address the complexity of training giant models.

💡model parallelism

Decomposing matrix multiplications required during neural network training across multiple GPUs. This and pipeline parallelism are key parallelization strategies to scale up LLM training.

💡model evaluation

After training an LLM, rigorous evaluation on benchmarks like those on the Hugging Face Open LLM Leaderboard is critical. The video discusses prompt engineering techniques and evaluation strategies tailored to different benchmark datasets.

Highlights

Building large language models was an esoteric and specialized activity reserved mainly for cutting-edge AI research, but today many businesses and enterprises have an interest in building them

BloombergGPT is a large language model specifically built to handle tasks in the space of finance

Building an LLM from scratch is often not necessary; using prompt engineering or fine-tuning an existing model is usually better suited

Back-of-the-napkin math shows training a 10 billion parameter model costs around $150,000 in compute and a 100 billion parameter model costs around $1.5 million

Data curation is the most important and time-consuming part of building an LLM

LLMs require large training sets, from half a trillion to 3.5 trillion tokens, which is on the order of a trillion words of text, or about a million novels or a billion news articles

Common Crawl, C4, Falcon RefinedWeb, and the Pile are popular publicly available training data sets for LLMs

Private data sources can provide a strategic advantage for business applications of LLMs

The Alpaca model was trained on structured text generated by GPT-3

Data diversity in training sets leads to general-purpose models that are good at a wide variety of tasks

Transformers have emerged as the state-of-the-art model architecture for LLMs due to their use of attention

Causal language modeling with a decoder-only architecture is the most popular choice for LLMs

Mixed precision, model/pipeline/data parallelism, and zero redundancy optimization are tricks to reduce LLM training time

Checkpointing, weight decay, and gradient clipping are important for training stability at scale

The Open LLM Leaderboard provides model evaluation on the ARC, HellaSwag, MMLU, and TruthfulQA benchmarks

Transcripts

00:00

Hey everyone, I'm Shaw, and this is the sixth video in the larger series on how to use large language models in practice. In this video I'm going to review key aspects and considerations for building a large language model from scratch. If you Googled this topic even just one year ago, you'd probably see something very different than we see today.

00:21

Building large language models was a very esoteric and specialized activity reserved mainly for cutting-edge AI research. But today, if you Google "how to build an LLM from scratch" or "should I build a large language model", you'll see a much different story. With all the excitement surrounding large language models post-ChatGPT, we now have an environment where a lot of businesses, enterprises, and other organizations have an interest in building these models. Perhaps one of the most notable examples comes from Bloomberg in BloombergGPT, which is a large language model that was specifically built to handle tasks in the space of finance.

01:03

However, the way I see it, building a large language model from scratch is often not necessary. For the vast majority of LLM use cases, using something like prompt engineering or fine-tuning an existing model is going to be much better suited than building a large language model from scratch. With that being said, it is valuable to better understand what it takes to build one of these models from scratch and when it might make sense to do it.

01:28

Before diving into the technical aspects of building a large language model, let's do some back-of-the-napkin math to get a sense of the financial costs that we're talking about here. Taking as a baseline Llama 2, the relatively recent large language model put out by Meta, these were the computational costs associated with the 7 billion parameter and 70 billion parameter versions of the model.

01:55

So you can see for Llama 2 7B it took about 180,000 GPU hours to train that model, while for 70B, a model 10 times as large, it required 10 times as much compute, so this required 1.7 million GPU hours. So if we just do what physicists love to do, we can just take orders of magnitude, and based on the Llama 2 numbers we'll say a 10 billion parameter model takes on the order of 100,000 GPU hours to train, while a 100 billion parameter model takes about a million GPU hours to train.

02:26

So how can we translate this into a dollar amount? Here we have two options. Option one is we can rent the GPUs and compute that we need to train our model via any of the big cloud providers out there. An Nvidia A100, which is what was used to train Llama 2, is going to be on the order of $1 to $2 per GPU per hour. So just doing some simple multiplication here, that means the 10 billion parameter model is going to be on the order of $150,000 just to train, and the 100 billion parameter model will be on the order of $1.5 million to train.

03:06

Alternatively, instead of renting the compute, you can always buy the hardware. In that case we just have to take into consideration the price of these GPUs. So let's say an A100 is about $10,000 and you want to form a GPU cluster, which is about 1,000 GPUs. The hardware costs alone are going to be on the order of $10 million. But that's not the only cost. When you're running a cluster like this for weeks, it consumes a tremendous amount of energy, and so you also have to take into account the energy cost. So let's say training a 100 billion parameter model consumes about 1,000 megawatt hours of energy, and let's just say the price of energy is about $100 per megawatt hour. Then that means the marginal cost of training a 100 billion parameter model is going to be on the order of $100,000.
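The back-of-the-napkin arithmetic above can be reproduced in a few lines. This is only a sketch of the estimate as described: the GPU-hour counts, the $1 to $2 rental rate, the 1,000-GPU cluster, and the energy figures are the round numbers quoted in the video, the A100 unit price of $10,000 is the figure consistent with the $10 million cluster total, and the variable and function names are hypothetical.

```python
# Order-of-magnitude GPU-hour estimates derived from the Llama 2 numbers.
GPU_HOURS = {"10B": 100_000, "100B": 1_000_000}

# Quoted cloud rental range for an A100, in USD per GPU per hour.
RENTAL_USD_PER_GPU_HOUR = (1.0, 2.0)


def rental_cost_range(gpu_hours):
    """Return the (low, high) rental cost in USD for a training run."""
    low_rate, high_rate = RENTAL_USD_PER_GPU_HOUR
    return gpu_hours * low_rate, gpu_hours * high_rate


def midpoint(cost_range):
    """Middle of a (low, high) range, used as the headline figure."""
    return sum(cost_range) / 2


for size, hours in GPU_HOURS.items():
    low, high = rental_cost_range(hours)
    print(f"{size}: ${low:,.0f} to ${high:,.0f}, "
          f"midpoint ${midpoint((low, high)):,.0f}")

# Option two: buy the hardware and pay for energy.
hardware_cost = 1_000 * 10_000  # 1,000 GPUs at about $10,000 each
energy_cost = 1_000 * 100       # 1,000 MWh at about $100 per MWh
print(f"hardware: ${hardware_cost:,}, energy per 100B run: ${energy_cost:,}")
```

The midpoints of the rental ranges give the $150,000 and $1.5 million figures, and the purchase option gives the $10 million hardware and $100,000 energy figures.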

03:55

Okay, so now that you've realized you probably won't be training a large language model anytime soon, or maybe you are, I don't know, let's dive into the technical aspects of building one of these models. I'm going to break the process down into four steps: one is data curation, two is the model architecture, three is training the model at scale, and four is evaluating the model.

04:15

Okay, so starting with data curation. I would assert that this is the most important and perhaps most time-consuming part of the process. This comes from the basic principle of machine learning of garbage in, garbage out. Put another way, the quality of your model is driven by the quality of your data, so it's super important that you get the training data right, especially if you're going to be investing millions of dollars in this model.

04:42

But this presents a problem: large language models require large training data sets. Just to get a sense of this, GPT-3 was trained on half a trillion tokens, Llama 2 was trained on two trillion tokens, and the more recent Falcon 180B was trained on 3.5 trillion tokens. If you're not familiar with tokens, you can check out the previous video in the series where I talk more about what tokens are and why they're important. But here we can say that as far as training data go, we're talking about a trillion words of text, or in other words about a million novels or a billion news articles. So we're talking about a tremendous amount of data. Going through a trillion words of text and ensuring data quality is a tremendous effort and undertaking.

05:29

And so a natural question is: where do we even get all this text? The most common place is the internet. The internet consists of web pages, Wikipedia, forums, books, scientific articles, code bases, you name it. Post-ChatGPT there's a lot more controversy around this and copyright laws. The risk with web scraping yourself is that you might grab data that you're not supposed to grab or don't have the rights to grab, and then using it in a model for potentially commercial use could come back and cause some trouble down the line.

06:00

Alternatively, there are many public data sets out there. One of the most popular is Common Crawl, which is a huge corpus of text from the internet, and then there are some more refined versions such as the Colossal Clean Crawled Corpus, also called C4. There's also Falcon RefinedWeb, which was used to train Falcon 180B, mentioned on the previous slide. Another popular data set is the Pile, which tries to bring together a wide variety of diverse data sources into the training data set, which we'll talk a bit more about in the next slide. And then we have Hugging Face, which has really emerged as a big player in the generative AI and large language model space, and which houses a ton of open-access data sources on its platform.

06:46

Another place is private data sources. A great example of this is FinPile, which was used to train BloombergGPT. The key upside of private data sources is that you own the rights to the data and it's data that no one else has, which can give you a strategic advantage if you're trying to build a model for some business application, or for some other application where there's some competition or environment of other players that are also making their own large language models.

07:15

Finally, and perhaps the most interesting, is using an LLM to generate the training data. A notable example of this comes from the Alpaca model put out by researchers at Stanford. What they did was train an LLM, Alpaca, using structured text generated by GPT-3. This is my cartoon version of it: you pass the prompt "make me training data" into your large language model and it spits out the training data for you.

07:43

Turning to the point of data set diversity that I mentioned briefly with the Pile: one aspect of a good training data set seems to be data set diversity. The idea here is that a diverse data set translates to a model that can perform well in a wide variety of tasks. Essentially, it translates into a good general-purpose model. Here I've listed out a few different models and the composition of their training data sets. So you can see GPT-3 is mainly web pages but also some books. You see Gopher is also mainly web pages, but they got more books, and then they also have some code in there. Llama is mainly web pages, but they also have books, code, and scientific articles. And then PaLM is mainly built on conversational data, but then you see it's trained on web pages, books, and code.

08:37

How you curate your training data set is going to drive the types of tasks the large language model will be good at. We're far away from an exact science or theory of "this particular data set composition translates to this type of model" or "adding an additional 3% code in your training data set will have this quantifiable outcome in the downstream model". But while we're far away from that, diversity does seem to be an important consideration when making your training data sets.

09:04

Another thing that's important to ask ourselves is: how do we prepare the data? Again, the quality of our model is driven by the quality of our data, so one needs to be thoughtful with the text that they use to generate a large language model. Here I'm going to talk about four key data preparation steps.

09:22

The first is quality filtering. This is removing text which is not helpful to the large language model. This could be just a bunch of random gibberish from some corner of the internet. This could be toxic language or hate speech found on some forum. This could be things that are objectively false, like 2 + 2 = 5, which you'll see in the book 1984; while that text exists out there, it is not a true statement.

09:47

There's a really nice paper, it's called "A Survey of Large Language Models" I think, and in that paper they distinguish two types of quality filtering. The first is classifier-based. This is where you take a small high-quality data set and use it to train a text classification model that allows you to automatically score text as either good or bad, low quality or high quality. That precludes the need for a human to read a trillion words of text to assess its quality; it can be offloaded to this classifier.

10:20

The other type of approach they define is heuristic-based. This is using various rules of thumb to filter the text. This could be removing specific words like explicit text, this could be if a word repeats more than two times in a sentence you remove it, or using various statistical properties of the text to do the filtering. And of course you can do a combination of the two: you can use the classifier-based method to distill down your data set and then on top of that do some heuristics, or vice versa, you can use heuristics to distill down the data set and then apply your classifier. There's no one-size-fits-all recipe for doing quality filtering; rather, there's a menu of many different options and approaches that one can take.
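To make the heuristic-based approach concrete, here is a minimal sketch that applies the three kinds of rules mentioned above: a word blocklist, a repeated-word check per sentence, and a simple statistical property. Everything specific is an assumption for illustration: the blocklist entry is a placeholder, the 0.6 threshold is arbitrary, and the function names are hypothetical. A rule as crude as the repeated-word check would also reject plenty of legitimate text, so real pipelines tune such rules carefully.

```python
import re
from collections import Counter

BLOCKLIST = {"badword"}    # hypothetical placeholder for explicit terms
MIN_ALPHA_FRACTION = 0.6   # hypothetical threshold, chosen for illustration


def has_repeated_word(sentence, max_repeats=2):
    """True if any word appears more than max_repeats times in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return any(count > max_repeats for count in Counter(words).values())


def passes_heuristics(text):
    """Return True if the text survives all three rule-of-thumb filters."""
    if not text:
        return False
    # Rule 1: drop text containing blocklisted words.
    if set(re.findall(r"[a-z']+", text.lower())) & BLOCKLIST:
        return False
    # Rule 2: drop text where a word repeats too often within one sentence.
    if any(has_repeated_word(s) for s in re.split(r"[.!?]+", text)):
        return False
    # Rule 3: a statistical property, the share of letters and spaces.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= MIN_ALPHA_FRACTION


samples = [
    "The quick brown fox jumps over a lazy dog.",
    "spam spam spam is all you get.",
    "$$$ 1234 #### 5678 @@@@",
]
for sample in samples:
    print(passes_heuristics(sample), repr(sample))
```

A classifier-based filter could be chained before or after a function like this, as described above.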

11:05

Next is deduplication. This is removing several instances of the same or very similar text. The reason this is important is that duplicate texts can bias the model and disrupt training. Namely, if you have some web page that exists on two different domains, and one copy ends up in the training data set and one ends up in the testing data set, this causes some trouble trying to get a fair assessment of model performance during training.
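A minimal sketch of the simplest form of deduplication follows: exact matching after light normalization, using a hash of each document. It only catches copies that differ in case or whitespace; detecting "very similar" text needs fuzzier techniques that are not shown here. The function names are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Keep the first occurrence of each document, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


pages = ["Hello  world", "hello world", "Something else"]
print(deduplicate(pages))
```

Running the same check across the training and testing splits together helps avoid the leakage problem described above.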

11:30

Another key step is privacy redaction. Especially for text grabbed from the internet, it might include sensitive or confidential information. It's important to remove this text because if sensitive information makes its way into the training data set, it could be inadvertently learned by the language model and be exposed in unexpected ways.
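As a rough illustration of privacy redaction, the sketch below replaces email addresses and one phone-number format with placeholder tags using regular expressions. The patterns are deliberately simple assumptions and will miss many real cases (names, addresses, other phone formats); the placeholder tags and function name are hypothetical.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or 555-123-4567."))
```

Whether to redact in place, as here, or to drop the whole document is a design choice that depends on how sensitive the source is.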

play11:49

finally we have the tokenization step

play11:52

which is essentially translating text

play11:54

into numbers and the reason this is

play11:56

important is because neural networks do

play11:59

not understand text directly they

play12:01

understand numbers so anytime you feed

play12:03

something into a neural network it needs

play12:06

to come in numerical form while there

play12:08

are many ways to do this mapping one of

play12:10

the most popular ways is via the bite

play12:12

pair encoding algorithm which

play12:14

essentially takes a corpus of text and

play12:17

deres from it an efficient subword

play12:20

vocabulary it figures out the best

play12:22

choice of subwords or character

play12:25

sequences to define a vocabulary from

play12:29

which the entire Corpus can be

play12:32

represented for example maybe the word

play12:34

efficient gets mapped to a integer and

play12:37

exists in the vocabulary maybe sub with

play12:40

a dash gets mapped to its own integer

play12:42

word gets mapped to its own integer

play12:44

vocab gets mapped to its own integer and

play12:46

UL gets mapped to its own integer so

play12:49

this string of text here efficient

play12:51

subword vocabulary might be translated

play12:53

into five tokens each with their own

play12:56

numerical representation so one two

play12:59

three four five there are python

play13:01

libraries out there that implement this

play13:03

algorithm so you don't have to do it

play13:05

from scratch namely there's the sentence

play13:07

piece python Library there's also the

play13:10

tokenizer library coming from hugging

play13:12

face here the citation numbers and I

play13:14

provide the link in the description and

play13:15

comment section below moving on to step

play13:18

two model architecture so in this step

play13:20

we need to define the architecture of

play13:24

the language model and as far as large

play13:26

language models go Transformers have

play13:28

emerged merged as the state-of-the-art

play13:30

architecture and a Transformer is a

play13:32

neural network architecture that

play13:34

strictly uses attention mechanisms to

play13:37

map inputs to outputs so you might ask

play13:39

what is an attention mechanism and here

play13:41

I Define it as something that learns

play13:43

dependencies between different elements

play13:46

of a sequence based on position and

play13:49

content this is based on the intuition

play13:50

that when you're talking about language

play13:52

the context matters and so let's look at

play13:54

a couple examples so if we see the

play13:57

sentence I hit the base baseball with a

play13:59

bat the appearance of baseball implies

play14:03

that bat is probably a baseball bat and

play14:06

not a nocturnal mammal this is the

play14:09

picture that we have in our minds this

play14:11

is an example of the content of the

play14:14

context of the word bat so bat exists in

play14:17

this larger context of this sentence and

play14:20

the content is the words making up this

play14:23

context the the content of the context

play14:25

drives what word is going to come next

play14:28

and the meaning of this word here but

play14:31

content isn't enough the positioning of

play14:33

these words is also important so to see

play14:36

that consider another example I hit the

play14:39

bat with a baseball now there's a bit

play14:42

more ambiguity of what bat means it

play14:45

could still mean a baseball bat but

play14:48

people don't really hit baseball bats

play14:50

with baseballs they hit baseballs with

play14:52

baseball bats one might reasonably think

play14:54

bad here means the nocturnal mammal and

play14:57

so an attention mechanism captures both

play15:00

these aspects of language more

play15:02

specifically it will use both the

play15:04

content of the sequence and the

play15:06

positions of each element in the

play15:08

sequence to help infer what the next

play15:12

word should be well at first it might

play15:14

seem that Transformers are a constrained

play15:17

in particular architecture we actually

play15:19

have an incredible amount of freedom and

play15:22

choices we can make as developers making

play15:24

a Transformer model so at a high level

play15:27

there are actually three types of

play15:28

Transformers which follows from the two

play15:31

modules that exist in the Transformer

play15:33

architecture namely we have the encoder

play15:36

and decoder so we can have an encoder by

play15:39

itself that can be the architecture we

play15:41

can have a decoder by itself that's

play15:43

another architecture and then we can

play15:45

have the encoder and decoder working

play15:48

together and that's the third type of

play15:49

Transformer so let's take a look at

play15:51

these One By One The encoder only

play15:54

Transformer translates tokens into a

play15:57

semantically mean meaningful

play15:59

representation and these are typically

play16:01

good for Tech classification tasks or if

play16:04

you're just trying to generate a

play16:06

embedding for some text next we have the

play16:09

decoder only Transformer which is

play16:11

similar to an encoder because it

play16:13

translates text into a semantically

play16:16

meaningful internal representation but

play16:18

decoders are trying to predict the next

play16:20

word they're trying to predict future

play16:22

tokens and for this decoders do not

play16:25

allow self attention with future

play16:27

elements which makes it great for text

play16:30

generation tasks and so just to get a

play16:32

bit more intuition of the difference

play16:34

between the encoder self attention

play16:36

mechanism and the decoder self attention

play16:39

mechanism the encoder any part of the

play16:41

sequence can interact with any other

play16:44

part of the sequence if we were to zoom

play16:46

in on the weight matrices that are

play16:49

generating these internal

play16:50

representations in the encoder you'll

play16:52

see that none of the weights are zero on

play16:55

the other hand for a decoder it uses

play16:57

so-called masked self attention so any

play17:01

weights that would connect a token to a

play17:04

token in the future is going to be set

play17:06

to zero it doesn't make sense for the

play17:08

decoder to see into the future if it's

play17:10

trying to predict the future that would

play17:12

kind of be like cheating and then

play17:13

finally we can combine the encoder and

play17:15

decoder together to create another

play17:17

choice of model architecture this was

play17:19

actually the original design of the

play17:22

Transformer model kind of what's

play17:24

depicted here and so what you can do

play17:26

with the encoder decoder model that you

play17:28

can't do with the others is the

play17:30

so-called cross attention so instead of

play17:32

just being restricted to self attention

play17:34

with the encoder or masked self attention

play17:36

with the decoder the encoder decoder

play17:38

model allows for cross attention where

play17:41

the embeddings from the encoder so this

play17:44

will generate a sequence and the

play17:46

internal embeddings of the decoder which

play17:48

will be another sequence will have this

play17:50

attention weight Matrix so that the

play17:53

encoder's representations can communicate

play17:55

with the decoder representations and

play17:58

this tends to be good for tasks such as

play18:00

translation which was the original

play18:02

application of the Transformer model
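To make the difference between encoder self attention and the decoder's masked self attention concrete, here is a minimal sketch in plain Python. The function names, the toy two-dimensional token vectors, and the use of the same vectors as both queries and keys are assumptions for illustration only; a real model would use learned projections.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, causal=False):
    """Return a seq_len x seq_len matrix of attention weights.

    With causal=True, token i gets zero weight on every token j > i,
    which is the masked self attention used by decoders.
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(-math.inf)  # masked: becomes weight 0 after softmax
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights.append(softmax(scores))
    return weights

# Hypothetical toy sequence of three 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_style = attention_weights(tokens, tokens, causal=False)
decoder_style = attention_weights(tokens, tokens, causal=True)
```

In `encoder_style` every entry is positive, while in `decoder_style` everything above the diagonal is exactly zero, so no token can look at a later one.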

play18:04

while we do have three options to choose

play18:07

from when it comes to making a

play18:08

Transformer the most popular by far is

play18:11

this decoder only architecture where

play18:14

you're only using this part of the

play18:17

Transformer to do the language modeling

play18:19

and this is also called causal language

play18:21

modeling which basically means given a

play18:23

sequence of text you want to predict

play18:25

future text beyond just this high-level

play18:27

choice of model architecture there are

play18:30

actually a lot of other design choices

play18:32

and details that one needs to take into

play18:34

consideration first is the use of

play18:37

residual connections which are just

play18:38

Connections in your model architecture

play18:40

that allow intermediate training values

play18:42

to bypass various hidden layers and so

play18:45

to make this more concrete this is from

play18:47

reference number 18 linked in the

play18:49

description and comment section below

play18:51

what this looks like is you have some

play18:53

input and instead of strictly feeding

play18:55

the input into your hidden layer which

play18:57

is this stack of things here you allow

play19:00

it to go to both the hidden layer and to

play19:02

bypass the hidden layer then you can

play19:04

aggregate the original input and the

play19:06

output of the Hidden layer in some way

play19:08

to generate the input for the next layer
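A minimal sketch of that idea in plain Python, assuming the simplest way of aggregating (adding the input to the hidden layer's output). The toy hidden layer (a linear map followed by ReLU) and all names here are hypothetical, chosen only to show the bypass path.

```python
def hidden_layer(x, weights, bias):
    # Toy hidden layer: linear map followed by ReLU.
    out = []
    for row, b in zip(weights, bias):
        out.append(max(0.0, sum(w * v for w, v in zip(row, x)) + b))
    return out

def residual_block(x, weights, bias):
    # The input goes through the hidden layer AND bypasses it;
    # the two paths are aggregated here by simple addition.
    h = hidden_layer(x, weights, bias)
    return [xi + hi for xi, hi in zip(x, h)]

x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
next_input = residual_block(x, identity, [0.0, 0.0])
```

Because of the bypass, the block passes its input through unchanged whenever the hidden layer outputs zeros, which is one reason these connections make deep stacks easier to train.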

play19:11

and of course there are many different

play19:13

ways one can do this with all the

play19:15

different details that can go into a

play19:17

hidden layer you can have the input and

play19:20

the output of the Hidden layer be added

play19:21

together and then have an activation

play19:23

applied to the addition you can have the

play19:26

input and the output of the Hidden layer

play19:28

be added and then you can do some kind

play19:30

of normalization and then you can add

play19:32

the activation or you can have the

play19:34

original input and the output of the

play19:35

Hidden layer just be added together you

play19:37

really have a tremendous amount of

play19:39

flexibility and design Choice when it

play19:42

comes to these residual Connections in

play19:44

the original Transformers architecture

play19:46

the way they did it was something

play19:47

similar to this where the input bypasses

play19:51

this multiheaded attention layer and is

play19:54

added and normalized with the output of

play19:57

this multi-headed attention layer and then the

play19:59

same thing happens for this layer same

play20:01

thing happens for this layer same thing

play20:03

happens for this layer and same thing

play20:04

happens for this layer next is layer

play20:07

normalization which is rescaling values

play20:10

between layers based on their mean and

play20:12

standard deviation and so when it comes

play20:14

to layer normalization there are two

play20:16

considerations that we can make one is

play20:19

where you normalize so there are

play20:21

generally two options here you can

play20:23

normalize before the layer also called

play20:25

pre-layer normalization or you can

play20:27

normalize after the layer also called

play20:30

post layer normalization another

play20:32

consideration is how you normalize one

play20:34

of the most common ways is via layer

play20:36

norm and this is the equation here this

play20:39

is your input X you subtract the mean of

play20:41

the input and then you divide it by the

play20:44

square root of the variance plus a small constant then you

play20:46

multiply it by some gain factor and then

play20:48

you can have some bias term as well an

play20:51

alternative to this is the root mean

play20:53

Square Norm or RMS Norm which is very

play20:56

similar it just doesn't have the mean

play20:58

term in the numerator and then it

play21:00

replaces this denominator with just the

play21:02

RMS while you have a few different

play21:05

options on how you do layer

play21:06

normalization the most common based on

play21:08

that survey of large language models I

play21:10

mentioned earlier reference number eight

play21:12

pre-layer normalization seems to be most

play21:14

common combined with this vanilla layer

play21:17

Norm approach next we have activation

play21:20

functions and these are non-linear

play21:22

functions that we can include in the

play21:24

model which in principle allow it to

play21:27

capture complex mappings between inputs

play21:30

and outputs here there are several

play21:31

common choices for large language models

play21:33

namely GeLU ReLU Swish SwiGLU GeGLU

play21:39

and I'm sure there are more but GeLUs

play21:41

seem to be the most common for large

play21:43

language models another design choice is

play21:45

how we do position embeddings position

play21:48

embeddings capture information about

play21:50

token positions the way that this was

play21:53

done in the original Transformers paper

play21:55

was using these sine and cosine basis

play21:58

functions which added a unique value to

play22:00

each token position to represent its

play22:04

position and you can see in the original

play22:06

Transformers architecture you had your

play22:08

tokenized input and the positional

play22:11

encodings were just added to the

play22:13

tokenized input for both the encoder

play22:15

input and the decoder input more

play22:18

recently there's this idea of relative

play22:20

positional encodings so instead of just

play22:22

adding some fixed positional encoding

play22:26

before the input is passed into the

play22:28

model the idea with relative positional

play22:30

encodings is to bake positional

play22:32

encodings into the attention mechanism
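For reference, the fixed sine and cosine encodings from the original Transformer paper can be sketched in a few lines of plain Python. The function name and the tiny sizes in the usage line are assumptions for illustration; the base of 10000 is the constant used in that paper.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Return a seq_len x d_model table of fixed positional encodings.

    Even columns hold sin(pos / 10000^(2i/d_model)) and odd columns the
    matching cosine, so every position gets a distinct vector.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos columns")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table

# Hypothetical sizes: 5 positions, 4-dimensional embeddings.
encodings = sinusoidal_positions(5, 4)
```

Each row of `encodings` would be added element-wise to the embedding of the token at that position before the first layer.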

play22:35

and so I won't dive into the details of

play22:37

that here but I will provide this

play22:39

reference self attention with relative

play22:41

position representations also citation

play22:44

number 20 the last consideration that

play22:46

I'll talk about when it comes to model

play22:48

architecture is how big do I make it and

play22:51

the reason this is important is because

play22:53

if a model is too big or trained too long

play22:55

it can overfit on the other hand if a

play22:58

model is too small or not trained long

play23:00

enough it can underperform and these are

play23:03

both in the context of the training data

play23:05

and so there's this relationship between

play23:08

the number of parameters the number of

play23:10

computations or training time and the

play23:12

size of the training data set there's a

play23:15

nice paper by Hoffmann et al. where they

play23:17

do an analysis of optimal compute

play23:20

considerations when it comes to large

play23:22

language models I've just grabbed a

play23:24

table from that paper that summarizes

play23:26

their key findings what this is saying

play23:28

is that a 400 million parameter model

play23:31

should undergo on the order of let's say

play23:34

like 2 times 10 to the 19 floating point

play23:36

operations and have a training data

play23:38

consisting of 8 billion tokens and then

play23:41

a model with 1 billion parameters should

play23:44

have 10 times as many floating Point

play23:46

operations and be trained on 20 billion

play23:49

tokens and so on and so forth my

play23:51

kind of summarization takeaway from this

play23:54

is that you should have about 20 tokens

play23:57

per model parameter it's not going

play23:59

to be very precise but might be a good

play24:00

rule of thumb and then we have for every

play24:02

10x increase in model parameters there's

play24:05

about a 100x increase in floating Point

play24:08

operations so if you're curious about

play24:09

this check out the paper linked in the

play24:11

description below even if this isn't an

play24:13

optimal approach in all cases it may be

play24:16

a good starting place and rule of thumb

play24:18

for training these models so now we come

play24:20

to step three which is training these

play24:22

models at scale so again the central

play24:25

challenge of these large language models

play24:27

is their scale when you're training

play24:29

on trillions of tokens and you're

play24:31

talking about billions tens of billions

play24:34

hundreds of billions of parameters

play24:35

there's a lot of computational cost

play24:38

associated with these things and it is

play24:39

basically impossible to train one of

play24:42

these models without employing some

play24:45

computational tricks and techniques to

play24:47

speed up the training process here I'm

play24:49

going to talk about three popular

play24:50

training techniques the first is mixed

play24:53

Precision training which is essentially

play24:55

when you use both 32-bit and 16-bit

play24:58

floating Point numbers during model

play25:00

training such that you use the 16bit

play25:03

floating Point numbers whenever possible

play25:05

and 32bit numbers only when you have to

play25:08

more on mixed Precision training in that

play25:11

survey of large language models and then

play25:12

there's also a nice documentation by

play25:15

Nvidia linked below next is this

play25:17

approach of 3D parallelism which is

play25:19

actually the combination of three

play25:21

different parallelization strategies

play25:24

which are all listed here and I'll just

play25:26

go through them one by one first is

play25:28

pipeline parallelism which is

play25:30

distributing the Transformer layers

play25:32

across multiple gpus and it actually

play25:35

does an additional optimization where it

play25:37

puts adjacent layers on the same GPU to

play25:40

reduce the amount of cross GPU

play25:43

communication that has to take place the

play25:45

next is model parallelism which

play25:47

basically decomposes the matrix

play25:49

multiplications that make up the model

play25:51

into smaller Matrix multiplies and then

play25:54

distributes those Matrix multiplies

play25:56

across multiple GPUs and then

play25:57

finally there's data parallelism which

play26:00

distributes training data across

play26:02

multiple gpus but one of the challenges

play26:05

with parallelization is that

play26:07

redundancies start to emerge because

play26:09

model parameters and Optimizer States

play26:11

need to be copied across multiple gpus

play26:14

so you're having some portion of the

play26:16

GPU's precious memory devoted to storing

play26:20

information that's copied in multiple

play26:21

places this is where Zero Redundancy

play26:24

Optimizer or ZeRO is helpful which

play26:26

essentially reduces data redundancy

play26:28

by partitioning the optimizer state the

play26:30

gradients and the parameters across GPUs and

play26:32

so this was just like a surface level

play26:34

survey of these three training

play26:35

techniques these techniques and many

play26:38

more are implemented by the DeepSpeed

play26:41

Python library and of course DeepSpeed

play26:43

isn't the only Library out there there

play26:45

are a few other ones such as Colossal-AI

play26:47

Alpa and some more which I talk about in

play26:49

the blog associated with this video
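To get a feel for why the redundancy matters, here is a rough back-of-the-envelope sketch in Python. It assumes the per-parameter accounting usually quoted for mixed precision training with Adam (2 bytes for 16-bit weights, 2 bytes for 16-bit gradients, 12 bytes for 32-bit optimizer state) and ignores activations and temporary buffers entirely. The function name and the example model and cluster sizes are hypothetical.

```python
def model_state_bytes_per_gpu(n_params, n_gpus, zero_stage=0):
    """Approximate bytes of model state held on each GPU.

    zero_stage 0: everything replicated on every GPU (plain data parallelism)
    zero_stage 1: optimizer state partitioned across GPUs
    zero_stage 2: optimizer state and gradients partitioned
    zero_stage 3: optimizer state, gradients and parameters partitioned
    Activations and temporary buffers are NOT counted (assumption).
    """
    fp16_params = 2 * n_params
    fp16_grads = 2 * n_params
    optimizer_state = 12 * n_params  # fp32 weights copy + Adam momentum + variance
    if zero_stage >= 1:
        optimizer_state /= n_gpus
    if zero_stage >= 2:
        fp16_grads /= n_gpus
    if zero_stage >= 3:
        fp16_params /= n_gpus
    return fp16_params + fp16_grads + optimizer_state

# Hypothetical setup: a 7.5 billion parameter model on 64 GPUs.
for stage in range(4):
    gigabytes = model_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"ZeRO stage {stage}: about {gigabytes:.1f} GB of model state per GPU")
```

Under these assumptions the replicated baseline needs far more memory per GPU than any single accelerator offers, while full partitioning brings it down by a factor equal to the number of GPUs.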

play26:52

another consideration when training

play26:53

these massive models is training

play26:55

stability and it turns out there are a

play26:57

few things that we can do to help ensure

play27:00

that the training process goes smoothly

play27:02

the first is checkpointing which takes a

play27:05

snapshot of model artifacts so training

play27:07

can resume from that point this is

play27:09

helpful because let's say your

play27:11

training loss is going down it's great

play27:13

but then you just have this spike in

play27:14

loss after training for a week and it

play27:18

just blows up training and you don't

play27:19

know what happened checkpointing allows

play27:21

you to go back to when everything was

play27:23

okay and debug what could have gone

play27:26

wrong and maybe make some adjustments to

play27:27

the learning rate or other

play27:29

hyperparameters so that you can try to

play27:31

avoid that spike in the loss function

play27:33

that came up later another strategy is

play27:35

weight Decay which is essentially a

play27:36

regularization strategy that penalizes

play27:39

large parameter values I've seen two

play27:41

ways of doing this one is either by

play27:43

adding a term to the objective function

play27:45

which is like regular regularization

play27:47

or changing the

play27:50

parameter update rule and then finally

play27:52

we have gradient clipping which rescales

play27:55

the gradient of the objective function

play27:57

if it exceeds a pre-specified value so

play28:00

this helps avoid the exploding gradient

play28:02

problem which may blow up your training

play28:05

process and then the last thing I want

play28:06

to talk about when it comes to training

play28:07

are hyperparameters while these aren't

play28:09

specific to large language models my

play28:11

goal here is to just lay out some common

play28:13

choices when it comes to these values so

play28:15

first we have batch size which can be

play28:18

either static or dynamic and if it's

play28:20

static batch sizes are usually pretty

play28:22

big so on the order of like 16 million

play28:24

tokens but it can also be dynamic for

play28:26

example in GPT-3 what they did is they

play28:28

gradually increased the batch size from

play28:31

32,000 tokens to 3.2 million tokens next

play28:34

we have the learning rate and so this

play28:37

can also be static or dynamic but it

play28:39

seems that Dynamic learning rates are

play28:41

much more common for these models a

play28:43

common strategy seems to go as follows

play28:45

you have a learning rate that increases

play28:48

linearly until reaching some specified

play28:51

maximum value and then it'll reduce via

play28:54

a cosine Decay until the learning rate

play28:56

is about 10% of its max value next we

play28:59

have the optimizer Adam or Adam-based

play29:02

optimizers are most commonly used for

play29:04

large language models and then finally

play29:05

we have Dropout typical values for

play29:07

Dropout are between 0.2 and 0.5 from the

play29:11

original Dropout paper by Hinton et al.
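Two of the training details above, the warmup-then-cosine learning rate schedule and gradient clipping, are easy to sketch in plain Python. This is only an illustration under assumptions: the function names, the step counts, the peak learning rate and the clipping threshold are all hypothetical, and real training code would normally rely on the equivalents built into a deep learning framework.

```python
import math

def learning_rate(step, max_lr, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay down to min_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = min_ratio * max_lr
    return min_lr + (max_lr - min_lr) * cosine

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

# Hypothetical settings for illustration only.
peak = 6e-4
schedule = [learning_rate(s, peak, warmup_steps=2000, total_steps=100000)
            for s in (0, 1999, 50000, 100000)]
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

The schedule starts near zero, reaches the peak at the end of warmup, and ends at 10% of the peak, while `clip_by_norm` keeps the gradient's direction and only shrinks its length.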

play29:14

finally step four is model evaluation so

play29:17

just cuz you've trained your model and

play29:18

you've spent millions of dollars and

play29:20

weeks of your time if not more it's

play29:22

still not over typically when you have a

play29:24

model in hand that's really just the

play29:26

starting place in many ways next you got

play29:28

to see what this thing actually does how

play29:31

it works in the context of the desired

play29:34

use case the desired application of it

play29:36

this is where model evaluation becomes

play29:38

important for this there are many

play29:40

Benchmark data sets out there here I'm

play29:42

going to restrict the discussion to the

play29:44

Open LLM Leaderboard which is a public

play29:47

LLM benchmark that is continually

play29:50

updated with new models on Hugging Face's

play29:53

models platform and the four benchmarks

play29:55

that are used in the Open LLM

play29:57

leaderboard are ARC HellaSwag MMLU and

play30:02

TruthfulQA while these are only four of

play30:06

many possible Benchmark data sets the

play30:08

evaluation strategies that we can use

play30:10

for these Benchmark data sets can easily

play30:13

port to other benchmarks so first I want

play30:15

to start with just ARC HellaSwag and MMLU

play30:18

which are multiple choice tasks so a

play30:21

bit more about these ARC and MMLU are

play30:24

essentially grade school questions on

play30:26

subjects like math history common

play30:29

knowledge you know whatever and it'll be

play30:30

like a question with a multiple choice

play30:33

response A B C or D so an example is

play30:35

which technology was developed most

play30:37

recently A a cell phone B a microwave C

play30:41

a refrigerator and D an airplane HellaSwag

play30:44

is a little bit different these are

play30:46

specifically questions that computers

play30:48

tend to struggle with so an example of

play30:50

this is in the blog associated with this

play30:53

video which goes like this a woman is

play30:56

outside with a bucket and a dog the

play30:58

dog is running around trying to avoid a

play31:00

bath she dot dot dot a rinses the bucket

play31:03

off with soap and blow dries the dog's

play31:06

head B uses a hose to keep it from

play31:08

getting soapy C gets the dog wet then it

play31:11

runs away again D gets into a bathtub

play31:14

with a dog and so this is a very strange

play31:17

question but intuitively humans tend to

play31:19

do very well on these tasks and

play31:21

computers do not so while these are

play31:23

multiple choice tasks and we might think

play31:26

it should be pretty straightforward to

play31:27

evaluate model performance on them there

play31:30

is one hiccup namely these large

play31:32

language models are typically text

play31:34

generation models so they'll take some

play31:36

input text and they'll output more text

play31:39

they're not classifiers they don't

play31:40

generate responses like A B C or D or

play31:44

class one class 2 class 3 class 4 they

play31:47

just generate text completions and so

play31:49

you have to do a little trick to get

play31:51

these large language models to perform

play31:53

multiple choice tasks and this is

play31:56

essentially through prompt templates for

play31:58

example if you have the question which

play32:00

technology was developed most recently

play32:02

instead of just passing in this question

play32:04

and the choices to the large language

play32:06

model and hopefully it figures out to do

play32:09

A B C or D you can use a prompt template

play32:12

like this and additionally prepend the

play32:15

prompt template with a few shot examples

play32:18

so the language model will pick up that

play32:20

I should return just a single token that

play32:24

is one of these four tokens here so if

play32:26

you pass this into to the model you'll

play32:27

get a distribution of probabilities for

play32:30

each possible token and what you can do

play32:32

then is just evaluate of all the tens of

play32:36

thousands of tokens that are possible

play32:39

you just pick the four tokens associated

play32:41

with a B C or D and see which one is

play32:44

most likely and you take that to be the

play32:46

predicted answer from the large language

play32:48

model while there is this like extra

play32:50

step of creating a prompt template you

play32:52

can still evaluate a large language

play32:54

model on these multiple choice tasks and

play32:57

in a relatively straightforward way
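The trick described above can be sketched in plain Python. Everything here is an assumption for illustration: the prompt template wording is made up rather than the one shown in the video, and `next_token_logprobs` stands in for whatever next-token distribution your model returns, represented as a dictionary from token string to log probability.

```python
import math

# Hypothetical template; the exact wording used by any given benchmark may differ.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "A. {a}\nB. {b}\nC. {c}\nD. {d}\n"
    "Answer:"
)

def build_prompt(question, choices, few_shot_examples=""):
    """Fill in the template and prepend optional few-shot examples."""
    body = PROMPT_TEMPLATE.format(
        question=question, a=choices[0], b=choices[1], c=choices[2], d=choices[3]
    )
    return few_shot_examples + body

def predict_choice(next_token_logprobs, labels=("A", "B", "C", "D")):
    """Ignore the rest of the vocabulary and pick the most likely answer label."""
    return max(labels, key=lambda t: next_token_logprobs.get(t, -math.inf))

prompt = build_prompt(
    "Which technology was developed most recently?",
    ["a cell phone", "a microwave", "a refrigerator", "an airplane"],
)

# Toy stand-in for a model's next-token distribution (made-up numbers).
toy_logprobs = {"the": -0.1, "A": -0.4, "B": -2.0, "C": -3.0, "D": -1.5}
answer = predict_choice(toy_logprobs)
```

Note that a token outside the four labels can be the most likely overall (here "the"); restricting the comparison to the label tokens is what turns a text generator into a multiple choice classifier.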

play32:58

however this is a bit more tricky when

play33:00

you have open-ended tasks such as

play33:03

TruthfulQA for TruthfulQA or other

play33:06

open-ended tasks where there isn't a

play33:08

specific one right answer but rather a

play33:12

wide range of possible right answers

play33:14

there are a few different evaluation

play33:16

strategies we can take the first is

play33:18

human evaluation so a person scores the

play33:21

completion based on some ground truth

play33:23

some guidelines or both while this is

play33:25

the most labor intensive this may

play33:28

provide the highest quality assessment

play33:29

of model completions another strategy is

play33:32

we could use NLP metrics so this is

play33:34

trying to quantify the completion

play33:36

quality using metrics such as perplexity

play33:39

BLEU score ROUGE score etc so just using

play33:42

the statistical properties of the

play33:44

completion as a way to quantify its

play33:47

quality while this is a lot less labor

play33:49

intensive it's not always clear what the

play33:51

mapping between a completion's

play33:53

statistical properties is to the quality

play33:56

of that completion and then the

play33:58

third approach which might capture the

play34:00

best of both worlds is to use an

play34:02

auxiliary fine-tuned model to rate the

play34:05

quality of the completions and this was

play34:08

actually used in the TruthfulQA paper

play34:11

should be reference 30 where they

play34:14

created an auxiliary model called

play34:17

GPT-judge which would take model completions

play34:20

and classify them as either truthful or

play34:23

not truthful and then that would help

play34:25

reduce the burden of human evaluation

play34:28

when evaluating model outputs okay so

play34:31

what's next so you've created your large

play34:33

language model from scratch what do you

play34:35

do next often this isn't the end of the

play34:38

story as the name base models might

play34:40

suggest base models are typically a

play34:42

starting point not the final solution

play34:45

they are really just a starting place

play34:47

for you to build something more

play34:49

practical on top of and there are

play34:51

generally two directions here one is via

play34:54

prompt engineering and prompt

play34:55

engineering is just feeding things into

play34:59

the language model and harvesting its

play35:01

completions for some particular use case

play35:04

another Direction one can go is via

play35:06

model fine-tuning which is where you

play35:08

take the pre-trained model and you adapt

play35:11

it for a particular use case prompt

play35:13

engineering and model fine tuning both

play35:15

have their pros and cons to them if you

play35:17

want to learn more check out the

play35:19

previous two videos of this series where

play35:21

I do a deep dive into each of these

play35:23

approaches if you enjoyed this content

play35:25

please consider liking subscribing and

play35:28

sharing it with others if you have any

play35:29

questions or suggestions for future

play35:32

content please drop those in the comment

play35:34

section below and as always thank you so

play35:36

much for your time and thanks for

play35:43

watching