Let's build GPT: from scratch, in code, spelled out.

Andrej Karpathy
17 Jan 2023, 116:20

Summary

TLDR This video shows how to train a GPT-like text generation model using the Transformer architecture. By modeling sequences of text, the model learns to generate coherent output. It explains the model's internals, including self-attention, multi-head attention, positional encoding, and the feed-forward network, and shows how to implement these mechanisms in Python and PyTorch. Finally, after training on a small dataset, the model generates Shakespeare-style text.

Takeaways

  • 🤖 Introduces ChatGPT, an AI-based system that interacts through text, carries out text tasks, and generates content.
  • 📈 Compares outputs generated from the same prompt to show that ChatGPT is a probabilistic system that can give several answers to one prompt.
  • 🌐 Discusses the importance of the Transformer architecture, the core neural network behind ChatGPT and similar systems.
  • 📚 Uses the 'tiny Shakespeare' dataset to train a character-level Transformer language model.
  • 🛠️ Walks through the process of training a Transformer, including the pre-training and fine-tuning stages and training on randomly sampled chunks of data.
  • 🔍 Explains how an encoder and decoder convert text into integer sequences and how those sequences are used to train the model.
  • 📊 Shows how the PyTorch library and tensors are used to process the data and how the network is trained with forward and backward passes.
  • 🎯 Discusses the loss function used during training, how model performance is evaluated, and how adjusting parameters improves training.
  • 🔄 Describes how self-attention strengthens the model's understanding of text sequences.
  • 📈 Through experiments and analysis of results, shows the model going from random predictions to learning the patterns in the text and generating more reasonable output.
  • 🚀 Finally, points to the Nano GPT project on GitHub for further exploring and training Transformer models.

Q & A

  • What is ChatGPT, and how has it affected the AI community?

    -ChatGPT is a system that lets users interact with an AI and give it text-based tasks. It responds to a prompt by generating a sequence of text, and it has taken the AI community by storm.

  • In what sense is the text ChatGPT generates probabilistic?

    -ChatGPT is a probabilistic system: for the same prompt it can produce several different answers. Given an initial sequence of text, it predicts which characters or tokens are likely to come next.

  • What role does the Transformer architecture play in ChatGPT?

    -The Transformer is the core of ChatGPT: it processes sequence data and generates the responses. Through self-attention and positional encoding it captures the context and structure of the input text, which lets it produce coherent, relevant output.

  • What is self-attention?

    -Self-attention is a key component of the Transformer architecture. It lets the model attend to every position in a sequence while processing it, which allows it to capture long-range dependencies and better understand the context of the text.

  • Why has the Transformer architecture become so important in AI?

    -Because of its efficient parallel processing and its ability to capture long-range dependencies. Since it was proposed in 2017 it has been widely applied to tasks such as machine translation, text generation, and question answering.

  • How do you train a Transformer-based language model?

    -Training a Transformer-based language model takes large amounts of text data and compute. The data is first preprocessed (tokenization, encoding, positional encoding), and the model parameters are then adjusted with backpropagation and an optimizer such as Adam so that the model produces accurate outputs for a given input sequence.

  • Why is positional encoding needed when training a Transformer?

    -Positional encoding gives the model the position of each element in the sequence. Because the Transformer itself has no built-in notion of order, adding a position-specific signal to each input element helps the model understand where that element sits in the sequence.

  • Why does training involve both pre-training and fine-tuning?

    -Pre-training happens on a large dataset so the model learns a general representation of language; fine-tuning happens on a task-specific dataset so the model adapts to that task. Together the two stages let the model apply its general knowledge to the task at hand.

  • How is the performance of ChatGPT or a similar model evaluated?

    -Performance is usually measured with a loss function such as cross-entropy. Generated text can also be judged by humans for coherence, relevance, and accuracy, and in practice response time and computational efficiency matter as well.

  • How can overfitting be avoided when training a Transformer?

    -Techniques such as data augmentation, regularization, dropout, and early stopping help the model keep good performance without becoming overly sensitive to the training data.

Outlines

00:00

🤖 Introducing ChatGPT and interacting with AI

This segment introduces ChatGPT, a system that lets you interact with an AI and give it text-based tasks. Examples show how the AI generates a response from a given prompt, such as writing a haiku about the importance of AI. It also notes that the AI is a probabilistic system that can give multiple answers to the same prompt, and introduces the Transformer architecture behind this kind of language model.

05:03

📝 Training a Transformer-based language model

This part discusses training a Transformer-based language model, specifically a character-level one. It uses the works of Shakespeare as the dataset, explains how text characters are encoded as integer sequences, shows how a simple tokenizer converts text into a format the model can work with, and how the data is fed into a neural network for training.

10:03

🌐 Feeding the training data into the Transformer

This segment digs into how the text, as integer sequences, is fed into the Transformer for training. It explains training on small, randomly sampled chunks of the training set, shows how the input and target arrays are constructed, and how the model is trained on every position within each block-sized sequence.

15:05

🔢 Handling the batch dimension and batched data

This segment covers how batched data is handled during training: chunks are grabbed from the training set at random offsets and stacked into a single tensor for parallel processing. It also stresses the efficiency of this layout, since the GPU can process many chunks at the same time.

20:06

📈 The loss function and generation during training

This part explains how a loss function is used to evaluate the model during training and how a generate function produces text from the model. It details computing the cross-entropy loss, optimizing the model via softmax and negative log-likelihood, and generating text by repeatedly extending the sequence.

25:07

🔄 Implementing a simple bigram language model

This segment implements a simple bigram language model directly as a PyTorch module. It discusses creating an embedding table and using it to predict the next character in the sequence, evaluating the model's quality with cross-entropy loss, and generating text from the model with a generate function.

30:08

🚀 Training and optimizing the model

This segment trains the model with the Adam optimizer and a training loop that updates the model parameters. It introduces an estimate-loss function for a less noisy measure of training and validation loss, shows how switching the model to evaluation mode and disabling gradients improves memory efficiency, and notes that simply running more iterations improves performance.

35:09

🎯 Understanding self-attention

This segment digs into the math behind self-attention and how to implement it. A toy example shows how matrix multiplication performs efficient weighted aggregation and how a weight matrix controls how different tokens interact. It also covers normalizing the weights with softmax and using a triangular mask to prevent information from flowing in from the future.

40:12

🛠️ Implementing a single self-attention head

This part details implementing a single self-attention head: initializing linear modules to produce the query (Q), key (K), and value (V) vectors, computing the similarity between Q and K with dot products, obtaining attention weights with softmax, aggregating the V vectors with those weights, and passing the result through a linear layer to form the final output.

45:12

🔄 Multi-head attention and the feed-forward network

This segment builds multi-head attention by running several self-attention heads in parallel and concatenating their outputs to increase the model's expressive power. It also introduces the structure and role of the feed-forward network, how self-attention and the feed-forward network combine into the basic Transformer block, and how tuning the model's hyperparameters improves performance.

50:13

📉 Optimizing deep neural networks

This part looks at residual connections and layer normalization. Residual connections let gradients flow directly from the output back to the input, which eases optimization, while layer normalization normalizes each layer's outputs to keep training stable. It also shows how to implement both techniques in the model.

55:14

🔧 Tuning and scaling up the Transformer

This segment scales up the Transformer by adjusting hyperparameters and adding regularization such as dropout: more layers, a larger embedding dimension, and more heads. Dropout helps prevent overfitting, and training the deeper network brings the validation loss down further.

1:00:15

🎭 Generating text from the model

This part generates text with the trained Transformer, producing a large number of characters and writing them to a file. The samples are nonsense as Shakespeare, but they imitate the form and structure of the input text file.

1:05:18

📚 Summary and future directions

This segment summarizes how a decoder-only Transformer was trained and compares it with GPT-3. It discusses the pre-training and fine-tuning stages, how additional fine-tuning steps turn the model from a document completer into a question-answering system, and possible future directions, including reward models and policy-gradient optimization.


Keywords

💡ChatGPT

ChatGPT is an AI system that lets users interact with an AI through text-based tasks. In the video it is used to generate a short poem about how important it is that people understand AI, demonstrating the AI's ability to generate text and model language.

💡Transformer architecture

The Transformer is a deep learning architecture that has had a revolutionary impact on natural language processing. It uses self-attention to capture the dependencies within its input, which has led to strong results on tasks such as machine translation.

💡Self-attention

Self-attention is a technique that lets a neural network, while processing a sequence, take into account information from every position in that sequence. It allows the model to capture complex dependencies inside the sequence and so improves its understanding of sequential data such as language.
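
As a rough illustration of this mechanism, here is a minimal single-head self-attention sketch in PyTorch (the sizes B, T, C and head_size are made-up example values, not ones taken from the video):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 4, 8, 32, 16            # batch, time, channels (example values)
x = torch.randn(B, T, C)                     # token representations

key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)              # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T) affinities, scaled
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))   # decoder-style: block information from the future
wei = F.softmax(wei, dim=-1)                      # normalize affinities into weights
out = wei @ v                                     # weighted aggregation of the values: (B, T, head_size)
```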

💡Positional encoding

Positional encoding is extra information added to sequence data that tells the network where each element sits in the sequence. It is essential for correctly understanding and generating text, because it supplies the order of the words or characters.

💡Language model

A language model is a machine learning model for predicting and generating natural language text. By learning the statistical regularities of language it can understand and produce sentences, and it is one of the foundational technologies of natural language processing.

💡Encoder and decoder

In natural language processing, the encoder and decoder are usually the two main parts of a model: the encoder converts the input into an internal representation, and the decoder converts that representation back into output. In the Transformer architecture, the encoder and decoder interact through attention to understand and generate text.

💡Pre-training and fine-tuning

Pre-training trains a model on a large dataset to learn general knowledge; fine-tuning further optimizes the model for a specific task. Pre-training gives the model the basic regularities of language, while fine-tuning adapts it to a particular application.

💡Multi-head attention

Multi-head attention lets the model attend to several different representation subspaces at the same time. By learning the relationships between different parts of the sequence in parallel, it improves the model's ability to understand complex data.

💡Layer normalization

Layer normalization is a technique commonly used in deep neural networks that normalizes each layer's activations to stabilize training and speed up convergence. This helps avoid vanishing or exploding gradients and makes training more efficient.

💡Dropout

Dropout is a regularization technique that randomly drops (temporarily removes) some units of the network during training to prevent overfitting. This forces the network to learn more robust features, because it cannot rely on the activation of any single unit.

Highlights

Introduces ChatGPT, a system that lets you interact with an AI and complete text-based tasks.

An AI-generated haiku underlines how important it is that people understand AI.

Shows ChatGPT's output: AI is a power for growth, while ignorance holds back our progress.

Explains that ChatGPT is a probabilistic system that can give multiple answers to the same prompt.

Discusses the Transformer architecture, the core mechanism behind ChatGPT.

Mentions the landmark 2017 paper "Attention Is All You Need", which proposed the Transformer architecture.

Describes how a Transformer neural network predicts text sequences as a character-level language model.

Notes that proficiency in Python and a basic understanding of calculus and statistics are all that is needed to follow ChatGPT's inner workings.

Discusses training Transformer models with the Nano GPT repository on GitHub.

Shows how a character-level tokenizer converts text into a sequence of integers.

Explains encoding the entire training set into an integer tensor that can be fed into the Transformer model.

Discusses splitting the dataset into training and validation sets to gauge how much the model overfits.

Introduces training the Transformer on text sequences, including setting the batch size and block size.

Shows generating text with a simple bigram language model and discusses its limitations.

Discusses using self-attention to improve the model so it can better understand and predict text sequences.

Introduces multi-head attention and how processing several attention heads in parallel improves model performance.

Discusses the importance of positional encoding and how position embeddings give the model information about where a token sits in the sequence.

Mentions using layer norm and residual connections to optimize the training of deep neural networks.

Shows how training a larger Transformer model improves performance on the language-generation task.

Transcripts

play00:00

hi everyone

play00:01

so by now you have probably heard of

play00:03

ChatGPT, it has taken the world and the

play00:05

AI Community by storm and it is a system

play00:07

that allows you to interact with an AI

play00:10

and give it text-based tasks so for

play00:13

example we can ask chatgpt to write us a

play00:15

small haiku about how important it is

play00:17

that people understand Ai and then they

play00:18

can use it to improve the world and make

play00:20

it more prosperous so when we run this

play00:23

AI knowledge brings prosperity for all

play00:25

to see Embrace its power okay not bad

play00:29

and so you could see that ChatGPT went

play00:30

from left to right and generated all

play00:33

these words seek sort of sequentially

play00:35

now I asked it already the exact same

play00:38

prompt a little bit earlier and it

play00:40

generated a slightly different outcome

play00:41

AI is power to grow ignorance holds us

play00:44

back learn Prosperity weights

play00:47

so uh pretty good in both cases and

play00:49

slightly different so you can see that

play00:50

chatgpt is a probabilistic system and

play00:53

for any one prompt it can give us

play00:54

multiple answers sort of replying to it

play00:58

now this is just one example of a prompt

play01:00

people have come up with many many

play01:01

examples and there are entire websites

play01:03

that index interactions with ChatGPT

play01:06

and so many of them are quite humorous

play01:09

explain HTML to me like I'm a dog write

play01:12

release notes for chess 2. write a note

play01:15

about Elon Musk buying on Twitter

play01:17

and so on

play01:19

so as an example please write a breaking

play01:21

news article about a leaf falling from a

play01:23

tree

play01:24

uh and a shocking turn of events a leaf

play01:26

has fallen from a tree in the local

play01:28

park Witnesses report that the leaf

play01:30

which was previously attached to a

play01:31

branch of a tree detached itself and

play01:33

fell to the ground very dramatic so you

play01:36

can see that this is a pretty remarkable

play01:37

system and it is what we call a language

play01:40

model because it it models the sequence

play01:44

of words or characters or tokens more

play01:47

generally and it knows how sort of words

play01:49

follow each other in English language

play01:51

and so from its perspective what it is

play01:54

doing is it is completing the sequence

play01:56

so I give it the start of a sequence and

play01:59

it completes the sequence with the

play02:01

outcome and so it's a language model in

play02:03

that sense

play02:04

now I would like to focus on the under

play02:06

the hood of

play02:08

um under the hood components of what

play02:10

makes chat GPT work so what is the

play02:12

neural network under the hood that

play02:13

models the sequence of these words

play02:16

and that comes from this paper called

play02:18

attention is all you need in 2017 a

play02:22

landmark paper in AI

play02:24

that produced and proposed the

play02:26

Transformer architecture

play02:28

so GPT is short for

play02:31

generatively pre-trained Transformer so

play02:34

Transformer is the neural net that

play02:36

actually does all the heavy lifting

play02:37

under the hood it comes from this paper

play02:39

in 2017. now if you read this paper this

play02:42

reads like a pretty random machine

play02:44

translation paper and that's because I

play02:46

think the authors didn't fully

play02:47

anticipate the impact that the

play02:49

Transformer would have on the field and

play02:51

this architecture that they produced in

play02:53

the context of machine translation in

play02:55

their case actually ended up taking over

play02:57

the rest of AI in the next five years

play02:59

after and so this architecture with

play03:02

minor changes was copy pasted into a

play03:05

huge amount of applications in AI in

play03:07

more recent years and that includes at

play03:10

the core of chat GPT

play03:12

now we are not going to what I'd like to

play03:15

do now is I'd like to build out

play03:16

something like chatgpt but we're not

play03:19

going to be able to of course reproduce

play03:20

chatgpt this is a very serious

play03:22

production grade system it is trained on

play03:25

a good chunk of internet and then

play03:28

there's a lot of pre-training and

play03:30

fine-tuning stages to it and so it's

play03:32

very complicated what I'd like to focus

play03:34

on is just to train a Transformer based

play03:37

language model and in our case it's

play03:39

going to be a character level

play03:41

a language model I still think that is a

play03:43

very educational with respect to how

play03:45

these systems work so I don't want to

play03:47

train on the chunk of Internet we need a

play03:49

smaller data set in this case I propose

play03:51

that we work with my favorite toy data

play03:54

set it's called tiny Shakespeare and

play03:56

what it is is basically it's a

play03:58

concatenation of all of the works of

play03:59

Shakespeare in my understanding and so

play04:02

this is all of Shakespeare in a single

play04:04

file this file is about one megabyte

play04:07

and it's just all of Shakespeare

play04:09

and what we are going to do now is we're

play04:11

going to basically model how these

play04:13

characters follow each other so for

play04:15

example given a chunk of these

play04:17

characters like this

play04:18

are given some context of characters in

play04:21

the past the Transformer neural network

play04:23

will look at the characters that I've

play04:25

highlighted and is going to predict that

play04:27

g is likely to come next in the sequence

play04:29

and it's going to do that because we're

play04:31

going to train that Transformer on

play04:33

Shakespeare and it's just going to try

play04:35

to produce uh character sequences that

play04:37

look like this

play04:39

and in that process is going to model

play04:40

all the patterns inside this data so

play04:43

once we've trained the system I'd just

play04:45

like to give you a preview we can

play04:47

generate infinite Shakespeare and of

play04:49

course it's a fake thing that looks kind

play04:52

of like Shakespeare

play04:54

um

play04:56

apologies for there's some junk that I'm

play04:59

not able to resolve in in here but

play05:02

um

play05:03

you can see how this is going character

play05:05

by character and it's kind of like

play05:07

predicting Shakespeare like language so

play05:10

verily my Lord the sights have left the

play05:13

again the king coming with my curses

play05:16

with precious pale and then tronio says

play05:19

something else Etc and this is just

play05:21

coming out of the Transformer in a very

play05:23

similar manner as it would come out in

play05:25

ChatGPT, in our case character by

play05:27

character, in ChatGPT it's coming out

play05:31

on the token by token level and tokens

play05:33

are these a sort of like little sub word

play05:35

pieces so they're not Word level they're

play05:37

kind of like word chunk level

play05:40

um and now the I've already written this

play05:43

entire code to train these Transformers

play05:46

um and it is in a GitHub repository that

play05:49

you can find and it's called a nano GPT

play05:52

so Nano GPT is a repository that you can

play05:54

find on my GitHub and it's a repository

play05:57

for training Transformers

play05:59

um On Any Given text

play06:01

and what I think is interesting about it

play06:02

because there's many ways to train

play06:03

Transformers but this is a very simple

play06:05

implementation so it's just two files of

play06:08

300 lines of code each one file defines

play06:11

the GPT model the Transformer and one

play06:13

file trains it on some given Text data

play06:15

set and here I'm showing that if you

play06:17

train it on a open webtext data set

play06:19

which is a fairly large data set of web

play06:21

pages then I reproduce the the

play06:24

performance of gpt2

play06:26

so gpt2 is an early version of openai's

play06:29

GPT from 2017 if I recall correctly and

play06:33

I've only so far reproduced the the

play06:35

smallest 124 million parameter model but

play06:38

basically this is just proving that the

play06:39

code base is correctly arranged and I'm

play06:41

able to load the neural network weights

play06:44

that open AI has released later

play06:46

so you can take a look at the finished

play06:49

code here in Nano GPT but what I would

play06:51

like to do in this lecture is I would

play06:53

like to basically write this repository

play06:56

from scratch so we're going to begin

play06:57

with an empty file and we're going to

play07:00

define a Transformer piece by piece

play07:03

we're going to train it on the tiny

play07:05

Shakespeare data set and we'll see how

play07:07

we can then generate infinite

play07:09

Shakespeare and of course this can copy

play07:11

paste to any arbitrary Text data set

play07:13

that you like but my goal really here is

play07:16

to just make you understand and

play07:17

appreciate how under the hood chat GPT

play07:20

works and really all that's required is

play07:23

a Proficiency in Python and some basic

play07:27

understanding of calculus and statistics

play07:29

and it would help if you also see my

play07:32

previous videos on the same YouTube

play07:34

channel, in particular my makemore

play07:36

series where I

play07:38

Define smaller and simpler neural

play07:41

network language models so multilevel

play07:43

perceptrons and so on it really

play07:45

introduces the language modeling

play07:47

framework and then here in this video

play07:49

we're going to focus on the Transformer

play07:50

neural network itself

play07:52

okay so I created a new Google Colab uh

play07:55

jupyter notebook here and this will

play07:57

allow me to later easily share this code

play07:59

that we're going to develop together

play08:00

with you so you can follow along so this

play08:03

will be in the video description later

play08:05

now here I've just done some

play08:07

preliminaries I downloaded the data set

play08:09

the tiny Shakespeare data set at this

play08:11

URL and you can see that it's about a

play08:12

one megabyte file

play08:14

then here I open the input.txt file and

play08:17

just read in all the text as a string

play08:19

and we see that we are working with 1

play08:21

million characters roughly

play08:23

and the first 1000 characters if we just

play08:25

print them out are basically what you

play08:27

would expect this is the first 1000

play08:28

characters of the tiny Shakespeare data

play08:31

set roughly up to here

play08:33

so so far so good next we're going to

play08:36

take this text and the text is a

play08:38

sequence of characters in Python so when

play08:40

I call the set Constructor on it I'm

play08:43

just going to get the set of all the

play08:45

characters that occur in this text

play08:48

and then I call list on that to create a

play08:51

list of those characters instead of just

play08:52

a set so that I have an ordering an

play08:54

arbitrary ordering

play08:56

and then I sort that

play08:58

so basically we get just all the

play08:59

characters that occur in the entire data

play09:01

set and they're sorted now the number of

play09:03

them is going to be our vocabulary size

play09:05

these are the possible elements of our

play09:07

sequences and we see that when I print

play09:10

here the characters

play09:12

there's 65 of them in total there's a

play09:14

space character and then all kinds of

play09:16

special characters

play09:18

and then capitals and lowercase letters

play09:21

so that's our vocabulary and that's the

play09:23

sort of like possible characters that

play09:25

the model can see or emit

play09:28

okay so next we would like to develop

play09:30

some strategy to tokenize the input text

play09:33

now when people say tokenize they mean

play09:36

convert the raw text as a string to some

play09:39

sequence of integers According to some

play09:41

codebook, according to some vocabulary of

play09:43

possible elements

play09:45

so as an example here we are going to be

play09:47

building a character level language

play09:49

model so we're simply going to be

play09:50

translating individual characters into

play09:52

integers

play09:53

so let me show you a chunk of code that

play09:55

sort of does that for us

play09:57

so we're building both the encoder and

play09:59

the decoder and let me just talk through

play10:01

What's Happening Here

play10:03

when we encode an arbitrary text like hi

play10:05

there we're going to receive a list of

play10:08

integers that represents that string so

play10:11

for example 46 47 Etc

play10:14

and then we also have the reverse

play10:16

mapping so we can take this list and

play10:19

decode it to get back the exact same

play10:21

string

play10:22

so it's really just like a translation

play10:23

two integers and back for arbitrary

play10:26

string and for us it is done on a

play10:28

character level

play10:30

now the way this was achieved is we just

play10:32

iterate over all the characters here and

play10:34

create a lookup table from the character

play10:35

to the integer and vice versa and then

play10:38

to encode some string we simply

play10:40

translate all the characters

play10:41

individually and to decode it back we

play10:44

use the reverse mapping and concatenate

play10:46

all of it
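
Here is a minimal sketch of the character-level encoder and decoder just described (it assumes the `text` string loaded from input.txt earlier in the notebook; this is an illustration, not the exact cell):

```python
chars = sorted(list(set(text)))              # all unique characters in the dataset
vocab_size = len(chars)                      # 65 for tiny Shakespeare

stoi = {ch: i for i, ch in enumerate(chars)}     # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}     # integer -> character

encode = lambda s: [stoi[c] for c in s]              # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l])     # list of integers -> string

print(encode("hi there"))
print(decode(encode("hi there")))            # round-trips back to the original string
```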

play10:47

now this is only one of many possible

play10:49

encodings or many possible sort of

play10:51

tokenizers and it's a very simple one

play10:53

but there's many other schemas that

play10:55

people have come up with in practice so

play10:57

for example Google uses SentencePiece

play11:00

uh so SentencePiece will also encode

play11:02

text into integers but in a different

play11:05

schema and using a different vocabulary

play11:08

and SentencePiece is a subword sort of

play11:12

tokenizer and what that means is that

play11:14

you're not encoding entire words but

play11:17

you're not also encoding individual

play11:18

characters it's it's a sub word unit

play11:21

level and that's usually what's adopted

play11:23

in practice for example also openai has

play11:25

this library called tiktoken that uses

play11:28

a byte pair encoding tokenizer

play11:31

um and that's what GPT uses

play11:33

and you can also just encode words into

play11:35

like hello world into a list of integers

play11:38

so as an example I'm using the

play11:40

tiktoken library here

play11:42

I'm getting the encoding for gpt2 or

play11:44

that was used for gpt2

play11:46

instead of just having 65 possible

play11:48

characters or tokens they have 50 000

play11:51

tokens

play11:52

and so when they encode the exact same

play11:54

string hi there we only get a list of

play11:57

three integers but those integers are

play11:59

not between 0 and 64. they are between 0

play12:02

and 50,256.
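
For comparison, the subword tokenizer mentioned here can be tried with OpenAI's tiktoken library (a rough sketch; the exact token ids printed depend on the GPT-2 vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # the BPE tokenizer used for GPT-2
print(enc.n_vocab)                           # 50257 possible tokens
tokens = enc.encode("hi there")              # a short list of subword token ids
print(tokens)
print(enc.decode(tokens))                    # back to "hi there"
```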

play12:06

so basically you can trade off the code

play12:09

book size and the sequence lengths so

play12:12

you can have a very long sequences of

play12:14

integers with very small vocabularies or

play12:16

you can have a short

play12:18

um

play12:19

sequences of integers with very large

play12:21

vocabularies and so typically people use

play12:25

in practice the sub word encodings but

play12:28

I'd like to keep our tokenizer very

play12:30

simple so we're using character level

play12:31

tokenizer

play12:32

and that means that we have very small

play12:34

code books we have very simple encode

play12:36

and decode functions but we do get very

play12:40

long sequences as a result but that's

play12:42

the level at which we're going to stick

play12:43

with this lecture because it's the

play12:44

simplest thing okay so now that we have

play12:46

an encoder and a decoder effectively a

play12:49

tokenizer we can tokenize the entire

play12:51

training set of Shakespeare so here's a

play12:54

chunk of code that does that

play12:55

and I'm going to start to use the

play12:56

pytorch library and specifically the

play12:58

torch.tensor from the pytorch library

play13:01

so we're going to take all of the text

play13:02

in tiny Shakespeare encode it and then

play13:05

wrap it into a torch.tensor to get the

play13:08

data tensor so here's what the data

play13:10

tensor looks like when I look at just

play13:11

the first 1000 characters or the 1000

play13:14

elements of it

play13:15

so we see that we have a massive

play13:16

sequence of integers and this sequence

play13:19

of integers here is basically an

play13:21

identical translation of the first 1000

play13:23

characters here

play13:25

so I believe for example that zero is a

play13:27

new line character and maybe one is a

play13:29

space not 100 sure but from now on the

play13:33

entire data set of text is

play13:34

re-represented as just it just stretched

play13:36

out as a single very large uh sequence

play13:38

of integers
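
A sketch of wrapping the encoded text into a tensor, together with the 90/10 train/validation split described next (assuming the `encode` function and `text` string from the earlier sketch):

```python
import torch

data = torch.tensor(encode(text), dtype=torch.long)   # the whole dataset as one long integer sequence
print(data.shape, data.dtype)

n = int(0.9 * len(data))          # first 90% for training
train_data = data[:n]
val_data = data[n:]               # last 10% held out for validation
```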

play13:40

let me do one more thing before we move

play13:41

on here I'd like to separate out our

play13:43

data set into a train and a validation

play13:46

split so in particular we're going to

play13:48

take the first 90% of the data set and

play13:51

consider that to be the training data

play13:53

for the Transformer and we're going to

play13:54

withhold the last 10 percent at the end

play13:56

of it to be the validation data and this

play13:59

will help us understand to what extent

play14:01

our model is overfitting so we're going

play14:03

to basically hide and keep the

play14:04

validation data on the side because we

play14:06

don't want just a perfect memorization

play14:08

of this exact Shakespeare we want a

play14:11

neural network that sort of creates

play14:12

Shakespeare like text and so it should

play14:15

be fairly likely for it to produce

play14:17

the actual like stowed away uh true

play14:22

Shakespeare text

play14:24

um and so we're going to use this to get

play14:26

a sense of the overfitting okay so now

play14:28

we would like to start plugging these

play14:29

text sequences or integer sequences into

play14:32

the Transformer so that it can train and

play14:34

learn those patterns

play14:35

now the important thing to realize is

play14:38

we're never going to actually feed the

play14:39

entire text into Transformer all at once

play14:41

that would be computationally very

play14:43

expensive and prohibitive so when we

play14:45

actually train a Transformer on a lot of

play14:47

these data sets we only work with chunks

play14:49

of the data set and when we train the

play14:51

Transformer we basically sample random

play14:53

little chunks out of the training set

play14:54

and train them just chunks at a time and

play14:57

these chunks have basically some kind of

play14:59

a length

play15:01

and as a maximum length now the maximum

play15:04

length typically at least in the code I

play15:06

usually write is called block size

play15:08

you can you can find it on the different

play15:10

names like context length or something

play15:12

like that let's start with the block

play15:14

size of just eight and let me look at

play15:16

the first train data characters the

play15:18

first block size plus one characters

play15:20

I'll explain why plus one in a second

play15:23

so this is the first nine characters in

play15:26

the sequence in the training set

play15:29

now what I'd like to point out is that

play15:30

when you sample a chunk of data like

play15:32

this so say that these nine characters

play15:34

out of the training set

play15:36

this actually has multiple examples

play15:38

packed into it

play15:39

and that's because all of these

play15:41

characters follow each other

play15:43

and so what this thing is going to say

play15:46

when we plug it into a Transformer is

play15:49

we're going to actually simultaneously

play15:50

train it to make prediction at every one

play15:52

of these positions

play15:54

now in the in a chunk of nine characters

play15:57

there's actually eight individual

play15:59

examples packed in there

play16:01

so there's the example that in the

play16:04

context of 18, 47 likely comes

play16:07

next; in the context of 18 and 47, 56

play16:10

comes next; in the context of 18, 47, 56, 57

play16:14

can come next, and so on. So that's the

play16:18

eight individual examples let me

play16:20

actually spell it out with code

play16:22

so here's a chunk of code to illustrate

play16:25

X are the inputs to the Transformer it

play16:27

will just be the first block size

play16:28

characters

play16:30

y will be the next block size characters

play16:33

so it's offset by one

play16:36

and that's because y are the targets for

play16:38

each position in the input

play16:41

and then here I'm iterating over all the

play16:43

block size of 8. and the context is

play16:46

always all the characters in X up to T

play16:49

and including t

play16:50

and the target is always the t-th

play16:52

character but in the targets array y

play16:56

so let me just run this

play16:58

and basically it spells out what I've

play16:59

said in words these are the eight

play17:02

examples hidden in a chunk of nine

play17:04

characters that we uh sampled from the

play17:08

training set
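
Spelled out as a sketch of the cell being described, the eight examples hidden in a block of nine characters look roughly like this (`train_data` is the training split from above):

```python
block_size = 8
x = train_data[:block_size]          # inputs
y = train_data[1:block_size + 1]     # targets, offset by one

for t in range(block_size):
    context = x[:t + 1]              # all characters up to and including position t
    target = y[t]                    # the character that should come next
    print(f"when input is {context.tolist()} the target is {target}")
```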

play17:09

I want to mention one more thing we

play17:12

train on all the eight examples here

play17:14

with context between one all the way up

play17:17

to context of block size

play17:18

and we train on that not just for

play17:20

computational reasons because we happen

play17:21

to have the sequence already or

play17:23

something like that it's not just done

play17:24

for efficiency it's also done to make

play17:28

the Transformer Network be used to

play17:30

seeing contexts all the way from as

play17:32

little as one all the way to block size

play17:35

and we'd like the transform to be used

play17:37

to seeing everything in between and

play17:39

that's going to be useful later during

play17:41

inference because while we're sampling

play17:43

we can start the sampling generation

play17:45

with as little as one character of

play17:46

context and the Transformer knows how to

play17:48

predict the next character with all the

play17:50

way up to just one context of one and so

play17:53

then it can predict everything up to

play17:54

block size and after block size we have

play17:56

to start truncating because the

play17:58

Transformer will never receive more than

play18:01

block size inputs when it's predicting

play18:03

the next character

play18:04

Okay so we've looked at the time

play18:06

dimension of the tensors that are going

play18:08

to be feeding into the Transformer

play18:09

there's one more Dimension to care about

play18:10

and that is the batch dimension and so

play18:13

as we're sampling these chunks of text

play18:15

we're going to be actually every time

play18:17

we're going to feed them into a

play18:18

Transformer we're going to have many

play18:20

batches of multiple chunks of text that

play18:22

are all like stacked up in a single

play18:23

tensor and that's just done for

play18:25

efficiency just so that we can keep the

play18:27

gpus busy because they are very good at

play18:29

parallel processing of

play18:31

um of data and so we just want to

play18:34

process multiple chunks all at the same

play18:36

time but those chunks are processed

play18:38

completely independently they don't talk

play18:39

to each other and so on so let me

play18:42

basically just generalize this and

play18:43

introduce a batch Dimension here's a

play18:45

chunk of code

play18:47

let me just run it and then I'm going to

play18:48

explain what it does

play18:51

so here because we're going to start

play18:53

sampling random locations in the data

play18:55

set to pull chunks from I am setting the

play18:57

seed so that

play18:59

um in the random number generator so

play19:01

that the numbers I see here are going to

play19:02

be the same numbers you see later if you

play19:04

try to reproduce this

play19:06

now the batch size here is how many

play19:07

independent sequences we are processing

play19:09

every forward backward pass of the

play19:11

Transformer

play19:13

the block size as I explained is the

play19:15

maximum context length to make those

play19:17

predictions

play19:18

so let's say batch size 4, block size 8, and

play19:21

then here's how we get batch

play19:22

for any arbitrary split if the split is

play19:25

a training split then we're going to

play19:26

look at train data, otherwise

play19:28

val data

play19:29

that gets us the data array and then

play19:33

when I Generate random positions to grab

play19:35

a chunk out of

play19:37

I actually grab I actually generate

play19:39

batch size number of

play19:41

random offsets

play19:43

so because this is four, ix is

play19:46

going to be four numbers that are

play19:48

randomly generated between 0 and Len of

play19:50

data minus block size so it's just

play19:52

random offsets into the training set

play19:55

and then X's as I explained are the

play19:58

first block size characters starting at

play20:01

I

play20:02

the Y's are the offset by one of that so

play20:06

just add plus one

play20:08

and then we're going to get those chunks

play20:10

for every one of integers I in IX and

play20:13

use a torch.stack to take all those

play20:17

one-dimensional tensors as we saw here

play20:20

and we're going to

play20:21

um stack them up as rows

play20:24

and so they all become a row in a four

play20:27

by eight tensor

play20:29

so here's where I'm printing then

play20:31

when I sample a batch XP and YB

play20:34

the input the Transformer now are

play20:37

the input X is the four by eight tensor

play20:41

four uh rows of eight columns

play20:44

and each one of these is a chunk of the

play20:47

training set

play20:49

and then the targets here are in the

play20:52

associated array Y and they will come in

play20:54

through the Transformer all the way at

play20:55

the end to create the loss function so

play20:59

they will give us the correct answer for

play21:01

every single position inside X

play21:04

and then these are the four independent

play21:07

rows

play21:08

so spelled out as we did before

play21:12

this four by eight array contains a

play21:15

total of 32 examples and they're

play21:17

completely independent as far as the

play21:19

Transformer is concerned

play21:21

uh so when the

play21:24

input is 24 the target is 43 or rather

play21:27

43 here in the Y array when the input is

play21:30

24, 43 the target is 58,

play21:32

when the input is 24, 43, 58 the target is

play21:35

5, etc., or like when it is 52, 58, 1 the

play21:39

target is 58.

play21:41

right so you can sort of see this

play21:43

spelled out these are the 32 independent

play21:46

examples packed in to a single batch of

play21:48

the input X and then the desired targets

play21:51

are in y
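
A sketch of the batching function just described (the hyperparameter values follow the toy settings used here; `train_data` and `val_data` come from the split above):

```python
torch.manual_seed(1337)
batch_size = 4    # how many independent sequences we process in parallel
block_size = 8    # maximum context length for predictions

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))        # random offsets into the data
    x = torch.stack([data[i:i + block_size] for i in ix])            # (batch_size, block_size) inputs
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])    # targets, shifted by one
    return x, y

xb, yb = get_batch('train')
print(xb.shape, yb.shape)   # torch.Size([4, 8]) each
```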

play21:53

and so now this integer tensor of X is

play21:59

going to feed into the Transformer

play22:01

and that Transformer is going to

play22:03

simultaneously process all these

play22:04

examples and then look up the correct

play22:07

um integers to predict in every one of

play22:09

these positions in the tensor y okay so

play22:12

now that we have our batch of input that

play22:14

we'd like to feed into a Transformer

play22:15

let's start basically feeding this into

play22:17

neural networks now we're going to start

play22:19

off with the simplest possible neural

play22:21

network which in the case of language

play22:22

modeling in my opinion is the bigram

play22:24

language model and we've covered the

play22:26

bigram language model in my make

play22:27

more series in a lot of depth and so

play22:30

here I'm going to sort of go faster and

play22:32

let's just implement the pytorch module

play22:34

directly that implements the bigram

play22:36

language model

play22:37

so I'm importing the pytorch and then

play22:40

module

play22:41

uh for reproducibility

play22:44

and then here I'm constructing a bigram

play22:45

language model which is a subclass of NN

play22:47

module

play22:49

and then I'm calling it and I'm passing

play22:51

in the inputs and the targets

play22:53

and I'm just printing now when the

play22:55

inputs and targets come here you see

play22:57

that I'm just taking the index the

play22:59

inputs X here which I rename to idx and

play23:03

I'm just passing them into this token

play23:04

embedding table

play23:06

so what's going on here is that here in

play23:08

the Constructor

play23:09

we are creating a token embedding table

play23:11

and it is of size vocab size by vocab

play23:14

size

play23:16

and we're using nn.embedding which is a

play23:18

very thin wrapper around basically a

play23:20

tensor of shape vocab size by vocab

play23:23

size

play23:24

and what's happening here is that when

play23:25

we pass idx here every single integer in

play23:28

our input is going to refer to this

play23:30

embedding table and is going to pluck

play23:32

out a row of that embedding table

play23:34

corresponding to its index so 24 here

play23:38

we'll go to the embedding table and

play23:39

we'll pluck out the 24th row and then 43

play23:42

will go here and pluck out the 43rd row

play23:45

Etc and then Pi torch is going to

play23:47

arrange all of this into a batch by Time

play23:50

by Channel tensor in this case batch is

play23:53

4 time is 8 and C which is the channels

play23:58

is vocab size or 65. and so we're just

play24:01

going to pluck out all those rows

play24:02

arrange them in a b by T by C and now

play24:06

we're going to interpret this as the

play24:07

logits which are basically the scores

play24:09

for the next character in the sequence

play24:12

and so what's happening here is we are

play24:14

predicting what comes next based on just

play24:17

the individual identity of a single

play24:19

token and you can do that because

play24:22

um I mean currently the tokens are not

play24:23

talking to each other and they're not

play24:25

seeing any context except for they're

play24:26

just seeing themselves so I'm a I'm a

play24:29

token number five and then I can

play24:32

actually make pretty decent predictions

play24:33

about what comes next just by knowing

play24:35

that I'm token five because some

play24:37

characters sort of follow other

play24:40

characters in typical scenarios so we

play24:43

saw a lot of this in a lot more depth in

play24:45

the make more series and here if I just

play24:47

run this then we currently get the

play24:49

predictions the scores the logits for

play24:53

every one of the four by eight positions
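
The embedding-table lookup being described, in rough form (a sketch; the full module with the loss and generation comes later):

```python
import torch.nn as nn

vocab_size = 65
token_embedding_table = nn.Embedding(vocab_size, vocab_size)   # a (vocab_size, vocab_size) lookup table

logits = token_embedding_table(xb)   # xb is (B, T) integers, so logits is (B, T, C) with C = vocab_size
print(logits.shape)                  # torch.Size([4, 8, 65])
```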

play24:55

now that we've made predictions about

play24:56

what comes next we'd like to evaluate

play24:58

the loss function and so in make more

play25:00

series we saw that a good way to measure

play25:02

a loss or like a quality of the

play25:04

predictions is to use the negative log

play25:06

likelihood loss which is also

play25:08

implemented in pytorch under the name

play25:10

cross entropy

play25:11

so what we'd like to do here is

play25:14

loss is the cross entropy on the

play25:17

predictions and the targets and so this

play25:19

measures the quality of the logits with

play25:21

respect to the Targets in other words we

play25:24

have the identity of the next character

play25:25

so how well are we predicting the next

play25:28

character based on the logits and

play25:30

intuitively the correct

play25:32

um the correct dimension of logits uh

play25:36

depending on whatever the target is

play25:37

should have a very high number and all

play25:39

the other dimensions should be very low

play25:40

number right

play25:42

now the issue is that this won't

play25:44

actually this is what we want we want to

play25:46

basically output the logits and the loss

play25:51

this is what we want but unfortunately

play25:52

uh this won't actually run

play25:55

we get an error message but intuitively

play25:58

we want to measure this now when we go

play26:01

to the pi torch cross entropy

play26:04

a documentation here

play26:06

um

play26:07

we're trying to call the cross entropy

play26:09

in its functional form so that means we

play26:11

don't have to create like a module for

play26:13

it

play26:14

but here when we go to the documentation

play26:16

you have to look into the details of how

play26:18

pytorch expects these inputs and

play26:20

basically the issue here is PyTorch

play26:22

expects if you have multi-dimensional

play26:24

input which we do because we have a b by

play26:26

T by C tensor then it actually really

play26:29

wants the channels to be the second

play26:32

dimension here

play26:34

so if you um so basically it wants a b

play26:38

by C by T instead of a b by T by C

play26:42

and so it's just the details of how

play26:44

pytorch treats

play26:45

um these kinds of inputs and so we don't

play26:49

actually want to deal with that so what

play26:51

we're going to do instead is we need to

play26:52

basically reshape our logits so here's

play26:54

what I like to do I like to take

play26:56

basically give names to the dimensions

play26:58

so logits.shape is B by T by C and

play27:01

unpack those numbers

play27:02

and then let's say that logits equals

play27:05

logits.view

play27:07

and we want it to be a B times

play27:09

T by C so just a two-dimensional array

play27:13

right so we're going to take all the

play27:15

we're going to take all of these

play27:17

um

play27:18

positions here and we're going to uh

play27:20

stretch them out in a one-dimensional

play27:22

sequence

play27:23

and preserve the channel Dimension as

play27:25

the second dimension

play27:27

so we're just kind of like stretching

play27:29

out the array so it's two-dimensional

play27:30

and in that case it's going to better

play27:32

conform to what pi torch sort of expects

play27:34

in its dimensions

play27:36

now we have to do the same to targets

play27:38

because currently targets

play27:40

are of shape B by T and we want it to be

play27:45

just B times T so one dimensional now

play27:48

alternatively you could always still

play27:50

just do -1 because Pi torch will guess

play27:53

what this should be if you want to lay

play27:54

it out but let me just be explicit on

play27:56

and say B times T

play27:58

once we've reshaped this it will match

play28:00

the cross entropy case

play28:02

and then we should be able to evaluate

play28:04

our loss
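
A sketch of the reshaping just described, so that the logits and targets match what F.cross_entropy expects:

```python
import torch.nn.functional as F

B, T, C = logits.shape
loss = F.cross_entropy(logits.view(B * T, C), yb.view(B * T))   # flatten batch and time, keep channels
print(loss)   # roughly -ln(1/65) ≈ 4.17 for a perfectly diffuse model, a bit higher at random init
```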

play28:06

okay so with that right now and we can

play28:09

do loss and So currently we see that the

play28:12

loss is 4.87

play28:14

now because our we have 65 possible

play28:17

vocabulary elements we can actually

play28:19

guess at what the loss should be and in

play28:22

particular

play28:22

we covered negative log likelihood in a

play28:25

lot of detail we are expecting log or

play28:28

long of

play28:30

um 1 over 65 and negative of that

play28:33

so we're expecting the loss to be about

play28:35

4.17, but we're getting 4.87, and so

play28:39

that's telling us that the initial

play28:40

predictions are not super diffuse

play28:42

they've got a little bit of entropy and

play28:44

so we're guessing wrong

play28:46

uh so uh yes but actually we're I able

play28:50

we are able to evaluate the loss okay so

play28:53

now that we can evaluate the quality of

play28:55

the model on some data we'd like to also

play28:57

be able to generate from the model so

play28:59

let's do the generation now I'm going to

play29:01

go again a little bit faster here

play29:03

because I covered all this already in

play29:04

previous videos

play29:06

so

play29:08

here's a generate function for the model

play29:12

so we take some uh we take the the same

play29:15

kind of input idx here

play29:17

and basically

play29:19

this is the current context of some

play29:22

characters in a batch in some batch

play29:25

so it's also B by T and the job of

play29:28

generate is to basically take this B by

play29:30

T and extend it to be B by T plus one

play29:32

plus two plus three and so it's just

play29:34

basically it contains the generation in

play29:36

all the batch dimensions in the time

play29:38

dimension

play29:39

So that's its job and we'll do that for

play29:41

Max new tokens

play29:42

so you can see here on the bottom

play29:44

there's going to be some stuff here but

play29:46

on the bottom whatever is predicted is

play29:48

concatenated on top of the previous idx

play29:51

along the First Dimension which is the

play29:53

time Dimension to create a b by T plus

play29:55

one

play29:56

so that becomes the new idx so the job

play29:58

of generators to take a b by T and make

play30:00

it a b by T plus one plus two plus three

play30:03

as many as we want maximum tokens so

play30:05

this is the generation from the model

play30:08

now inside the generation what we're

play30:09

what are we doing we're taking the

play30:11

current indices we're getting the

play30:13

predictions so we get those are in the

play30:16

logits

play30:17

and then the loss here is going to be

play30:19

ignored because

play30:20

um we're not we're not using that and we

play30:22

have no targets that are sort of ground

play30:24

truth targets that we're going to be

play30:26

comparing with

play30:28

then once we get the logits we are only

play30:30

focusing on the last step so instead of

play30:33

a b by T by C we're going to pluck out

play30:36

the negative one the last element in the

play30:38

time dimension

play30:40

because those are the predictions for

play30:41

what comes next

play30:42

so that this is the logits which we then

play30:44

convert to probabilities via softmax and

play30:47

then we use torch that multinomial to

play30:49

sample from those probabilities and we

play30:51

ask PyTorch to give us one sample

play30:53

and so idx next will become a b by one

play30:56

because in each one of the batch

play30:59

Dimensions we're going to have a single

play31:01

prediction for what comes next so this

play31:03

num samples equals one will make this be

play31:05

a one

play31:07

and then we're going to take those

play31:08

integers that come from the sampling

play31:10

process according to the probability

play31:11

distribution given here

play31:13

and those integers got just concatenated

play31:15

on top of the current sort of like

play31:17

running stream of integers and this

play31:19

gives us a B by T plus one

play31:21

and then we can return that now one

play31:24

thing here is you see how I'm calling

play31:26

self of idx which will end up going to

play31:29

the forward function I'm not providing

play31:31

any Targets So currently this would give

play31:33

an error because targets is uh is uh

play31:36

sort of like not given so target has to

play31:39

be optional so targets is none by

play31:41

default and then if targets is none then

play31:44

there's no loss to create so it's just

play31:47

loss is none but else all of this

play31:50

happens and we can create a loss

play31:52

so this will make it so

play31:54

um

play31:55

if we have the targets we provide them

play31:57

and get a loss if we have no targets

play31:58

we'll just get the logits
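
Putting the forward pass with optional targets and the generate loop together, here is a paraphrased sketch of the class being described (not the exact code from the video):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)            # (B, T, C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)                           # (B, T, C); the loss is ignored here
            logits = logits[:, -1, :]                       # focus on the last time step: (B, C)
            probs = F.softmax(logits, dim=-1)               # probabilities over the vocabulary
            idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1) sampled next token
            idx = torch.cat((idx, idx_next), dim=1)         # append along the time dimension
        return idx
```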

play32:01

so this here will generate from the

play32:03

model

play32:04

um and let's take that for a ride now

play32:09

oops

play32:10

so I have another code chunk here which

play32:12

will generate for the model from the

play32:14

model and okay this is kind of crazy so

play32:16

maybe let me let me break this down

play32:19

so these are the idx right

play32:24

I'm creating a batch will be just one

play32:27

time will be just one

play32:29

so I'm creating a little one by one

play32:31

tensor and it's holding a zero

play32:34

and the D type the data type is integer

play32:37

so 0 is going to be how we kick off the

play32:39

generation and remember that zero is uh

play32:42

is the element standing for a new line

play32:45

character so it's kind of like a

play32:46

reasonable thing to to feed in as the

play32:48

very first character in a sequence to be

play32:50

the new line

play32:52

um so it's going to be idx which we're

play32:55

going to feed in here then we're going

play32:56

to ask for 100 tokens

play32:58

and then enter generate will continue

play33:00

that

play33:01

now because uh generate works on the

play33:05

level of batches we then have to index

play33:07

into the zeroth row to basically unplug

play33:10

the um

play33:11

the single batch dimension that exists

play33:14

and then that gives us a um

play33:18

time steps it's just a one-dimensional

play33:20

array of all the indices which we will

play33:22

convert to simple python list

play33:25

from pytorch tensor so that that can

play33:28

feed into our decode function and

play33:31

convert those integers into text
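
And the way generation is kicked off, as a sketch (index 0 is the newline character in this vocabulary; `decode` is the function from the tokenizer sketch above):

```python
m = BigramLanguageModel(vocab_size)
context = torch.zeros((1, 1), dtype=torch.long)              # a single newline token as the seed
out = m.generate(context, max_new_tokens=100)[0].tolist()    # index into the 0th row to drop the batch dimension
print(decode(out))                                           # garbage until the model is trained
```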

play33:33

so let me bring this back and we're

play33:36

generating 100 tokens let's run

play33:38

and uh here's the generation that we

play33:41

achieved so obviously it's garbage and

play33:43

the reason it's garbage is because this

play33:45

is a totally random model so next up

play33:47

we're going to want to train this model

play33:49

now one more thing I wanted to point out

play33:50

here is

play33:52

this function is written to be General

play33:53

but it's kind of like ridiculous right

play33:56

now because

play33:57

we're feeding in all this we're building

play33:59

out this context and we're concatenating

play34:02

it all and we're always feeding it all

play34:04

into the model

play34:06

but that's kind of ridiculous because

play34:08

this is just a simple bigram model

play34:09

so to make for example this prediction

play34:11

about K we only needed this W but

play34:14

actually what we fed into the model is

play34:16

we fed the entire sequence and then we

play34:18

only looked at the very last piece and

play34:20

predicted k

play34:22

so the only reason I'm writing it in

play34:24

this way is because right now this is a

play34:26

bigram model but I'd like to keep this

play34:28

function fixed and I'd like it to work

play34:30

later when our character is actually

play34:34

basically look further in the history

play34:36

and so right now the history is not used

play34:38

so this looks silly but eventually the

play34:41

history will be used and so that's why

play34:43

we want to do it this way so just a

play34:46

quick comment on that so now we see that

play34:48

this is um random so let's train the

play34:51

model so it becomes a bit less random

play34:53

okay let's Now train the model so first

play34:55

what I'm going to do is I'm going to

play34:56

create a pytorch optimization object

play34:59

so here we are using the optimizer

play35:02

AdamW

play35:03

now in the make more series we've only

play35:06

ever used stochastic gradient descent

play35:07

the simplest possible Optimizer which

play35:09

you can get using the SGD instead but I

play35:12

want to use Adam which is a much more

play35:13

advanced and popular Optimizer and it

play35:15

works extremely well for a typical good

play35:18

setting for the learning rate is roughly

play35:20

3e-4, but for very very small

play35:23

networks, like is the case here, you can

play35:25

get away with much much higher learning

play35:26

rates, like 1e-3, or even

play35:28

higher probably

play35:29

but let me create the optimizer object

play35:32

which will basically take the gradients

play35:34

and update the parameters using the

play35:36

gradients

play35:37

and then here

play35:39

our batch size up above was only four so

play35:41

let me actually use something bigger

play35:42

let's say 32 and then for some number of

play35:45

steps

play35:46

um we are sampling a new batch of data

play35:48

we're evaluating the loss we're zeroing

play35:51

out all the gradients from the previous

play35:53

step getting the gradients for all the

play35:55

parameters and then using those

play35:57

gradients to update our parameters so

play35:58

typical training loop as we saw in the

play36:01

make more series
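
A sketch of the training loop just described (the learning rate and batch size follow the values mentioned here; the number of steps is arbitrary):

```python
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)   # a higher learning rate is fine for this tiny network
batch_size = 32

for steps in range(10000):
    xb, yb = get_batch('train')            # sample a fresh batch of data
    logits, loss = m(xb, yb)               # evaluate the loss
    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()                        # gradients for all parameters
    optimizer.step()                       # update the parameters

print(loss.item())   # roughly 2.5 for the bigram model after enough steps
```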

play36:02

so let me now uh run this

play36:05

for say 100 iterations and let's see

play36:07

what kind of losses we're gonna get

play36:10

so we started around 4.7

play36:13

and now we're going to down to like 4.6

play36:15

4.5

play36:16

Etc so the optimization is definitely

play36:18

happening but

play36:20

um let's uh sort of try to increase the

play36:23

number of iterations and only print at

play36:25

the end

play36:26

because we probably will not train for

play36:28

longer

play36:30

okay so we're down to 3.6 roughly

play36:35

roughly down to three

play36:41

this is the most janky optimization

play36:47

okay it's working let's just do ten

play36:48

thousand

play36:51

and then from here we want to copy this

play36:55

and hopefully we're going to get

play36:57

something reasonable and of course it's

play36:58

not going to be Shakespeare from a

play37:00

bigram model but at least we see

play37:01

that the loss is improving and hopefully

play37:05

we're expecting something a bit more

play37:06

reasonable

play37:07

okay so we're down there about 2.5 ish

play37:09

let's see what we get

play37:11

okay

play37:12

dramatic improvements certainly on what

play37:14

we had here

play37:15

so let me just increase the number of

play37:18

tokens

play37:19

okay so we see that we're starting to

play37:20

get something at least like

play37:22

reasonable ish

play37:24

um

play37:26

certainly not Shakespeare but the model

play37:29

is making progress so that is the

play37:31

simplest possible model

play37:34

so now what I'd like to do is

play37:37

obviously that this is a very simple

play37:39

model because the tokens are not talking

play37:40

to each other so given the previous

play37:42

context of whatever was generated we're

play37:45

only looking at the very last character

play37:46

to make the predictions about what comes

play37:48

next so now these uh now these tokens

play37:50

have to start talking to each other and

play37:53

figuring out what is in the context so

play37:55

that they can make better predictions

play37:56

for what comes next and this is how

play37:58

we're going to kick off the Transformer

play38:00

okay so next I took the code that we

play38:02

developed in this Jupiter notebook and I

play38:03

converted it to be a script and I'm

play38:06

doing this because I just want to

play38:08

simplify our intermediate work into just

play38:10

the final product that we have at this

play38:11

point

play38:13

so in the top here I put all the hyper

play38:15

parameters that we've defined I

play38:17

introduced a few and I'm going to speak

play38:18

to that in a little bit otherwise a lot

play38:20

of this should be recognizable

play38:21

reproducibility

play38:24

read data get the encoder in the decoder

play38:26

create the training test splits I use

play38:29

the uh kind of like data loader that

play38:32

gets a batch of the inputs and targets

play38:35

this is new and I'll talk about it in a

play38:37

second

play38:38

now this is the bigram language

play38:40

model that we developed and it can

play38:42

forward and give us a logits and loss

play38:44

and it can generate

play38:46

and then here we are creating the

play38:48

optimizer and this is the training Loop

play38:51

so everything here should look pretty

play38:53

familiar now some of the small things

play38:55

that I added number one I added the

play38:58

ability to run on a GPU if you have it

play39:00

so if you have a GPU then you can this

play39:02

will use Cuda instead of just CPU and

play39:05

everything will be a lot more faster now

play39:07

when device becomes cuda then we

play39:09

need to make sure that when we load the

play39:11

data we move it to device

play39:13

when we create the model we want to move

play39:16

the model parameters to device

play39:19

so as an example here we have the NN

play39:21

embedding table and it's got a dot

play39:23

weight inside it which stores the sort

play39:26

of lookup table so that would be moved

play39:28

to the GPU so that all the calculations

play39:30

here happen on the GPU and they can be a

play39:32

lot faster

play39:33

and then finally here when I'm creating

play39:35

the context that feeds into generate I

play39:37

have to make sure that I create on the

play39:39

device
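
A sketch of the device handling described here: the model parameters, every sampled batch, and the generation seed all get moved to the GPU when one is available.

```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = BigramLanguageModel(vocab_size)
m = model.to(device)                            # moves the embedding table's weights to the GPU

xb, yb = get_batch('train')
xb, yb = xb.to(device), yb.to(device)           # move each sampled batch to the device

context = torch.zeros((1, 1), dtype=torch.long, device=device)   # create the generation seed on the device
```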

play39:40

number two what I introduced is

play39:43

the fact that here in the training Loop

play39:47

here I was just printing the loss dot

play39:50

item

play39:51

inside the training Loop but this is a

play39:53

very noisy measurement of the current

play39:54

loss because every batch will be more or

play39:57

less lucky

play39:58

and so what I want to do usually is I

play40:02

have an estimate loss function and the

play40:04

estimated loss basically then goes up

play40:07

here

play40:08

and it averages up the loss over

play40:11

multiple batches

play40:13

so in particular we're going to iterate

play40:15

invalider times and we're going to

play40:18

basically get our loss and then we're

play40:19

going to get the average loss for both

play40:21

splits and so this will be a lot less

play40:23

noisy

play40:25

so here what we call the estimate loss

play40:27

we're going to report the pretty

play40:28

accurate train and validation loss

play40:31

Now, when we come back up, you'll notice a few things. Here I'm setting the model to evaluation phase, and down here I'm resetting it back to training phase. Right now, for our model as is, this doesn't actually do anything, because the only thing inside this model is the nn.Embedding, and this network would behave the same in both evaluation mode and training mode: we have no dropout layers, no batch norm layers, etc. But it is good practice to think through what mode your neural network is in, because some layers will have different behavior at inference time versus training time.

There's also this context manager, torch.no_grad(), and this is just telling PyTorch that everything that happens inside this function we will not call .backward() on. PyTorch can then be a lot more efficient with its memory use, because it doesn't have to store all the intermediate variables, since we're never going to call backward. So it's also good practice to tell PyTorch when we don't intend to do backpropagation.
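Putting those two ideas together, a minimal sketch of the estimate_loss helper being described (assuming `eval_iters`, `get_batch`, and `model` from the surrounding script, with the same imports) might look like this:

```python
@torch.no_grad()  # we will never call backward() in here
def estimate_loss():
    out = {}
    model.eval()  # switch layers like dropout/batchnorm to inference behavior
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()  # averaging over many batches is much less noisy
    model.train()  # reset back to training mode
    return out
```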

So right now the script is about 120 lines of code, and that's kind of our starter code. I'm calling it bigram.py and I'm going to release it later. Running this script gives us output in the terminal that looks something like this: as I ran this code it was giving me the train loss and val loss, and we see that we converge to somewhere around 2.5 with the bigram model, and then here's the sample that we produced at the end. So we have everything packaged up in the script, and we're in a good position now to iterate on this.

Okay, so we are almost ready to start writing our very first self-attention block for processing these tokens.

Now, before we actually get there, I want to get you used to a mathematical trick that is used in the self-attention inside a Transformer, and which is really at the heart of an efficient implementation of self-attention. I want to work with a toy example so you get used to this operation, and then it's going to be much clearer once we actually get to it in the script.

So let's create a B by T by C tensor, where B, T and C are just 4, 8 and 2 in this toy example. These are basically channels: we have batches, we have the time component, and we have some information (C channels) at each point in the sequence.

Now what we would like to do is the following. We have up to eight tokens here in a batch, and these eight tokens are currently not talking to each other; we would like them to talk to each other, we'd like to couple them. And in particular, we want to couple them in a very specific way: the token at the fifth location, for example, should not communicate with tokens in the sixth, seventh and eighth locations, because those are future tokens in the sequence. The token at the fifth location should only talk to the ones in the fourth, third, second and first. So information only flows from previous context to the current timestep, and we cannot get any information from the future, because we are about to try to predict the future.

So what is the easiest way for tokens to communicate? The easiest way, I would say, is this: if I'm the fifth token and I'd like to communicate with my past, the simplest thing we can do is just take an average of all the preceding elements. So if I'm the fifth token, I would like to take the channels that make up the information at my step, but also the channels from the fourth step, third step, second step and first step, and average those up. That would then become sort of a feature vector that summarizes me in the context of my history.

Now of course, just doing a sum or an average is an extremely weak form of interaction; this communication is extremely lossy. We've lost a ton of information about the spatial arrangement of all those tokens. But that's okay for now; we'll see how we can bring that information back later.

For now, what we would like to do is: for every single batch element independently, and for every t-th token in that sequence, calculate the average of the vectors of all the previous tokens and also at this token. So let's write that out. I have a small snippet here, and instead of fumbling around let me just copy-paste it and talk through it.

We're going to create xbow, where 'bow' is short for bag of words, because 'bag of words' is a term people use when you are just averaging things up: there's a word stored at every one of these eight locations, and we're doing a bag of words by averaging.

In the beginning xbow is just initialized at zero, and then I'm doing a for loop here, so we're not being efficient yet, that's coming, but for now we're just iterating over all the batch dimensions independently, and iterating over time. The previous tokens are at this batch dimension, everything up to and including the t-th token. So when we slice out x in this way, xprev becomes of shape (t+1, C): however many tokens there were up to and including this one, and then of course C, all the two-dimensional information from those tokens. That's the previous chunk of tokens from my current sequence. Then I'm just taking the average, the mean over the zeroth dimension, so I'm averaging out the time here, and I get a little C-dimensional vector which I store into xbow.

So I can run this, and it's not going to be very informative on its own, but let's see: this is x[0], the zeroth batch element, and then xbow[0]. You see how at the first location the two are equal, and that's because we're just doing an average of this one token. But this next one is now an average of these two, and this one is an average of these three, and so on. This last one is the average of all of these elements: averaging up all the tokens vertically gives this outcome here.
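As a minimal sketch, that first, inefficient version looks something like this:

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2  # batch, time, channels
x = torch.randn(B, T, C)

# version 1: explicit loops, averaging everything up to and including token t
xbow = torch.zeros((B, T, C))  # "bag of words"
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]                 # (t+1, C)
        xbow[b, t] = torch.mean(xprev, 0)  # average over time
```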

So this is all well and good, but it is very inefficient. The trick is that we can be very, very efficient about doing this using matrix multiplication. That's the mathematical trick, and let me show you what I mean with the toy example here. Let me run it and I'll explain.

I have a simple matrix a here that is a three-by-three of all ones, a matrix b of just random numbers that is three-by-two, and a matrix c, which will be three-by-three multiplied by three-by-two, which gives out a three-by-two. So here we're just using matrix multiplication: a multiplied by b gives us c.

Okay, so how are these numbers in c arrived at? This number in the top left is the first row of a dot-producted with the first column of b. And since the row of a right now is all just ones, the dot product with this column of b is just going to do a sum of that column: 2 plus 6 plus 6 is 14. The next element in the output c is also the first row of a, multiplied now with the second column of b, so 7 plus 4 plus 5 is 16. You see that there are repeating elements here: this 14 appears again because this row is again all ones and it's multiplying the first column of b, so we get 14, and so on. This last number here is the last row dot-producted with the last column.

Now the trick is the following. This a is just a boring array of all ones, but torch has this function called torch.tril, short for triangular (lower), and if you wrap torch.ones in it, it will return just the lower triangular portion. So it basically zeros out the entries above the diagonal, and we get the lower triangular part. Well, what happens if we do that?

Now we have a like this and b like this, and what are we getting in c? What is this first number? It's the first row times the first column, and because these entries are zeros, those elements are now ignored, so we just get a 2. And this number here is the first row times the second column, and because these are zeros they get ignored, and it's just 7: the 7 multiplies this 1. Look at what happened: because this row is a one followed by zeros, we ended up just plucking out the first row of b, and that's what we got.

Now here we have 1 1 0, so this row dot-producted with these two columns gives us 2 plus 6, which is 8, and 7 plus 4, which is 11. And because this last row is 1 1 1, we ended up with the addition of all of them. So basically, depending on how many ones and zeros we have in each row of a, we are doing a sum of a variable number of the rows of b, and that gets deposited into c.

So currently we're doing sums, because these entries are ones, but we can also do an average. You can start to see how we could do an average of the rows of b in this incremental fashion: we can just normalize these rows of a so that they sum to one, and then we're going to get an average. So if we took a and then did a = a / torch.sum(a, 1, keepdim=True), with keepdim as True so the broadcasting works out, and rerun this, you see that these rows now sum to one: this row is 1, this row is 0.5, 0.5, 0, and here we get one thirds.

Now when we do a multiplied by b, what are we getting? Here we're just getting the first row of b. Here we're getting the average of the first two rows: 2 and 6 average to 4, and 7 and 4 average to 5.5. And on the bottom we're getting the average of all three rows, so the average of all elements of b is deposited here.

So you can see that by manipulating the elements of this multiplying matrix and then multiplying it with any given matrix, we can do these averages in an incremental fashion, controlled by the elements of a. That's very convenient. So let's swing back up and see how we can vectorize our earlier code and make it much more efficient using what we've learned.
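Continuing with the same `torch` import, that toy trick as a small sketch:

```python
# lower-triangular ones, normalized so every row sums to 1
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b   # row t of c is the average of rows 0..t of b
```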

So in particular, we are going to produce an array like a, but here I'm going to call it wei, short for weights. This is our a, and it says how much of every row we want to average up; it's going to be an average because these rows sum to 1. And our b in this example is of course x.

So what's going to happen now is that we'll have an xbow2, and this xbow2 is going to be wei multiplying x. Let's think this through: wei is T by T, and it's matrix-multiplying, in PyTorch, a B by T by C. PyTorch will see that these shapes are not the same, so it will create a batch dimension for wei, and this becomes a batched matrix multiply: it applies the matrix multiplication to all the batch elements in parallel and individually, and for each batch element there is a T by T multiplying a T by C, exactly as we had below. So this will create a B by T by C, and xbow2 will become identical to xbow. We can check that torch.allclose(xbow, xbow2) is true, which tells us that these are in fact the same. If I just print xbow and xbow2 we're not going to be able to stare them down, but let me print xbow at the zeroth element and xbow2 at the zeroth element, just the first batch, and we should see that these are identical, which they are.

Right, so what happened here? The trick is that we were able to use a batched matrix multiply to do this aggregation. It's a weighted aggregation, and the weights are specified in this T by T array; we're doing weighted sums, and these weighted sums, according to the weights inside wei, take on this triangular form. That means a token at the t-th position will only get information from the tokens preceding it, which is exactly what we want. Finally, I'd like to rewrite it in one more way, and we're going to see why that's useful.
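Continuing the same toy tensors (B, T, C = 4, 8, 2), version 2 is just a couple of lines:

```python
# version 2: the same averaging via a batched matrix multiply
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)   # (T, T), rows sum to 1
xbow2 = wei @ x                        # (T, T) @ (B, T, C) -> (B, T, C)
torch.allclose(xbow, xbow2)            # True
```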

So this is the third version, and it's also identical to the first and second, but let me talk through it. It uses softmax. tril here is this lower-triangular matrix of ones. wei begins as all zeros: if I print wei at the beginning, it's all zero. Then I use masked_fill: wei is all zeros, and I'm saying that for all the elements where tril equals zero, make them be negative infinity. So all the elements where tril is zero become negative infinity, and this is what we get. Then the final step is softmax along every single row (dim is -1).

What is that going to do? Well, softmax is also a normalization operation, and, spoiler alert, you get the exact same matrix. Recall that in softmax we're going to exponentiate every single element and then divide by the sum. If we exponentiate every element of the first row we get a 1 and then basically zeros everywhere else, and when we normalize we just get a 1. In the second row we get 1, 1 and then zeros, and softmax divides again, giving us 0.5, 0.5, and so on. So this is another way to produce the same matrix.
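Here's version 3 as a sketch, continuing from the tensors above:

```python
import torch.nn.functional as F

# version 3: the same matrix via masking and softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))                        # affinities start at zero
wei = wei.masked_fill(tril == 0, float('-inf'))  # future positions can't be aggregated
wei = F.softmax(wei, dim=-1)                     # exponentiate and normalize each row
xbow3 = wei @ x
torch.allclose(xbow, xbow3)                      # True
```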

Now, the reason this version is a bit more interesting, and the reason we're going to end up using it in self-attention, is this: these weights begin at zero, and you can think of them as an interaction strength, or an affinity. They're telling us how much of each token from the past we want to aggregate and average up. Then the masking line is saying that tokens from the future cannot communicate: by setting those entries to negative infinity, we're saying we will not aggregate anything from those tokens. That then goes through softmax, and the aggregation happens through the matrix multiplication.

So what this is now: these zeros are currently just set by us to be zero, but a quick preview is that these affinities between the tokens are not going to stay constant at zero. They're going to be data dependent: these tokens are going to start looking at each other, and some tokens will find other tokens more or less interesting; depending on what their values are, they're going to find each other interesting to different amounts, and I'm going to call those affinities. Then here we are saying the future cannot communicate with the past, we're going to clamp those entries, and when we normalize and sum, we're going to aggregate their values depending on how interesting they find each other.

So that's the preview for self-attention. Long story short from this entire section: you can do weighted aggregations of your past elements by using a matrix multiplication with a lower triangular matrix, and the elements in the lower triangular part tell you how much of each element fuses into a given position. We're going to use this trick now to develop the self-attention block. But first, let's get some quick preliminaries out of the way.

First, the thing I'm kind of bothered by is that we're passing vocab_size into the constructor. There's no need to do that, because vocab_size is already defined up top as a global variable, so there's no need to pass this stuff around.

The next thing I want to do is introduce a level of indirection here, where we don't go directly from the embedding to the logits, but instead go through an intermediate phase, because we're going to start making that bigger. So let me introduce a new variable, n_embd, short for number of embedding dimensions. n_embd here will be, say, 32; that was a suggestion from GitHub Copilot, by the way, and 32 is a good number. So this is now an embedding table of 32-dimensional embeddings.

Then here, this is not going to give us logits directly; instead it's going to give us token embeddings, which is what I'm going to call them. To go from the token embeddings to the logits, we're going to need a linear layer: self.lm_head, let's call it, short for language modeling head, is an nn.Linear from n_embd up to vocab_size. And then when we swing over here, we actually get the logits by exactly what Copilot suggests. We have to be careful, because this C and this C are not equal: one is n_embd and the other is vocab_size. So let's just say that n_embd is the C here. This creates one spurious layer of interaction through a linear layer, but it should basically run, and we see that it runs. It currently looks kind of spurious, but we're going to build on top of it.

Next up: so far we've taken these indices and we've encoded them based purely on the identity of the tokens inside idx. The next thing that people very often do is to encode not just the identity of these tokens but also their position. So we're going to have a second embedding table here: self.position_embedding_table is an nn.Embedding of block_size by n_embd, so each position from 0 to block_size minus 1 will also get its own embedding vector.

Then in forward, first let me decode B and T from idx.shape, and here we also compute pos_emb, the positional embedding, from torch.arange(T), basically just the integers from 0 to T minus 1, and all of those integers get embedded through the table to create a T by C. Then this gets renamed to just x, and x will be the addition of the token embeddings with the positional embeddings. Here the broadcasting will work out: B by T by C plus T by C gets right-aligned, a new dimension of one gets added, and it gets broadcast across the batch.

So at this point x holds not just the token identities but also the positions at which these tokens occur. This is currently not that useful, because of course we just have a simple bigram model, so it doesn't matter if you're in the fifth position or the second position or wherever; it's all translation invariant at this stage, so this information currently wouldn't help. But as we work on the self-attention block we'll see that it starts to matter.
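A minimal sketch of the model at this intermediate stage (assuming `vocab_size`, `n_embd`, and `block_size` are defined as globals in the script) could look roughly like this:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)  # language modeling head

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)            # (B, T, C)
        pos_emb = self.position_embedding_table(
            torch.arange(T, device=idx.device))              # (T, C)
        x = tok_emb + pos_emb        # broadcast: positions added across the batch
        logits = self.lm_head(x)     # (B, T, vocab_size)

        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss
```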

Okay, so now we get to the crux of self-attention. This is probably the most important part of this video to understand. We're going to implement a small self-attention for a single individual 'head', as they're called.

We start off with where we were, so all of this code is familiar, except right now I'm working with an example where I've changed the number of channels from 2 to 32. So we have a 4 by 8 arrangement of tokens, and the information at each token is currently 32-dimensional, but we're just working with random numbers.

We saw that the code as we had it before does a simple average of all the past tokens and the current token: the previous information and the current information are just being mixed together in an average. It does so by creating this lower triangular structure, which lets us mask out the weight matrix that we create; we mask it out and then we normalize it. And currently, when we initialize the affinities between all the different tokens, or nodes (I'm going to use those terms interchangeably), to be zero, we see that wei gives us this structure where every single row has uniform numbers, and that's what makes this matrix multiply do a simple average.

Now, we don't actually want this to be all uniform, because different tokens will find different other tokens more or less interesting, and we want that to be data dependent. For example, if I'm a vowel, then maybe I'm looking for consonants in my past, and maybe I want to know what those consonants are and have that information flow to me. So I want to gather information from the past, but I want to do it in a data-dependent way, and this is the problem that self-attention solves.

The way self-attention solves it is the following: every single node, or every single token at each position, will emit two vectors. It will emit a query, and it will emit a key. The query vector, roughly speaking, is 'what am I looking for', and the key vector, roughly speaking, is 'what do I contain'. And the way we get affinities between these tokens in a sequence is that we basically just do a dot product between the keys and the queries: my query dot-products with all the keys of all the other tokens, and that dot product becomes wei. So if a key and a query are sort of aligned, they will interact to a very high amount, and then I will get to learn more about that specific token, as opposed to any other token in the sequence. So let's implement this.

We're going to implement a single 'head' of self-attention, as it's called. So this is just one head. There's a hyperparameter involved with these heads, which is the head size, and here I'm initializing the linear modules with bias=False, so these just apply a matrix multiply with some fixed weights. Now let me produce a k and a q by forwarding these modules on x. The size of each will now be B by T by 16, because 16 is the head size.

You see that when I forward this linear on top of my x, all the tokens in all the positions in the B by T arrangement, in parallel and independently, produce a key and a query; no communication has happened yet. But the communication comes now: all the queries will dot-product with all the keys. Basically we want wei, the affinities between these tokens, to be query multiplying key. But we can't matrix-multiply them directly; we actually need to transpose k, and we have to be careful because of the batch dimension. In particular, we want to transpose the last two dimensions, dimension -2 and dimension -1. So this matrix multiply will do the following: a B by T by 16 matrix-multiplies a B by 16 by T, to give us B by T by T. So for every batch element we now get a T-square matrix giving us the affinities, and these are now the wei: they're not zeros, they now come from this dot product between the keys and the queries. This can now run, and the weighted aggregation is now a function, in a data-dependent manner, of the keys and queries of these nodes.
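As a sketch of just that piece (random data standing in for the real activations):

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)    # (B, T, 16): every token independently emits a key
q = query(x)  # (B, T, 16): ...and a query

# communication: every query dot-products with every key
wei = q @ k.transpose(-2, -1)   # (B, T, 16) @ (B, 16, T) -> (B, T, T)
```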

Just inspecting what happened here: wei takes on this form. You see that before, wei was just a constant, applied in the same way to all the batch elements, but now every single batch element will have a different wei, because every batch element contains different tokens at different positions. So this is now data dependent. When we look at just the zeroth batch element, for example, these are the weights that came out, and you can see that they're not exactly uniform anymore.

As an example, take the last row: this is the eighth token, and the eighth token knows what content it has and at what position it's in. Based on that, it creates a query: 'hey, I'm looking for this kind of stuff, I'm a vowel, I'm at the eighth position, I'm looking for any consonants at positions up to four.' Then all the nodes get to emit keys, and maybe one of the channels could be 'I am a consonant and I am at a position up to four.' That key would have a high number in that specific channel, and that's how the query and the key, when they dot-product, can find each other and create a high affinity. And when they have a high affinity (say this token was pretty interesting to the eighth token), then through the softmax I will end up aggregating a lot of its information into my position, and so I'll get to learn a lot about it.

Right now we're looking at wei after everything has already happened, so let me erase the masking and the softmax to show you the under-the-hood internals and how this works. Without the masking and the softmax, wei comes out like this: these are the raw outputs of the dot products, and they take on values from, you know, negative two to positive two, etc. That's the raw interaction, the raw affinities between all the nodes. But if I'm the fifth node, I will not want to aggregate anything from the sixth, seventh and eighth nodes, so we mask out the upper triangular part so that those are not allowed to communicate. And then we want a nice distribution; we don't want to aggregate negative 0.11 of a node, that's crazy, so instead we exponentiate and normalize, and we get a nice distribution that sums to one. That distribution tells us, in a data-dependent manner, how much information to aggregate from each of these tokens in the past. So that's wei: it's not zeros anymore, it's calculated this way.

Now, there's one more part to a single self-attention head, and that is that when we do the aggregation, we don't actually aggregate the tokens x exactly; we produce one more vector, and we call it the value. So in the same way that we produced key and query, we're also going to create a value. And then here we don't aggregate x; we calculate v, which is just obtained by propagating this linear on top of x again, and then we output wei multiplied by v. So v is the vector that we aggregate, instead of the raw x, and of course this makes the output of the single head 16-dimensional, because that is the head size.

You can think of x as private information belonging to this token, if you think about it that way. So x is kind of private to this token: I'm the fifth token, I have some identity, and my information is kept in the vector x. And now, for the purposes of this single head: here's what I'm interested in (the query), here's what I have (the key), and if you find me interesting, here's what I will communicate to you, and that's stored in v. So v is the thing that gets aggregated, for the purposes of this single head, between the different nodes. And that's basically the self-attention mechanism; this is what it does.
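Continuing the sketch above, the masking, softmax, and value aggregation would look roughly like this:

```python
import torch.nn.functional as F

value = nn.Linear(C, head_size, bias=False)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # future tokens can't be attended to
wei = F.softmax(wei, dim=-1)                     # data-dependent weights, rows sum to 1

v = value(x)     # (B, T, 16): what each token communicates if attended to
out = wei @ v    # (B, T, 16): weighted aggregation of the values
```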

There are a few notes I'd like to make about attention. Number one: attention is a communication mechanism. You can really think about it as a communication mechanism where you have a number of nodes in a directed graph, with edges pointing between them. Every node has some vector of information, and it gets to aggregate information, via a weighted sum, from all the nodes that point to it, and this is done in a data-dependent manner, depending on whatever data is actually stored at each node at any point in time.

Our graph has a particular structure: we have eight nodes, because the block size is eight and there are always eight tokens. The first node is pointed to only by itself, the second node is pointed to by the first node and itself, and so on, all the way up to the eighth node, which is pointed to by all the previous nodes and itself. That's the structure our directed graph happens to have in an autoregressive scenario like language modeling, but in principle attention can be applied to any arbitrary directed graph; it's just a communication mechanism between the nodes.

The second note is that there is no notion of space. Attention simply acts over a set of vectors in this graph, so by default these nodes have no idea where they are positioned in space. That's why we need to encode them positionally and give them some information that is anchored to a specific position, so that they know where they are. This is different from, for example, convolution: if you run a convolution operation over some input, there is a very specific layout of the information in space, and the convolutional filters act in space. Attention, in contrast, is just a set of vectors out there; they communicate, and if you want them to have a notion of space, you need to add it specifically, which is what we did when we calculated the positional encodings and added that information to the vectors.

The next thing I hope is very clear is that the elements across the batch dimension, which are independent examples, never talk to each other; they are always processed independently. This is a batched matrix multiply that applies the matrix multiplication in parallel across the batch dimension. So maybe it would be more accurate to say that, in this analogy of a directed graph, because the batch size is four, we really have four separate pools of eight nodes, and those eight nodes only talk to each other. In total there are 32 nodes being processed, but in four separate pools of eight; you can look at it that way.

The next note is that here, in the case of language modeling, we have this specific structure of directed graph where the future tokens will not communicate to the past tokens. But this doesn't have to be the constraint in the general case; in fact, in many cases you may want to have all of the nodes talk to each other fully. As an example, if you're doing sentiment analysis or something like that with a Transformer, you might have a number of tokens and you may want them all to talk to each other fully, because later you're predicting, say, the sentiment of the sentence, and it's fine for those nodes to talk to each other.

In those cases you will use an encoder block of self-attention, and all 'encoder block' means is that you delete the masking line of code, allowing all the nodes to completely talk to each other. What we're implementing here is sometimes called a decoder block, and it's called a decoder because it's decoding language: it has this autoregressive format where you have to mask with the triangular matrix, so that nodes from the future never talk to the past, because they would give away the answer. So in encoder blocks you would delete the mask and allow all the nodes to talk; in decoder blocks it will always be present so that you have this triangular structure. But both are allowed; attention doesn't care, attention supports arbitrary connectivity between nodes.

The next thing I wanted to comment on: you keep hearing me say attention, self-attention, etc., but there's also something called cross-attention. What is the difference? Basically, the reason this attention is self-attention is because the keys, queries and values all come from the same source, from x: the same source x produces keys, queries and values, so these nodes are self-attending. But in principle attention is much more general than that. For example, in encoder-decoder Transformers you can have a case where the queries are produced from x, but the keys and the values come from a whole separate external source, sometimes from encoder blocks that encode some context we'd like to condition on. So the keys and values come from a whole separate source, from nodes on the side, and here we're just producing queries and reading off information from the side. Cross-attention is used when there's a separate source of nodes we'd like to pull information from into our nodes, and it's self-attention if we just have nodes that would like to look at each other and talk to each other. This attention here happens to be self-attention, but in principle attention is a lot more general.

Okay, the last note at this stage: if we go to the 'Attention Is All You Need' paper, we've already implemented attention: given query, key and value, we multiply the query with the key, we softmax it, and then we aggregate the values. There's one more thing we're missing, which is the division by one over the square root of the head size; the dk here is the head size. Why are they doing this? It's important. They call it scaled attention, and it's an important normalization to have.

The problem is this: if you have unit Gaussian inputs, so k and q are zero-mean, unit-variance, and you just compute wei naively, then the variance of wei will be on the order of the head size, which in our case is 16. But if you multiply by one over the square root of the head size, then the variance of wei will be one, so it's preserved.

Why is this important? You'll notice that wei feeds into softmax, and so it's really important, especially at initialization, that wei be fairly diffuse. In our case here we sort of lucked out and wei had fairly diffuse numbers. The problem is that, because of how softmax works, if wei takes on very positive and very negative values inside it, softmax will converge towards one-hot vectors. I can illustrate that: if we apply softmax to a tensor of values that are very close to zero, we get a diffuse distribution out; but the moment I take the exact same values and start sharpening them, making them bigger by multiplying by, say, eight, you'll see that the softmax starts to sharpen; in fact it sharpens towards whichever number is the highest. So we don't want these values to be too extreme, especially at initialization; otherwise softmax will be way too peaky and every node will basically aggregate information from a single other node. That's not what we want, especially at initialization, and the scaling is used just to control the variance at initialization.
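A small sketch of both points, the scaling and the softmax sharpening, continuing the earlier tensors:

```python
# scaled attention: dividing by sqrt(head_size) keeps var(wei) near 1
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

# softmax sharpens toward a one-hot as its inputs grow in magnitude
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(F.softmax(logits, dim=-1))      # fairly diffuse
print(F.softmax(logits * 8, dim=-1))  # much peakier, concentrating on the max
```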

Okay, so having said all that, let's now take our self-attention knowledge and take it for a spin. Here in the code I created a Head module that implements a single head of self-attention. You give it a head size, and then it creates the key, query, and value linear layers; typically people don't use biases in these. Those are the linear projections that we're going to apply to all of our nodes.

Here I'm also creating this tril variable. tril is not a parameter of the module, so in PyTorch naming conventions this is called a buffer; it's not a parameter, and you have to assign it to the module using register_buffer. That creates the lower triangular matrix.

When we're given the input x, this should look very familiar now: we calculate the keys and the queries, we compute the attention scores inside wei, and we normalize them, so we're using scaled attention here. Then we make sure that the future doesn't communicate with the past (this is what makes it a decoder block), then softmax, then aggregate the values and output.

Then here in the language model, I'm creating a head in the constructor and calling it the self-attention head, and the head size I'm going to keep the same as n_embd, just for now. Once we've encoded the information with the token embeddings and the position embeddings, we simply feed it into the self-attention head, and the output of that goes into the language modeling head to create the logits. This is the simplest way to plug a self-attention component into our network right now.

I had to make one more change, which is in generate. Because we're now using positional embeddings, we can never feed in more than block_size tokens: if idx is longer than block_size, our position embedding table runs out of scope, because it only has embeddings for up to block_size. So I added some code here to crop the context that we feed into the model, so that we never pass in more than block_size elements.
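A minimal sketch of that Head module at this stage of the script (assuming `n_embd` and `block_size` globals and the imports used earlier, with no dropout yet):

```python
class Head(nn.Module):
    """One head of self-attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # tril is not a parameter, so register it as a buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                    # (B, T, head_size)
        q = self.query(x)                                  # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # scaled attention scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # decoder mask
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)
        return wei @ v                                     # (B, T, head_size)
```

And in generate, the cropping described above would amount to something like `idx_cond = idx[:, -block_size:]` before each forward pass.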

So those are the changes. Let's now train the network. I also came up to the script here and decreased the learning rate, because self-attention can't tolerate very high learning rates, and I increased the number of iterations because the learning rate is lower. Then I trained it. Previously we were only able to get down to about 2.5, and now we are down to 2.4, so we definitely see a little bit of an improvement, from 2.5 to 2.4 roughly. The text is still not amazing, so clearly the self-attention head is doing some useful communication, but we still have a long way to go.

Okay, so now we've implemented the scaled dot-product attention. Next up in the 'Attention Is All You Need' paper there's something called multi-head attention. What is multi-head attention? It's just applying multiple attentions in parallel and concatenating the results. They have a little diagram for it; I don't know if it's super clear, but it really is just multiple attentions in parallel.

So let's implement that; it's fairly straightforward. If we want multi-head attention, then we want multiple heads of self-attention running in parallel. In PyTorch we can do this by simply creating multiple heads, however many heads you want, each with some head size, then running all of them in parallel into a list and simply concatenating all of the outputs over the channel dimension.

The way this looks now is that instead of a single attention with a head size of 32 (remember n_embd is 32), instead of one communication channel we now have four communication channels in parallel, and each one of these channels will correspondingly be smaller. Because we have four communication channels, we want eight-dimensional self-attention, so from each channel we gather eight-dimensional vectors, and four of them concatenate to give us 32, the original n_embd. This is kind of similar to group convolutions, if you're familiar with those: instead of having one large convolution, we do it in groups. That's multi-headed self-attention. And so then here we just use sa_heads, the self-attention heads, instead.
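A minimal sketch of that module, reusing the Head class from above:

```python
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # concatenate the per-head outputs over the channel dimension
        return torch.cat([h(x) for h in self.heads], dim=-1)

# in the model: e.g. 4 heads of 8-dimensional attention, concatenating back to 32
# self.sa_heads = MultiHeadAttention(4, n_embd // 4)
```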

I actually ran it, and scrolling down, we now get the loss down to about 2.28. The generation is still not amazing, but clearly the validation loss is improving, because we were at 2.4 just now. So it helps to have multiple communication channels, because obviously these tokens have a lot to talk about: they want to find the consonants, the vowels, the vowels from certain positions, all kinds of different things. It helps to create multiple independent channels of communication, gather lots of different types of data, and then decode the output.

play84:27

for a second of course I didn't explain

play84:28

this figure in full detail but we are

play84:31

starting to see some components of what

play84:32

we've already implemented we have the

play84:33

positional encodings the token encodings

play84:35

that add we have the masked multi-headed

play84:38

attention implemented now here's another

play84:41

multi-headed tension which is a cross

play84:42

attention to an encoder which we haven't

play84:45

we're not going to implement in this

play84:46

case I'm going to come back to that

play84:48

later

play84:49

but I want you to notice that there's a

play84:50

feed forward part here and then this is

play84:52

grouped into a block that gets repeated

play84:54

again and again

play84:55

now the feed forward part here is just a

play84:57

simple multi-layer perceptron

play85:00

um

play85:01

so the multi-headed so here position

play85:03

wise feed forward networks is just a

play85:06

simple little MLP

play85:08

so I want to start basically in a

play85:09

similar fashion also adding computation

play85:11

into the network

play85:13

and this computation is on the per node

play85:15

level so

play85:17

I've already implemented it, and you can see the diff highlighted on the left here where I've added or changed things. Before, we had the multi-headed self-attention that did the communication, but we went way too fast to calculate the logits, so the tokens looked at each other but didn't really have a lot of time to think on what they found from the other tokens. So what I've implemented here is a little single-layer feed-forward: this little layer is just a linear followed by a ReLU nonlinearity, and that's it. It's just a little layer, and then I call it FeedForward(n_embd).
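A minimal sketch of that per-token feed-forward, assuming the same n_embd convention as before:

```python
import torch.nn as nn

n_embd = 32  # assumed embedding dimension at this point in the lecture

class FeedForward(nn.Module):
    """A simple per-token linear layer followed by a ReLU nonlinearity."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        # applied independently to every token: (B, T, n_embd) -> (B, T, n_embd)
        return self.net(x)

# in the language model's forward, roughly:
#   x = self.sa_heads(x)   # communication
#   x = self.ffwd(x)       # computation, per token
```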

And then this feed-forward is just called sequentially right after the self-attention: we self-attend, then we feed forward. You'll notice that the feed-forward here, when it applies the linear, does so on a per-token level; all the tokens do this independently. So the self-attention is the communication, and once they've gathered all the data, they need to think on that data individually. That's what the feed-forward is doing, and that's why I've added it here. Now when I train this, the validation loss actually continues to go down, now to 2.24, which is down from 2.28. The outputs still look kind of terrible, but at least we've improved the situation.

And so as a preview, we're going to now start to intersperse the communication with the computation, and that's also what the Transformer does when it has blocks that communicate and then compute, and it groups them and replicates them. Okay, so let me show you what we'd like to do. We'd like to do something like this: we have a Block, and this Block is basically this part of the figure here, except for the cross-attention.

Now, the Block basically intersperses communication and then computation: the communication is done using multi-headed self-attention, and the computation is done using the feed-forward network on all the tokens independently. You'll also notice that this takes the embedding dimension and the number of heads that we would like, which is kind of like the group size in a group convolution. I'm saying that the number of heads we'd like is four, and because n_embd is 32, the head size should be eight so that everything works out channel-wise; this is typically how the Transformer structures its sizes. So the head size will become eight, and this is how we want to intersperse them. And then here I'm creating blocks, which is just a sequential application of Block, Block, Block, so that we're interspersing communication and feed-forward many, many times, and then finally we decode.
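Putting those two pieces together, a rough sketch of the Block at this stage (before the residual connections and layer norms that come next) could look like the following, reusing the MultiHeadAttention and FeedForward sketches above:

```python
import torch.nn as nn

class Block(nn.Module):
    """Transformer block: communication (self-attention) then computation (feed-forward).
    MultiHeadAttention and FeedForward are assumed from the sketches above."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head          # e.g. 32 // 4 = 8, so the channels work out
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)     # communicate
        x = self.ffwd(x)   # compute, per token
        return x

# in the model constructor, interspersed several times, roughly:
# self.blocks = nn.Sequential(
#     Block(n_embd, n_head=4),
#     Block(n_embd, n_head=4),
#     Block(n_embd, n_head=4),
# )
```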

Now, I actually tried to run this, and the problem is that it doesn't give a very good result. The reason is that we're starting to get a pretty deep neural net, and deep neural nets suffer from optimization issues; I think that's what we're starting to run into. So we need one more idea that we can borrow from the Transformer paper to resolve those difficulties. There are two optimizations that dramatically help with the depth of these networks and make sure that the networks remain optimizable. Let's talk about the first one.

The first one, in this diagram, is that you see this arrow here, and then this arrow and this arrow: those are skip connections, or sometimes called residual connections. They come from the paper "Deep Residual Learning for Image Recognition" from about 2015, which introduced the concept. What it means is that you transform the data, but then you have a skip connection, with addition, from the previous features. The way I prefer to visualize it is the following: the computation happens from top to bottom, and you have this residual pathway, and you are free to fork off from the residual pathway, perform some computation, and then project back onto the residual pathway via addition. So you go from the inputs to the targets only via plus and plus and plus. The reason this is useful is that during backpropagation (remember from our micrograd video earlier) addition distributes gradients equally to both of the branches that fed in as its inputs. So the gradients from the loss basically hop through every addition node all the way to the input, and then also fork off into the residual blocks. You have this gradient superhighway that goes directly from the supervision all the way to the input, unimpeded, and the residual blocks are usually initialized so that in the beginning they contribute very, very little, if anything, to the residual pathway; they are initialized that way. So in the beginning they are almost as if they're not there, but then during the optimization they come online over time and start to contribute. At least at initialization, you can go directly from the supervision to the input: the gradient is unimpeded and just flows, and the blocks kick in over time, and that dramatically helps with the optimization.

So let's implement this. Coming back to our Block here, basically what we want to do is x = x + self.sa(x) and x = x + self.ffwd(x). So this is x, and then we fork off and do some communication and come back, and we fork off and do some computation and come back: those are residual connections. Then, swinging back up here, we also have to introduce this projection, an nn.Linear. After we concatenate the heads, the result is n_embd-dimensional; this is the output of the self-attention itself, but then we actually want to apply the projection, and that's the result. So the projection is just a linear transformation of the outcome of this layer, projecting back into the residual pathway.
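Here is a sketch of how those two changes might slot in: the projection inside the multi-head wrapper, and the residual additions inside the Block. Head, FeedForward and n_embd are assumed from the earlier sketches.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention with a projection back into the residual pathway."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])  # Head as sketched earlier
        self.proj = nn.Linear(n_embd, n_embd)  # linear projection of the concatenated head outputs

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        return self.proj(out)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        # residual connections: fork off, compute, and add back onto the pathway
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x
```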

And then here, in the feed-forward, it's going to be the same thing. I could have a self.proj here as well, but let me just simplify it and couple it inside the same Sequential container, so this is the projection layer going back into the residual pathway. And that's it, so now we can train this. I implemented one more small change: when you look into the paper again, you see that the dimensionality of input and output is 512 for them, and they're saying that the inner layer of the feed-forward has a dimensionality of 2048, so there's a multiplier of four. So the inner layer of the feed-forward network should be multiplied by four in terms of channel sizes. I came here and multiplied by four times n_embd for the feed-forward, and then from four times n_embd we come back down to n_embd at the projection. So we're adding a bit of computation here, growing the layer that sits in the residual block, on the side of the residual pathway.
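With the wider inner layer and the projection coupled inside the same Sequential container, the feed-forward sketch becomes roughly:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token MLP with a 4x wider inner layer and a projection back to n_embd."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # grow the inner layer (512 -> 2048 in the paper)
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # projection back into the residual pathway
        )

    def forward(self, x):
        return self.net(x)
```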

And then I trained this, and we actually get down all the way to a 2.08 validation loss. We also see that the network is starting to get big enough that our train loss is getting ahead of the validation loss, so we're starting to see a little bit of overfitting. Our generations here are still not amazing, but at least you see phrases here, like "this now grieve sank", that start to almost look like English. So yeah, we're starting to really get there.

Okay, and the second innovation that is very helpful for optimizing very deep neural networks is right here: we have this addition, which is the residual part, but this "Norm" is referring to something called layer norm. Layer norm is implemented in PyTorch; it's a paper that came out a while back. Layer norm is very, very similar to batch norm. Remember, back in part three of our makemore series we implemented batch normalization, and batch normalization basically made sure that, across the batch dimension, any individual neuron had a unit Gaussian distribution: zero mean and one standard deviation at the output.

So what I did here is copy-paste the BatchNorm1d that we developed in our makemore series. And here we can initialize this module, for example, and feed a batch of 32 hundred-dimensional vectors through the batch norm layer. What this does is guarantee that when we look at just the zeroth column, it has zero mean and one standard deviation, so it's normalizing every single column of this input. The rows, on the other hand, are not going to be normalized by default, because we're only normalizing columns. So let's now implement layer norm. It's "very complicated": we come here and change this dimension from 0 to 1, so we don't normalize the columns, we normalize the rows, and now we've implemented layer norm.
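For reference, here is a sketch of that dimension flip, written in the style of the makemore BatchNorm1d class described above (a sketch in that spirit, not a verbatim copy of the notebook code):

```python
import torch

class LayerNorm1d:
    """The makemore-style batch norm with the normalization dimension flipped
    from 0 (columns / batch) to 1 (rows / features)."""
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)   # trainable scale
        self.beta = torch.zeros(dim)   # trainable shift
        # note: no running buffers and no train/eval distinction are needed,
        # because nothing is computed across examples

    def __call__(self, x):
        xmean = x.mean(1, keepdim=True)   # mean over the feature dimension (rows)
        xvar = x.var(1, keepdim=True)     # variance over the feature dimension
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100)          # batch of 32 hundred-dimensional vectors
x = module(x)
print(x[0].mean(), x[0].std())    # each row is now ~zero mean, ~unit std
```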

So now the columns are not going to be normalized, but the rows are: every individual example's 100-dimensional vector is normalized in this way. And because our computation now does not span across examples, we can delete all of the buffer machinery, because we can always apply this operation and don't need to maintain any running buffers. So we don't need the buffers, there's no distinction between training and test time, and we don't need the momentum; we don't care whether it's training or not. We do keep gamma and beta, the trainable parameters. And this is now a layer norm: it normalizes the rows instead of the columns, and what we have here is basically identical to the PyTorch LayerNorm we're about to use.

So let's now implement layer norm in our Transformer. Before I incorporate it, I just wanted to note that, as I said, very few details about the Transformer have changed in the last five years, but this is actually something that slightly departs from the original paper. You see that the "Add & Norm" is applied after the transformation, but it is now more common to apply the layer norm before the transformation, so there's a reshuffling of the layer norms. This is called the pre-norm formulation, and that's the one we're going to implement as well, so a slight deviation from the original paper.

Basically, we need two layer norms: layer norm one is an nn.LayerNorm, and we tell it the embedding dimension, and we need a second layer norm. Then here the layer norms are applied immediately on x: self.ln1 is applied on x, and self.ln2 is applied on x, before they go into the self-attention and the feed-forward. The size of the layer norm here is n_embd, which is 32, so when the layer norm is normalizing our features, the mean and the variance are taken over 32 numbers; the batch and the time dimensions both act as batch dimensions. So this is kind of like a per-token transformation that just normalizes the features and makes them zero mean and unit Gaussian at initialization.
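Folding the pre-norm formulation into the Block, a sketch could look like this, again reusing the MultiHeadAttention and FeedForward sketches from above:

```python
import torch.nn as nn

class Block(nn.Module):
    """Transformer block with the pre-norm formulation: layer norms are applied
    to x before the self-attention and before the feed-forward."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)  # assumed from the earlier sketches
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # normalize per token, then communicate
        x = x + self.ffwd(self.ln2(x))   # normalize per token, then compute
        return x
```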

But of course, because these layer norms have trainable gamma and beta parameters inside them, the layer norm may eventually create outputs that are not unit Gaussian; the optimization will determine that. So for now, this is incorporating the layer norms, and let's train it up.

Okay, so I let it run, and we see that we get down to 2.06, which is better than the previous 2.08, so a slight improvement from adding the layer norms, and I'd expect them to help even more if we had a bigger and deeper network. One more thing I forgot to add: there should typically also be a layer norm at the end of the Transformer, right before the final linear layer that decodes into the vocabulary, so I added that as well. At this stage we actually have a pretty complete Transformer according to the original paper, and it's a decoder-only Transformer; I'll talk about that in a second. The major pieces are in place, so we can try to scale this up and see how far we can push this number.

Now, in order to scale up the model I had to perform some cosmetic changes to make it nicer. I introduced a variable called n_layer, which just specifies how many layers of blocks we're going to have; I create a bunch of blocks, and we have a new variable for the number of heads as well. I pulled out the layer norm here, so this is identical. One thing I did briefly change is that I added dropout. Dropout is something you can add right before the connection back into the residual pathway: we can drop out as the last layer here in the feed-forward, we can drop out at the end of the multi-headed attention as well, and we can also drop out here when we calculate the affinities, after the softmax, so that we randomly prevent some of the nodes from communicating.
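The three dropout placements described here could be sketched as follows, assuming the modules from the earlier sketches; the dropout value is the 0.2 used later in the lecture, treated here as a module-level constant:

```python
import torch
import torch.nn as nn

dropout = 0.2  # assumed value; the lecture uses 0.2 when scaling up

# Placement 1, inside Head.forward, right after the softmax over the affinities:
#   wei = F.softmax(wei, dim=-1)
#   wei = self.dropout(wei)            # randomly prevent some nodes from communicating
# with self.dropout = nn.Dropout(dropout) added in Head.__init__.

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])  # Head as sketched earlier
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)   # placement 2: right before the residual connection back

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),             # placement 3: last layer, before the residual connection back
        )

    def forward(self, x):
        return self.net(x)
```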

Dropout comes from a paper from around 2014, and basically it takes your neural net and, on every forward-backward pass, randomly shuts off some subset of neurons: it drops them to zero and trains without them. Because the mask of what's being dropped out changes on every single forward-backward pass, this effectively ends up training an ensemble of sub-networks, and then at test time everything is fully enabled and all of those sub-networks are merged into a single ensemble, if you want to think about it that way. I would read the paper for the full details; for now we're just going to stay at the level of "this is a regularization technique". I added it because I'm about to scale up the model quite a bit and I was concerned about overfitting.

So now, when we scroll up to the top, we'll see that I changed a number of hyperparameters for our neural net. I made the batch size much larger, now 64. I changed the block size to 256, so previously it was just eight characters of context, now it is 256 characters of context to predict the 257th. I brought the learning rate down a little bit, because the neural net is now much bigger. The embedding dimension is now 384, and there are six heads; 384 divided by 6 means every head is 64-dimensional, as is standard. There are going to be six layers of that, and the dropout is 0.2, so on every forward-backward pass 20 percent of all of these intermediate calculations are disabled and dropped to zero.
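Collected in one place, the scaled-up configuration described above looks roughly like this; the exact learning-rate value is an assumption, since the lecture only says it was lowered:

```python
# scaled-up hyperparameters as described in the lecture
batch_size = 64       # sequences processed in parallel
block_size = 256      # context length: 256 characters to predict the 257th
learning_rate = 3e-4  # assumed value; the lecture only says it was brought down for the bigger net
n_embd = 384          # embedding dimension
n_head = 6            # 384 / 6 = 64 dimensions per head
n_layer = 6           # number of Blocks
dropout = 0.2         # 20% of intermediate activations dropped on each forward/backward pass
```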

I already trained this and ran it, so, drum roll, how well does it perform? Let me just scroll up here: we get a validation loss of 1.48, which is quite a bit of an improvement on what we had before, which I think was 2.07. So we went from 2.07 all the way down to 1.48 just by scaling up this neural net with the code that we have. This, of course, ran for a lot longer; it trained for about 15 minutes on my A100 GPU, which is a pretty good GPU. If you don't have a GPU you're not going to be able to reproduce this; I would not run this on a CPU or a MacBook or something like that. You'd have to bring down the number of layers, the embedding dimension and so on. But in about 15 minutes we can get this kind of a result.

I'm printing some of the Shakespeare here, but what I also did is print 10,000 characters, so a lot more, and I wrote them to a file. Here we see some of the outputs, and it's a lot more recognizable as being like the input text file. The input text file, just for reference, looks like this: there's always someone speaking in this manner, and our predictions now take on that form, except of course they're nonsensical when you actually read them: "it is every crimpy bee house, oh those preparation we give heed", "Oho sent me, you mighty Lord", and so on. Anyway, you can read through this; it's nonsensical of course, but this is just a Transformer trained at the character level on 1 million characters that come from Shakespeare, so it sort of blabbers on in a Shakespeare-like manner, but it doesn't of course make sense at this scale. Still, I think it's a pretty good demonstration of what's possible.

So now, I think that kind of concludes the programming section of this video. We basically did a pretty good job of implementing this Transformer, but the picture doesn't exactly match up to what we've done, so what's going on with all these additional parts? Let me finish explaining this architecture and why it looks so funky. Basically, what we implemented here is a decoder-only Transformer: there's no encoder component (this part here is called the encoder), and there's no cross-attention block. Our block only has the self-attention and the feed-forward, so it is missing this third, in-between piece, which does cross-attention. We don't have the encoder, we just have the decoder. The reason we have a decoder only is because we are just generating text and it's unconditioned on anything; we're just babbling on according to a given dataset. What makes it a decoder is that we are using the triangular mask in our Transformer, so it has this autoregressive property where we can just go and sample from it. The fact that it uses the triangular mask to mask out the attention makes it a decoder, and it can be used for language modeling.

Now, the reason the original paper has an encoder-decoder architecture is because it is a machine translation paper, so it is concerned with a different setting. In particular, it expects some tokens that encode, say, French, and then it is expected to decode the translation into English. Typically these here are special tokens: you are expected to read this in and condition on it, and then you start off the generation with a special token called start, a special new token that you introduce and always place at the beginning. The network is then expected to output "neural networks are awesome" followed by a special end token to finish the generation. So this part here will be decoded exactly as we've done it; "neural networks are awesome" will be identical to what we did. But unlike what we did, they want to condition the generation on some additional information, and in this case that additional information is the French sentence that they are supposed to be translating.

So what they do is bring in the encoder. The encoder reads this part here: we only take the French part, we create tokens from it exactly as we've seen in our video, and we put a Transformer on it, but there's going to be no triangular mask, so all the tokens are allowed to talk to each other as much as they want, and they're just encoding whatever the content of this French sentence is. Once they've encoded it, they basically come out at the top here, and then, in our decoder, which does the language modeling, there's an additional connection to the outputs of the encoder, and that is brought in through a cross-attention. So the queries are still generated from x, but now the keys and the values are coming from the side, from the nodes that came out of the encoder, and those keys and values feed in on the side into every single block of the decoder.

That's why there's an additional cross-attention, and really what it's doing is conditioning the decoding not just on the past of the current decoding, but also on having seen the fully encoded French prompt. So it's an encoder-decoder model, which is why there are those two Transformers, the additional block, and so on. We did not do this because we have nothing to encode; there's no conditioning, we just have a text file and we want to imitate it, and that's why we are using a decoder-only Transformer, exactly as done in GPT.
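Purely for illustration (the lecture does not implement this), a hypothetical single head of cross-attention following this description might look like the sketch below; the class name and signature are inventions for the example, and only the query/key/value wiring follows the text:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class CrossAttentionHead(nn.Module):
    """One head of cross-attention: queries come from the decoder stream x,
    keys and values come from the encoder output. Hypothetical sketch only."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x, enc_out):
        q = self.query(x)          # (B, T_dec, head_size) from the decoder
        k = self.key(enc_out)      # (B, T_enc, head_size) from the encoder
        v = self.value(enc_out)    # (B, T_enc, head_size) from the encoder
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # (B, T_dec, T_enc)
        wei = F.softmax(wei, dim=-1)   # no triangular mask: every decoder token may look at all encoder tokens
        return wei @ v                 # (B, T_dec, head_size)
```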

Okay, so now I wanted to do a very brief walkthrough of nanoGPT, which you can find on my GitHub. NanoGPT is basically two files of interest: train.py and model.py. train.py is all the boilerplate code for training the network; it is basically all the stuff that we had here, the training loop, it's just a lot more complicated because we're saving and loading checkpoints and pretrained weights, decaying the learning rate, compiling the model, and using distributed training across multiple nodes or GPUs. So train.py gets a bit more hairy and complicated, with more options and so on. But model.py should look very similar to what we've done here; in fact the model is almost identical.

First, here we have the causal self-attention block, and all of this should look very recognizable to you: we're producing queries, keys and values, we're doing dot products, we're masking, applying softmax, optionally dropping out, and here we are pooling the values. What is different is that in our code I separated out the multi-headed attention into a single individual Head and then had multiple heads that I explicitly concatenate, whereas here all of it is implemented in a batched manner inside a single causal self-attention module. So we don't just have a B, a T and a C dimension; we also end up with a fourth dimension, the heads, and it gets a bit more hairy because we now have four-dimensional tensors. But it is mathematically equivalent: the exact same thing is happening as in what we have, it's just a bit more efficient, because all the heads are now treated as a batch dimension as well.
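A sketch in that batched spirit, simplified relative to the real model.py (which also handles bias options, dropout and flash attention):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    """All heads handled at once by folding the head count into an extra tensor
    dimension (B, n_head, T, head_size). A simplified sketch, not the real file."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # queries, keys, values in one matmul
        self.c_proj = nn.Linear(n_embd, n_embd)       # output projection
        self.register_buffer('mask', torch.tril(torch.ones(block_size, block_size))
                                          .view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_size): heads become a batch dimension
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * k.size(-1)**-0.5   # (B, n_head, T, T)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v                                          # (B, n_head, T, head_size)
        y = y.transpose(1, 2).contiguous().view(B, T, C)     # re-assemble the heads
        return self.c_proj(y)
```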

Then we have the multi-layer perceptron; it's using the GELU nonlinearity, which is defined here, instead of ReLU, and this is done just because OpenAI used it and I want to be able to load their checkpoints. The blocks of the Transformer are identical, the communicate and compute phases as we saw, and then the GPT will be identical: we have the position encodings, the token encodings, the blocks, the layer norm at the end, and the final linear layer. This should all look very recognizable. There's a bit more here because I'm loading checkpoints and things like that, and I'm separating out the parameters into those that should be weight-decayed and those that shouldn't, but the generate function should also be very similar. So a few details are different, but you should definitely be able to look at this file and understand a lot of the pieces now.

So let's now bring things back to ChatGPT. What would it look like if we wanted to train ChatGPT ourselves, and how does it relate to what we learned today? Well, to train a ChatGPT there are roughly two stages: first the pre-training stage and then the fine-tuning stage. In the pre-training stage we train on a large chunk of the internet and just try to get a first decoder-only Transformer to babble text. So it's very, very similar to what we've done ourselves, except we've done a tiny little baby pre-training step.

In our case, this is how you print the number of parameters. I printed it and it's about 10 million, so this Transformer that I created here to generate little Shakespeare was about 10 million parameters. Our dataset is roughly 1 million characters, so roughly 1 million tokens, but you have to remember that OpenAI uses a different vocabulary: they're not at the character level, they use subword chunks of words, with a vocabulary of roughly 50,000 elements, so their sequences are a bit more condensed. Our Shakespeare dataset would probably be around 300,000 tokens in the OpenAI vocabulary, roughly. So we trained about a 10-million-parameter model on roughly 300,000 tokens.
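For reference, "printing the number of parameters" can be done with a small helper like this (a generic sketch, not necessarily the exact line used in the notebook):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of scalars across all of the model's parameter tensors."""
    return sum(p.numel() for p in model.parameters())

# e.g. print(count_parameters(model) / 1e6, 'M parameters')  # roughly 10 M for the config above
```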

Now, when you go to the GPT-3 paper and look at the Transformers that they trained: they trained a number of Transformers of different sizes, but the biggest Transformer has 175 billion parameters, whereas ours, again, is 10 million. They list the number of layers in the Transformer, the embedding dimension, the number of heads and the head size, and then the batch size (ours was 64), and the learning rate is similar. When they train this Transformer, they train on 300 billion tokens; again, remember, ours is about 300,000, so that is roughly a million-fold increase, and that number would not even be that large by today's standards, where you'd be going up to one trillion and above. So they are training a significantly larger model on a good chunk of the internet, and that is the pre-training stage. Otherwise these hyperparameters should be fairly recognizable to you, and the architecture is actually nearly identical to what we implemented ourselves. Of course, it's a massive infrastructure challenge to train this: you're talking about typically thousands of GPUs having to talk to each other to train models of this size.

So that's just the pre-training stage. Now, after you complete the pre-training stage, you don't get something that responds to your questions with answers and is helpful and so on; you get a document completer. It babbles, but it doesn't babble Shakespeare, it babbles internet: it will create arbitrary news articles and documents, and it will try to complete documents, because that's what it's trained for, completing the sequence. So when you give it a question, it might just give you more questions, it would follow with more questions; it will do whatever it looks like some document on the internet would do in the training data. And so who knows, you're getting kind of undefined behavior: it might answer your question with other questions, it might ignore your question, it might just try to complete some news article; it's totally undefined, as we say.

So the second, fine-tuning stage is to actually align it to be an assistant, and this is the second stage. The ChatGPT blog post from OpenAI talks a little bit about how this stage is achieved; there are roughly three steps to it. What they do here is start to collect training data that looks specifically like what an assistant would do: documents that have the format where the question is on top and the answer is below. They have a large number of these, but probably not on the order of the internet, probably on the order of maybe thousands of examples. They then fine-tune the model to basically focus only on documents that look like that, and so you're starting to slowly align it: it's going to expect a question at the top and it's going to expect to complete the answer. These very large models are very sample-efficient during their fine-tuning, so this actually somehow works. But that's just step one, the fine-tuning.

There are more steps. The second step is that you let the model respond, and then different raters look at the different responses and rank them by their preference as to which one is better than the other. They use that to train a reward model, so they can predict, basically using a different network, how desirable any candidate response would be. Then, once they have a reward model, they run PPO, which is a form of policy-gradient reinforcement-learning optimizer, to fine-tune the sampling policy so that the answers the GPT now generates are expected to score a high reward according to the reward model. So there's a whole aligning or fine-tuning stage here, with multiple steps in between as well, and it takes the model from being a document completer to a question answerer. That's a whole separate stage; a lot of this data is not available publicly, it is internal to OpenAI, and it's much harder to replicate this stage.

So that's roughly what would give you a ChatGPT, and nanoGPT focuses on the pre-training stage. Okay, and that's everything that I wanted to cover today. To summarize, we trained a decoder-only Transformer following the famous paper "Attention Is All You Need" from 2017, so that's basically a GPT. We trained it on tiny Shakespeare and got sensible results. All of the training code is roughly 200 lines. I will be releasing this code base, and it also comes with all the git log commits along the way as we built it up. In addition to this code, I'm going to release the notebook, of course, the Google Colab. I hope that gave you a sense for how you can train these models: something like GPT-3 would be architecturally basically identical to what we have, but it's somewhere between ten thousand and one million times bigger, depending on how you count. That's all I have for now. We did not talk about any of the fine-tuning stages that would typically go on top of this. So if you're interested in something that's not just language modeling, if you actually want to perform tasks, or you want the model to be aligned in a specific way, or you want to detect sentiment, or anything like that, basically any time you don't want something that's just a document completer, you have to complete further stages of fine-tuning, which we did not cover. That could be simple supervised fine-tuning, or it could be something fancier, like we see in ChatGPT, where we actually train a reward model and then do rounds of PPO to align it with respect to the reward model. So there's a lot more that can be done on top of it. I think for now we're starting to get to about the two-hour mark, so I'm going to finish here. I hope you enjoyed the lecture, and yeah, go forth and transform. See you later.


Related Tags
Artificial Intelligence, Transformer, Self-Attention, GPT Models, Deep Learning, Programming Practice, Model Training, Technical Innovation, Data Analysis, Language Models