Finetuning Large Language Models, Part 5: Preparing Data

Tech sharing by 宝玉
30 Aug 2023 · 10:53

Summary

TLDR: This video tutorial explains how to prepare high-quality data for training. It stresses the importance of data quality, diversity, and authenticity, and walks through the preprocessing steps: collecting instruction-response pairs, concatenating them, tokenizing, padding or truncating, and splitting into training and test sets. Using Hugging Face's AutoTokenizer class, it demonstrates how to turn text into numbers a model can understand. Finally, it shows how to apply these steps to a real dataset to get it ready for model training.

Takeaways

  • 📈 High-quality data: when preparing training data, prioritize quality — garbage in leads to garbage out.
  • 🌐 Data diversity: make sure the data covers a broad range of cases so the model doesn't memorize and repeat identical input-output pairs.
  • 🔍 Real vs. generated data: prefer real data where possible, because generated data carries detectable patterns that limit the model's ability to learn new ways of framing things.
  • 📚 Data volume: more data helps, but in most machine learning applications quality, diversity, and authenticity matter more.
  • 🔢 Collecting data: start by gathering instruction-response (or question-answer) pairs and fitting them into a prompt template.
  • 🔑 Tokenizing data: convert text into numbers the model can understand, typically based on the frequency of common character sequences.
  • 📏 Padding and truncation: pad shorter texts so every example in a batch has the same length, and truncate texts that are too long.
  • 🔄 Batch processing: data is usually processed in batches so the model can train efficiently.
  • 📊 Splitting the dataset: divide the data into training and test sets so you can evaluate model performance.
  • 🛠️ Use the tooling: the AutoTokenizer class from Hugging Face's Transformers library simplifies data preparation.
  • 🚀 Ready to train: once preprocessing is done, model training can begin.

Q & A

  • Why is high-quality data critical for model training?

    - Because the quality of the input directly determines the quality of the output. Feed the model low-quality data and it will simply parrot those patterns, producing low-quality output in turn.

  • What role does data diversity play in model training?

    - Diverse data exposes the model to more use cases and input types. If all inputs and outputs look the same, the model starts memorizing them instead of learning new patterns or ways of phrasing things.

  • How do real and generated data differ for model training?

    - Real data is usually more effective than generated data, because generated data already carries certain patterns — the very patterns that services for detecting generated content look for. Real data offers more natural, varied patterns and helps the model handle a wider range of use cases.

  • How important is data volume in preprocessing?

    - Volume matters in machine learning, but a pretrained model has already learned from vast amounts of internet data, so for a specific task the quality, diversity, and authenticity of the data matter more than sheer quantity.

  • What is tokenization?

    - Tokenization converts text into numbers, each representing a piece of the text. It doesn't necessarily split on words; it's based on the frequency of common character sequences. For example, 'ing' is a common token because it appears in every genre of text.
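
    A minimal sketch of encoding and decoding, assuming the 70M-parameter Pythia checkpoint the lesson uses ("EleutherAI/pythia-70m" on the Hugging Face Hub); the exact ids printed depend on the tokenizer:

    from transformers import AutoTokenizer

    # Assumption: the 70M-parameter Pythia checkpoint mentioned in the video.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

    text = "Hi, how are you?"
    encoded = tokenizer(text)          # a dict; "input_ids" holds the token numbers
    print(encoded["input_ids"])

    # Decoding with the same tokenizer recovers the original text.
    print(tokenizer.decode(encoded["input_ids"]))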

  • Why does model training process data in batches?

    - Models operate on fixed-size tensors, so texts of different lengths must be brought to a uniform length. Processing in batches ensures every example in a batch has the same length, which the model can handle.

  • What is the padding strategy?

    - Padding handles variable-length texts by appending a designated token (often 0) to shorter texts until every text in the batch has the same length, so the model can process them together.
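
    A short sketch of padding, reusing the tokenizer from the sketch above; the lesson reuses token id 0 — this model's end-of-sequence token — as the pad token:

    # A batch of texts of different lengths, as in the video.
    batch = ["Hi, how are you?", "I'm good", "Yes"]

    # Reuse token id 0 as the pad token; for this model it is also the EOS token.
    tokenizer.pad_token = tokenizer.eos_token

    padded = tokenizer(batch, padding=True)
    for ids in padded["input_ids"]:
        print(ids)  # shorter sequences are padded on the right to match the longest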

  • What is the truncation strategy?

    - Truncation handles encoded texts that are too long by cutting them down to the model's maximum length. It can remove tokens from the left or the right, depending on the task.
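
    A sketch of truncation under the same assumptions, using the toy max_length of 3 from the video:

    # Artificially small max_length, as in the video, to make truncation visible.
    truncated = tokenizer(batch, max_length=3, truncation=True)
    print(truncated["input_ids"])  # each sequence now has at most 3 tokens

    # Truncation removes tokens from the right by default; if the important part
    # of your prompt sits on the right, truncate from the left instead.
    tokenizer.truncation_side = "left"

    # Realistically you use padding and truncation together.
    both = tokenizer(batch, max_length=3, truncation=True, padding=True)
    print(both["input_ids"])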

  • How do you tokenize with Hugging Face's AutoTokenizer class?

    - AutoTokenizer automatically finds the right tokenizer for a given model. Just pass in the model name, and it returns the tokenizer the model was trained with.
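
    A small illustration of how AutoTokenizer resolves the tokenizer class from the model name; "bert-base-uncased" is just a second illustrative checkpoint, not one used in the lesson:

    from transformers import AutoTokenizer

    # AutoTokenizer reads each model's config and returns the matching tokenizer
    # class, so the two checkpoints below yield two different classes.
    for name in ["EleutherAI/pythia-70m", "bert-base-uncased"]:
        tok = AutoTokenizer.from_pretrained(name)
        print(name, "->", type(tok).__name__)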

  • How do you apply a tokenize function to an entire dataset?

    - Use the map function to apply the tokenize function across the dataset. The batch_size and drop_last_batch parameters control how data is batched and whether a smaller final batch is dropped.
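
    A hedged sketch of mapping a tokenize function over a dataset. The dataset id "lamini/lamini_docs" and its "question"/"answer" columns are assumptions modeled on the course notebook; the map settings mirror the ones named in the video:

    import datasets

    def tokenize_function(examples):
        # Join each question with its answer, then tokenize the batch.
        texts = [q + a for q, a in zip(examples["question"], examples["answer"])]
        tokenizer.pad_token = tokenizer.eos_token
        return tokenizer(texts, padding=True, truncation=True, max_length=2048)

    finetuning_dataset = datasets.load_dataset("lamini/lamini_docs", split="train")
    tokenized_dataset = finetuning_dataset.map(
        tokenize_function,
        batched=True,
        batch_size=1,          # one example per batch, as in the video
        drop_last_batch=True,  # drop a trailing batch of a different size
    )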

  • How is the dataset split for training and testing?

    - The train_test_split function splits the dataset into training and test sets. The test_size parameter controls the size of the test set, and shuffle randomizes the order of the data.
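
    A sketch of the split, continuing from the mapping sketch above. Adding a labels column (set to the input ids, as is usual for causal LM training) comes from the transcript; the fixed seed is an added assumption for reproducibility:

    # The labels column mirrors input_ids, the usual setup for causal LM training.
    tokenized_dataset = tokenized_dataset.add_column(
        "labels", tokenized_dataset["input_ids"]
    )

    split_dataset = tokenized_dataset.train_test_split(
        test_size=0.1,  # hold out 10% for evaluation, as in the video
        shuffle=True,   # randomize the order before splitting
        seed=42,        # assumption: fixed seed for reproducibility
    )
    print(split_dataset)  # a DatasetDict with "train" and "test" splits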

Outlines

00:00

📚 Data preparation and why quality matters

This segment stresses the importance of preparing high-quality data before training. Quality comes first: high-quality inputs produce high-quality outputs. Diversity is also key, as it keeps the model from merely memorizing a single set of input-output pairs. Real data beats generated data — generated data has its uses, but real data better reflects actual use cases. Finally, data volume matters in machine learning, but somewhat less than quality and diversity.

05:01

🔍 Data processing steps and tokenization

This part walks through the key data processing steps: collecting instruction-response pairs, concatenating them, tokenizing the data, and splitting it into training and test sets. Tokenization converts text into numbers the model can understand, typically based on the frequency of character sequences. It also covers padding and truncation, which ensure all examples in a batch have the same length, and how to handle texts of varying lengths.

10:02

🎯 Applying it to a real dataset and preparing for training

The final part shows how to apply these steps to a real dataset. The dataset file is loaded, questions and answers are combined, and the tokenizer is run over the result. The tokenized data is converted into a model-friendly format, inconsistent batch sizes are handled, and the dataset is split into training and test sets, ready for training. A few more entertaining datasets are also provided — on Taylor Swift, BTS, and open-source large language models — for users to pick from and train on.

Keywords

💡 Data preparation

Data preparation is the process of cleaning, formatting, and optimizing data before training a machine learning model. In the video, the goal is to ensure the data fed to the model is high quality, diverse, and as real as possible — for example, by collecting high-quality data, ensuring diversity, and preferring real over generated data.

💡 Data quality

Data quality refers to the accuracy, completeness, and consistency of data. It is critical to model performance because a model's output depends on the quality of its input. The video emphasizes the point bluntly: garbage in, garbage out.

💡 Data diversity

Data diversity means the dataset covers many types of data and therefore many use cases. Diverse data helps the model learn more patterns and associations and avoids overfitting to a narrow subset. The video stresses diversity to keep the model from memorizing specific input-output pairs and to ensure it generalizes to new, unseen data.

💡 Real vs. generated data

Real data is collected from the real world, while generated data is produced by an algorithm or model. Generated data can simulate many scenarios, but real data usually better reflects the complexity and variety of actual applications. The video notes that although there are many ways to create generated data, real data is usually more effective for training — especially for writing tasks — because generated data carries patterns that content-detection services have already learned to spot.

💡 Data volume

Data volume is the amount of data used to train a model. More data generally helps a model learn more features and patterns and thus perform better. However, as the video notes, a pretrained model has already learned from vast amounts of internet data, so for a specific task additional volume matters less than quality, diversity, and authenticity.

💡 Data collection

Data collection is the first step of the machine learning pipeline: gathering and organizing the data used for training. In the video this means collecting instruction-response pairs, which form the basis of the training set the model uses to learn how to generate the right response for a given instruction.

💡 Data concatenation

Data concatenation means joining collected pairs (such as a question and its answer) into a complete sequence that can be fed to the model. This creates a continuous text so the model can understand the context of the input. In the video, this usually means pairing a question with its answer and placing them into a prompt template, as sketched below.
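
    A minimal sketch of template hydration; the template text and example pair are hypothetical, not the course's exact template:

    # Hypothetical template and example pair, for illustration only.
    prompt_template = "### Question:\n{question}\n\n### Answer:\n{answer}"

    example = {
        "question": "What is padding?",
        "answer": "A strategy that equalizes text lengths within a batch.",
    }
    print(prompt_template.format(**example))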

💡 Tokenization

Tokenization converts text into a numeric form the model can understand, typically by splitting text into smaller units (words, characters, or subwords) and mapping each unit to a number. It is a key data-preparation step because it determines how the model reads and processes input text.

💡 Padding

Padding handles texts of different lengths by appending a designated token (often 0) to shorter texts until all texts have the same length. This lets the model train on fixed-size tensors, since most models cannot handle variable-length input. In the video, padding ensures all inputs have the same length before being fed to the model.

💡 Truncation

Truncation cuts texts that exceed the model's maximum input length down to a size the model can accept. It ensures every input can be processed, at the cost of dropping part of an overly long input. In the video, truncation handles encoded texts that exceed the model's maximum length.

💡 Dataset splitting

Dataset splitting divides the data into a training set and a test set, so one portion is used for training and the other is held out to evaluate performance. In the video, the split creates an independent test set for evaluating the model after training.

Highlights

Data preparation is critical for training a model; high-quality data is the first requirement.

Data diversity prevents the model from memorizing input-output pairs and improves its ability to generalize.

Real data is more effective than generated data, especially for writing tasks.

Pretrained models have already learned from vast amounts of data, so sheer data volume matters less than data quality.

The first step of data collection is gathering instruction-response or question-answer pairs.

The second step is concatenating those pairs and placing them into a prompt template.

The third preprocessing step is tokenizing the data, with padding or truncation so it fits the model's input requirements.

Tokenization converts text into numbers; tokens are chosen based on the frequency of common character sequences.

Each tokenizer is tied to a specific model; using the wrong tokenizer will confuse the model.

Hugging Face's AutoTokenizer class automatically identifies and uses the correct tokenizer.

When batching inputs, all texts must be adjusted to the same length, usually via padding.

Models have a maximum length limit; texts longer than that must be truncated to fit.
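
Pulling those steps together, here is a sketch of a dataset-level tokenize function modeled on the one described in the transcript below: it pads first to measure the batch, caps the length at the model's maximum, then retokenizes with truncation. The checkpoint name, the 2048-token limit, and all variable names are assumptions based on the lesson.

    from transformers import AutoTokenizer

    # Assumption: the 70M Pythia checkpoint used throughout the lesson.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
    tokenizer.pad_token = tokenizer.eos_token  # pad with token id 0, as in the video

    def tokenize_function(examples):
        # Concatenate each question with its answer.
        texts = [q + a for q, a in zip(examples["question"], examples["answer"])]

        # First pass: pad only, to see how long the longest sequence really is.
        tokenized = tokenizer(texts, padding=True, return_tensors="np")

        # Cap at the model's maximum length (2048 assumed here), but don't pad
        # beyond the longest real sequence in the batch.
        max_length = min(tokenized["input_ids"].shape[1], 2048)

        # Keep the right-hand side, where the important content often sits.
        tokenizer.truncation_side = "left"

        # Second pass: truncate (and pad) to that length.
        return tokenizer(
            texts,
            max_length=max_length,
            truncation=True,
            padding=True,
            return_tensors="np",
        )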

Transcript

00:01

Now, after exploring the data that you'll be using, in this lesson you'll learn how to prepare that data for training. All right, let's jump into it.

00:12

So, next, on what kind of data you need to prep: well, there are a few good best practices. One is you want higher quality data, and actually that is the number one thing you need for fine-tuning, rather than lower quality data. What I mean by that is, if you give it garbage inputs, it'll try to parrot them and give you garbage output. So giving really high quality data is important.

00:35

Next is diversity. Having diverse data that covers a lot of aspects of your use case is helpful. If all your inputs and outputs are the same, then the model can start to memorize them, and if that's not exactly what you want, then the model will start to just spout the same thing over and over again. So having diversity in your data is really important.

00:56

Next is real or generated. I know there are a lot of ways to create generated data, and you've already seen one way of doing that using an LLM, but actually having real data is very, very effective and helpful most of the time, especially for those writing tasks. That's because generated data already has certain patterns to it. You might have heard of some services that try to detect whether something is generated or not, and that's actually because there are patterns in generated data that they're trying to detect. As a result, if you train on more of the same patterns, it's not necessarily going to learn new patterns or new ways of framing things.

01:33

And finally, I put this last because in most machine learning applications having way more data is important, but as you've actually just seen before, pre-training handles a lot of this problem. Pre-training has learned from a lot of data, all from the internet, so it already has a good base understanding; it's not starting from zero. Having more data is helpful for the model, but not as important as the top three, and definitely not as important as quality.

02:02

So first let's go through some of the steps of collecting your data. You've already seen some of those instruction-response pairs, so the first step is to collect them. The next one is to concatenate those pairs and add a prompt template — you've already seen that as well. The next step is tokenizing the data, adding padding or truncating the data so it's the right size going into the model, and you'll see how to tokenize that in the lab.

02:25

So the steps to prepping your data are: one, collecting those instruction-response pairs — maybe that's question-answer pairs; two, concatenating those pairs together, adding some prompt template like you did before; three, tokenizing that data; and the last step is splitting that data into training and testing.

02:42

Now, in tokenizing, what does that really mean? Tokenizing your data is taking your text data and actually turning it into numbers that represent each of those pieces of text. It's not necessarily by word; it's based on the frequency of common character occurrences. One of my favorites is the 'ing' token, which is very common in tokenizers, and that's because it happens in every single genre. So here you can see 'fine-tuning': every verb in the gerund — fine-tuning, tokenizing — has 'ing', and that maps onto the token 278 here, and when you decode it with the same tokenizer it turns back into the same text.

03:25

Now, there are a lot of different tokenizers, and a tokenizer is really associated with a specific model, as the model was trained with it. If you give the wrong tokenizer to your model, it'll be very confused, because it will expect different numbers to represent different sets of letters and different words. So make sure you use the right tokenizer, and you'll see how to do that easily in the lab.

03:48

Cool, so let's head over to the notebook. First we'll import a few different libraries, and actually the most important one to see here is the AutoTokenizer class from the Transformers library by Hugging Face. What it does is amazing: it automatically finds the right tokenizer for your model when you just specify what the model is. All you have to do is put the model name in, and this is the same model name that you saw before, which is a 70-million-parameter Pythia-based model.

04:16

Okay, so maybe you have some text that says 'Hi, how are you?' Now let's tokenize that text — put that in, boom — and let's see what the encoded text is. All right, so that's different numbers representing the text here. The tokenizer outputs a dictionary with input IDs that represent the tokens, so I'm just printing that here. Then let's actually decode that back into text and see if it turns back into 'Hi, how are you?' Cool, awesome, it turns back into 'Hi, how are you?', so that's great.

04:55

All right, so when tokenizing you're probably putting in batches of input, so let's just take a look at a few different inputs together: there's 'Hi, how are you?', 'I'm good', and 'Yes'. Putting that list of text through — you can just put it in a batch like that into the tokenizer — you get a few different things here. Here's 'Hi, how are you?' again, 'I'm good' is smaller, and 'Yes' is just one token.

05:15

As you can see, these vary in length. Something that's really important for models is that everything in a batch is the same length, because you're operating with fixed-size tensors, so the text needs to be the same. One thing that we do is something called padding. Padding is a strategy to handle these variable-length encoded texts. For our padding token, you have to specify what number you want to represent padding, and specifically we're using zero, which is actually the end-of-sentence token as well. So when we run pad equals true through the tokenizer, you can see the 'Yes' string has a lot of zeros padded there on the right, just to match the length of the 'Hi, how are you?' string.

06:01

Your model will also have a max length that it can handle and take in, so it can't just fit everything in. You've played with prompts before, and you've probably noticed that there is a limit to the prompt length; this is the same thing. Truncation is a strategy to make those encoded texts much shorter, so that they actually fit into the model. As you can see here, I'm just artificially changing the max length to three, setting truncation to true, and then seeing how it's much shorter now for 'Hi, how are you?' — it's truncating from the right, so it's just getting rid of everything here on the right.

06:40

Now, realistically, one thing that's very common is that you're writing a prompt, maybe you have your instruction somewhere, and you have a lot of the important things on the other side, on the right, and that's getting truncated out. Specifying the truncation side to the left can truncate it the other way, so this really depends on your task. Realistically, for padding and truncation you want to use both, so let's actually set both in there: truncation's true and padding's true. I'm just printing that out so you can see the zeros here, but also everything getting truncated down to three.

07:12

Great, so that was really a toy example. I'm going to now paste some code that you wrote in the previous lab on prompts. Here it's loading up the dataset file with the questions and answers, putting it into the prompt, hydrating those prompts, all in one go. So now you can see one data point here of question and answer.

07:38

Now you can run this tokenizer on just one of those data points: first concatenating that question with that answer, and then running it through the tokenizer. I'm just returning the tensors as a NumPy array here, to keep it simple, and running it with just padding, because I don't know how long those tokens will actually be. What's important is that I then figure out the minimum between the max length and the length of the tokenized inputs. Of course, you can always just pad to the longest, and you can always pad to the max length, so that's what that is here. And then I'm tokenizing again, with truncation up to that max length.

08:15

So let me just print that out and pull them out of the dictionary — cool, so that's what the tokens look like.

08:24

All right, so let's actually wrap this into a full-fledged function, so you can run it through your entire dataset. This is again the same thing happening here that you already looked at — grabbing the max length, setting the truncation side — so that's a function for tokenizing your dataset.

08:43

And now what you can do is load up that dataset. There's a great map function here, so you can map the tokenize function onto that dataset. You'll see here I'm doing something really simple: I'm setting batch size to one — it's very simple — it is going to be batched, and dropping the last batch is true. That's often what we do to help with mixed-size inputs, since the last batch might be a different size. Cool.

09:12

Great, and then the next step is to split the dataset. First I have to add in this labels column — that's for Hugging Face to handle it — and then I'm going to run this train test split function. I'm going to specify the test size as 10% of the data — of course you can change this depending on how big your dataset is — and shuffle is true, so I'm randomizing the order of the dataset. I'm just going to print that out here, so now you can see that the dataset has been split across training and test, with 140 examples in the test set there.

09:50

And of course this is already loaded up in Hugging Face, like you had seen before, so you can go there and download it and see that it is the same.

09:59

While that's a professional dataset — it's about a company; maybe this is related to your company, for example, and you could adapt it to your company — we thought that might be a bit boring. It doesn't have to be, so we included a few more interesting datasets that you can also work with; feel free to customize and train your models on these instead. One is for Taylor Swift, one is for the popular band BTS, and one is on open-source large language models that you can play with. Just looking at one data point from the Taylor Swift dataset, let's take a look: 'What's the most popular Taylor Swift song among millennials? How does the song relate to the millennial generation?' Okay, so you can take a look at this yourself, and these datasets are available via Hugging Face.

10:45

And now, in the next lab, now that you've prepped all this data and tokenized it, you're ready to train the model.


Related Tags
data preparation, model training, high-quality data, data diversity, real data, generated data, data preprocessing, text encoding, dataset splitting, machine learning