Finetuning Large Language Models, Part 5: Preparing Data
Summary
TLDR: This video tutorial covers how to prepare high-quality data for training. It stresses the importance of data quality, diversity, and realness, and walks through the preprocessing steps: collecting instruction-response pairs, concatenating them, tokenizing, padding or truncating, and splitting into training and test sets. Using Hugging Face's AutoTokenizer class, it demonstrates how text is converted into numbers a model can understand. Finally, it shows how to apply these steps to a real dataset to get it ready for model training.
Takeaways
- 📈 High-quality data: prioritize quality when preparing training data; garbage in means garbage out.
- 🌐 Data diversity: make sure the data covers your use case broadly, so the model doesn't memorize and repeat identical input-output pairs.
- 🔍 Real vs. generated data: use real data whenever possible, since generated data carries detectable patterns that limit the model's ability to learn new framings.
- 📚 The role of data volume: quantity matters, but for fine-tuning the quality, diversity, and realness of the data are more critical.
- 🔢 Collecting data: start by gathering instruction-response or question-answer pairs and fitting them into a prompt template.
- 🔑 Tokenizing data: convert the text into numbers the model can understand, typically based on the frequency of common character sequences.
- 📏 Padding and truncation: pad shorter texts so every example in a batch has the same length, and truncate texts that are too long.
- 🔄 Batch processing: data is typically handled in batches so the model can learn efficiently.
- 📊 Splitting the dataset: divide the data into training and test sets so the model's performance can be evaluated.
- 🛠️ Use the tooling: the AutoTokenizer class in Hugging Face's Transformers library simplifies data preparation.
- 🚀 Ready to train: once preprocessing is complete, model training can begin.
Q & A
Why is high-quality data critical for model training?
- High-quality data is critical because the quality of the input directly shapes the quality of the output. If you feed in low-quality data, the model will simply parrot its patterns back, producing low-quality output.
What role does data diversity play in model training?
- Diverse data exposes the model to more aspects of your use case and more kinds of input. If all the inputs and outputs look the same, the model starts to memorize them rather than learning new patterns or ways of phrasing things.
What is the difference between real and generated data for model training?
- Real data is usually more effective than generated data, because generated data already carries certain patterns, the same patterns that generated-content detectors look for. Real data offers more natural, more varied patterns and helps the model learn a broader range of framings.
How important is data volume in preprocessing?
- Volume matters in machine learning generally, but the model has already learned from vast amounts of internet data during pretraining, so for a specific task the quality, diversity, and realness of the data matter more than sheer quantity.
What is tokenization?
- Tokenization converts text into numbers that represent pieces of that text. It does not necessarily split by word; it is based on the frequency of common character sequences. For example, 'ing' is a common token because it appears across every genre of text.
Why does model training process data in batches?
- Models operate on fixed-size tensors, so texts of different lengths must be brought to a common length. Batching the data and making every example in a batch the same length keeps those tensors uniform and lets the model process them efficiently.
What is padding?
- Padding is a strategy for handling variable-length texts: a designated token (often zero) is appended to shorter texts until every text in the batch has the same length, so the model can process them.
What is truncation?
- Truncation is a strategy for handling encoded texts that are too long: the text is shortened to fit within the model's maximum length. It can cut from the left or from the right, depending on what the task needs to keep.
How do you tokenize with Hugging Face's AutoTokenizer class?
- The AutoTokenizer class automatically finds the right tokenizer for a given model. Just supply the model name, and it returns tokenization that matches the tokenizer the model was trained with.
How do you apply the tokenize function to an entire dataset?
- Use the map function to apply the tokenize function across the whole dataset. The batch_size and drop_last_batch parameters control how the data is batched and whether to drop a final, incomplete batch.
How is the dataset split for training and testing?
- The train_test_split function splits the dataset into a training set and a test set. The test_size parameter controls the size of the test set, and the shuffle parameter randomizes the order of the dataset.
Outlines
📚 Data Preparation and the Importance of Quality
This section stresses the importance of preparing high-quality data before training a model. Quality comes first: high-quality inputs produce high-quality outputs. Diversity is also key, since it keeps the model from memorizing a single kind of input-output pair. In addition, real data is more effective than generated data; generated data has its uses, but real data better reflects actual usage. Finally, while quantity matters in machine learning, it matters less here than quality and diversity.
🔍 Data Processing Steps and Tokenization
This section details the key data processing steps: collecting instruction-response pairs, concatenating them, tokenizing the data, and splitting it into training and test sets. Tokenization converts text into numbers the model can work with, typically based on the frequency of common character sequences. It also covers padding and truncation, which ensure all examples in a batch share the same length and handle texts that exceed the model's maximum length.
🎯 Applying the Dataset and Preparing to Train
The final section shows how to apply these steps to a real dataset for model training. First the dataset file is loaded, the questions and answers are combined, and the tokenizer encodes them. The tokenized data is then converted into a model-ready format, handling any batch-size mismatches that come up. Finally, the dataset is split into training and test sets, ready for training. A few more entertaining datasets are also provided, covering Taylor Swift, the band BTS, and open-source large language models, for users to choose from and train on.
Keywords
💡Data preparation
💡Data quality
💡Data diversity
💡Real vs. generated data
💡Data volume
💡Data collection
💡Data concatenation
💡Tokenization
💡Padding
💡Truncation
💡Dataset splitting
Highlights
Data preparation is critical for model training, and high-quality data is the number one requirement.
Data diversity keeps the model from memorizing input-output pairs and improves generalization.
Real data is more effective than generated data, especially for writing tasks.
Pretrained models have already learned from vast amounts of data, so quantity matters less than data quality.
The first step of data collection is gathering instruction-response or question-answer pairs.
The second step is concatenating those pairs and adding a prompt template.
The third step is tokenizing the data, with padding or truncation so inputs fit the model's requirements.
Tokenization converts text into numbers that represent pieces of text, based on the frequency of common character sequences.
Each tokenizer is tied to a specific model; using the wrong tokenizer will confuse the model.
Hugging Face's AutoTokenizer class automatically finds and uses the right tokenizer.
When batching inputs, all texts must be brought to the same length, usually through padding.
Models have a maximum input length; longer texts must be truncated to fit.
Transcripts
Now, after exploring the data that you'll be using, in this lesson you'll learn how to prepare that data for training. All right, let's jump into it.
So, next, on what kind of data you need to prep: well, there are a few good best practices. One is you want higher-quality data, and actually that is the number one thing you need for fine-tuning, rather than lower-quality data. What I mean by that is: if you give it garbage inputs, it'll try to parrot them and give you garbage outputs. So giving really high-quality data is important.

Next is diversity. Having diverse data that covers a lot of aspects of your use case is helpful. If all your inputs and outputs are the same, then the model can start to memorize them, and if that's not exactly what you want, then the model will start to just spout the same thing over and over again. So having diversity in your data is really important.

Next is real or generated. I know there are a lot of ways to create generated data, and you've already seen one way of doing that using an LLM, but actually having real data is very, very effective and helpful most of the time, especially for those writing tasks. That's because generated data already has certain patterns to it. You might have heard of some services that try to detect whether something is generated or not, and that's actually because there are patterns in generated data that they're trying to detect. As a result, if you train on more of the same patterns, it's not necessarily going to learn new patterns or new ways of framing things.

And finally, I put this last because, actually, in most machine learning applications having way more data is more important than having less data, but as you've just seen, pre-training handles a lot of this problem. Pre-training has learned from a lot of data, all from the internet, and so it already has a good base understanding. It's not starting from zero. So having more data is helpful for the model, but not as important as the top three, and definitely not as important as quality.
So first, let's go through some of the steps of collecting your data. You've already seen some of those instruction-response pairs. The first step is to collect them. The next one is to concatenate those pairs and add a prompt template, which you've already seen as well. The next step is tokenizing the data, adding padding or truncating the data so it's the right size going into the model, and you'll see how to tokenize that in the lab.

So the steps to prepping your data are: one, collecting those instruction-response pairs, maybe that's question-answer pairs; two, concatenating those pairs together, adding some prompt template like you did before; the third step is tokenizing that data; and the last step is splitting that data into training and testing.

Now, in tokenizing, what does that really mean? Well, tokenizing your data is taking your text data and actually turning that into numbers that represent each of those pieces of text. It's not actually necessarily by word; it's based on the frequency of common character occurrences. In this case, one of my favorites is the 'ing' token, which is very common in tokenizers, and that's because it happens in every single genre. So here you can see 'fine-tuning': every single verb in the gerund, 'fine-tuning' or 'tokenizing', has 'ing', and that maps onto the token 278 here. And when you decode it with the same tokenizer, it turns back into the same text.

Now, there are a lot of different tokenizers, and a tokenizer is really associated with a specific model, as the model was trained with it. If you give the wrong tokenizer to your model, it'll be very confused, because it will expect different numbers to represent different sets of letters and different words. So make sure you use the right tokenizer, and you'll see how to do that easily in the lab. Cool, so let's head over to the notebook.
Okay, so first we'll import a few different libraries, and actually the most important one to see here is the AutoTokenizer class from the Transformers library by Hugging Face. What it does is amazing: it automatically finds the right tokenizer for your model when you just specify what the model is. All you have to do is put the model name in, and this is the same model name that you saw before, which is a 70-million-parameter Pythia-based model.

Okay, so maybe you have some text that says, you know, 'Hi, how are you?' Now let's tokenize that text: put that in, boom. Let's see what the encoded text is. All right, so that's different numbers representing the text here. The tokenizer outputs a dictionary with input IDs that represent the tokens, so I'm just printing that here. Then let's actually decode that back into text and see if it actually turns back into 'Hi, how are you?' Cool, awesome: it turns back into 'Hi, how are you?', so that's great.
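A minimal sketch of those cells, assuming the `EleutherAI/pythia-70m` checkpoint (my reading of the "70-million-parameter Pythia-based model" mentioned above); the later sketches in this transcript build on this one:

```python
from transformers import AutoTokenizer

# AutoTokenizer finds the tokenizer that matches the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

text = "Hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]
print(encoded_text)                    # token ids representing the text

# Decoding with the same tokenizer recovers the original string.
print(tokenizer.decode(encoded_text))
```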
All right, so when tokenizing, you're probably putting in batches of inputs, so let's just take a look at a few different inputs together. There's 'Hi, how are you?', 'I'm good', and 'Yes'. Putting that list of text through (you can just put it in a batch like that into the tokenizer) you get a few different things here. So here's 'Hi, how are you?' again; 'I'm good' is smaller; and 'Yes' is just one token.

As you can see, these vary in length. Actually, something that's really important for models is that everything in a batch is the same length, because you're operating with fixed-size tensors, and so the text needs to be the same length. One thing that we do is something called padding. Padding is a strategy to handle these variable-length encoded texts. For your padding token, you have to specify what number you want to represent padding, and specifically we're using zero, which is actually the end-of-sentence token as well. So when we run pad equals true through the tokenizer, you can see the 'Yes' string has a lot of zeros padded there on the right, just to match the length of the 'Hi, how are you?' string.
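A sketch of that padding step, reusing the tokenizer from above; assigning the end-of-sentence token as the pad token mirrors the "zero" described here, though the lab's exact setup may differ:

```python
# GPT-style tokenizers often define no pad token, so reuse the
# end-of-sentence token (id 0 for this tokenizer) as padding.
tokenizer.pad_token = tokenizer.eos_token

list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded = tokenizer(list_texts, padding=True)

# Shorter strings get zeros appended on the right to match the longest.
print(encoded["input_ids"])
```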
Your model will also have a max length that it can handle and take in, so it can't just fit everything in. You've played with prompts before, and you've probably noticed that there is a limit to the prompt length; this is the same thing. Truncation is a strategy to handle making those encoded texts much shorter, so that they actually fit into the model. As you can see here, I'm just artificially changing the max length to three, setting truncation to true, and then seeing how it's much shorter now for 'Hi, how are you?'. It's truncating from the right, so it's just getting rid of everything here on the right.

Now, realistically, one thing that's very common is that you're writing a prompt, maybe you have your instruction somewhere, and a lot of the important things are on the other side, on the right, and that's getting truncated out. So specifying the truncation side as left can actually truncate it the other way. This really depends on your task. And realistically, for padding and truncation, you want to use both, so let's actually set both in there: truncation's true and padding's true. Here I'm just printing that out, so you can see the zeros here, but it's also getting truncated down to three.
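And a sketch of the truncation variants, continuing from the batch above; `truncation_side` is a standard attribute on `transformers` tokenizers:

```python
# Truncate from the right (the default), artificially capping at 3 tokens.
encoded = tokenizer(list_texts, max_length=3, truncation=True)
print(encoded["input_ids"])

# If the important content sits at the end of your prompt,
# truncate from the left instead.
tokenizer.truncation_side = "left"
encoded = tokenizer(list_texts, max_length=3, truncation=True)
print(encoded["input_ids"])

# Realistically you want padding and truncation together.
encoded = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print(encoded["input_ids"])
```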
Great, so that was really a toy example. I'm going to now paste some code that you did in the previous lab on prompts. Here it's loading up the dataset file with the questions and answers, putting them into the prompt, and hydrating those prompts, all in one go. So now you can see one data point here of question and answer.

Now you can run the tokenizer on just one of those data points: first concatenating that question with that answer, and then running it through the tokenizer. I'm just returning the tensors as a NumPy array here, to keep it simple, and running it with just padding, because I don't know how long these tokens will actually be. What's important is that I then figure out the minimum between the max length and the length of the tokenized inputs. Of course, you can always just pad to the longest, or you can always pad to the max length, and that's what that is here. Then I'm tokenizing again with truncation up to that max length. So let me just print that out and pull them out of the dictionary, and cool, that's what the tokens look like.

All right, so let's actually wrap this into a full-fledged function, so you can run it through your entire dataset. This is again the same thing happening here that you already looked at: grabbing the max length and setting the truncation side. So that's a function for tokenizing your dataset.
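As a rough sketch of that function, continuing with the tokenizer from the sketches above (the question/answer column names and the 2048-token cap are assumptions, since the lab's exact values aren't shown here):

```python
def tokenize_function(examples):
    # Concatenate each question with its answer into one training text;
    # the "question"/"answer" column names are assumed.
    text = examples["question"][0] + examples["answer"][0]

    # First pass: pad only, since we don't yet know how many tokens there are.
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(text, return_tensors="np", padding=True)

    # Cap at the smaller of a fixed model limit and the actual token count.
    max_length = min(tokenized_inputs["input_ids"].shape[1], 2048)

    # Second pass: truncate down to that max length.
    return tokenizer(
        text, return_tensors="np", truncation=True, max_length=max_length
    )
```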
And now what you can do is load up that dataset. There's a great map function here, so you can map the tokenize function onto that dataset. You'll see here I'm doing something really simple: I'm setting the batch size to one, it is going to be batched, and drop last batch is true. That's often what we do to help with mixed-size inputs, since the last batch might be a different size. Cool.
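A sketch of that loading-and-mapping step, assuming a JSONL file of question-answer pairs (the `lamini_docs.jsonl` file name is illustrative):

```python
import datasets

# Load the question-answer dataset; the file name here is illustrative.
finetuning_dataset = datasets.load_dataset(
    "json", data_files="lamini_docs.jsonl", split="train"
)

# Map the tokenize function over the whole dataset, one example per
# batch, dropping a final batch that would have a different size.
tokenized_dataset = finetuning_dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True,
)
```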
Great, so then the next step is to split the dataset. First I have to add in this labels column, for Hugging Face to handle it, and then I'm going to run this train test split function. I'm going to specify the test size as 10% of the data; of course, you can change this depending on how big your dataset is. Shuffle is true, so I'm randomizing the order of the dataset. I'm just going to print that out here, so now you can see that the dataset has been split across training and test, with 140 examples in the test set there.
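And a sketch of the split itself, continuing from the tokenized dataset above (`add_column` and `train_test_split` are standard `datasets` methods; the seed is an illustrative choice):

```python
# Duplicate input_ids into a "labels" column so Hugging Face
# trainers can compute the language-modeling loss.
tokenized_dataset = tokenized_dataset.add_column(
    "labels", tokenized_dataset["input_ids"]
)

# Hold out 10% of the data as a test set, shuffling the order first.
split_dataset = tokenized_dataset.train_test_split(
    test_size=0.1, shuffle=True, seed=123
)
print(split_dataset)  # a DatasetDict with "train" and "test" splits
```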
And of course, this is already loaded up on Hugging Face, like you had seen before, so you can go there, download it, and see that it is the same.

So while that's a professional dataset (it's about a company; maybe this is related to your company, for example, and you could adapt it to your company), we thought that might be a bit boring. It doesn't have to be, so we included a few more interesting datasets that you can also work with; feel free to customize these and train your models on them instead. One is for Taylor Swift, one's for the popular band BTS, and one is on open-source large language models that you can play with. Just looking at one data point from the Taylor Swift dataset, let's take a look. All right: 'What's the most popular Taylor Swift song among millennials? How does this song relate to the millennial generation?' Okay, so you can take a look at these yourself, and these datasets are available via Hugging Face.

Now, in the next lab, now that you've prepped all this data and tokenized it, you're ready to train the model.