Lessons From Fine-Tuning Llama-2
Summary
TL;DR: This video script delves into the valuable insights gained from fine-tuning open-source language models like LLaMA 2. The speakers, Kurosh and Arthur, shed light on the importance of fine-tuning for addressing format issues and improving performance on niche tasks. They emphasize the crucial role of data curation, consistent training and inference formats, and robust evaluation pipelines. Additionally, they highlight the advantages of parameter-efficient fine-tuning techniques like LoRA, balancing model quality with memory footprint and serving efficiency. The talk provides a comprehensive exploration of the challenges, learnings, and best practices for successfully fine-tuning large language models.
Takeaways
- Open source language models like LLaMA offer cost-effectiveness and data control compared to proprietary models like GPT-4, while recent progress has narrowed the performance gap.
- Fine-tuning language models addresses the issue of models not following the desired output format or intent, enabling better control over their behavior.
- Data curation and quality are crucial for fine-tuning, ensuring clean and representative examples that capture the intended model behavior.
- Consistency between training and inference data formats is essential for effective fine-tuning and model performance.
- Proper evaluation pipelines, potentially leveraging more powerful models like GPT-4, are vital for accurately assessing fine-tuned model performance.
- Ray Train provides a powerful and user-friendly framework for distributed training of language models, enabling efficient fine-tuning.
- Fine-tuning excels at tasks like SQL generation and functional representation, where models learn to map input formats to desired outputs without deep reasoning.
- Parameter-efficient fine-tuning techniques like LoRA (Low-Rank Adaptation) offer memory and storage efficiency benefits while maintaining good performance.
- LoRA is sensitive to hyperparameters like learning rate and benefits from techniques like prompting for improved stability during training.
- While full parameter fine-tuning may still have a slight edge in quality, LoRA offers significant advantages in serving efficiency and memory footprint.
Q & A
What is the motivation behind fine-tuning open source language models?
-The motivation is to address two common problems with open source language models: hallucination and not following the intended format. Fine-tuning helps these models adhere to specific output formats for niche tasks, while hallucination is more directly addressed by techniques like retrieval-augmented generation.
Why is data curation and formatting important for fine-tuning language models?
-High-quality curated data that captures the intended behavior is crucial. The way the data is formatted during training should be consistent with how the model will be used during inference, as inconsistencies can lead to incorrect or unexpected outputs.
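To make the consistency point concrete, here is a minimal Python sketch; the template wording and helper name are illustrative, not taken from the talk. The idea is to route both training-data generation and inference requests through a single formatting function:

```python
# Illustrative prompt template; not the exact wording used in the talk.
PROMPT_TEMPLATE = (
    "Write a SQL query to answer this question based on the table schema.\n\n"
    "Context: {schema}\n\n"
    "Question: {question}"
)

def build_prompt(schema: str, question: str) -> str:
    """Single source of truth for prompt formatting, called from BOTH the
    training-data pipeline and the inference client, so the model never
    sees an unfamiliar layout at serving time."""
    return PROMPT_TEMPLATE.format(schema=schema, question=question)

# The training example and the inference request use the same function,
# so their formats are guaranteed to match.
train_prompt = build_prompt("people(name TEXT, age INT)", "Who is the oldest person?")
infer_prompt = build_prompt("people(name TEXT, age INT)", "Who is the oldest person?")
assert train_prompt == infer_prompt
```

Centralizing the template this way also makes it easy to add controlled variations (e.g. alternative phrasings) to the training data if you want the model to be robust to them at inference.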
How does Ray Train assist in distributed fine-tuning of language models?
-Ray Train provides a simple, Pythonic API for orchestrating multi-process training workloads. It seamlessly integrates with other Ray libraries like Ray Data for distributed data ingestion, and offers features like automatic distributed environment setup, job scheduling, and observability tools for debugging.
What are the key factors to consider when setting up an evaluation pipeline for fine-tuned language models?
-It is important to set up a reliable and scalable evaluation pipeline that accurately measures the model's performance. This may involve techniques like using more powerful models like GPT-4 to create mock test cases or automate parts of the evaluation process.
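As an illustration of the execution-based comparison the speakers describe, here is a hedged sketch using Python's built-in sqlite3. The mock table below is hand-written, standing in for the GPT-4-generated mock tables mentioned in the talk, and the order-insensitive result comparison is a simplification:

```python
import sqlite3

def queries_equivalent(mock_table_ddl, mock_rows_sql, reference_sql, candidate_sql):
    """Run both queries against the same mock table and compare result sets.
    Comparison is order-insensitive (sorted); a deliberate simplification."""
    conn = sqlite3.connect(":memory:")
    conn.execute(mock_table_ddl)
    conn.executescript(mock_rows_sql)
    ref = sorted(conn.execute(reference_sql).fetchall())
    try:
        cand = sorted(conn.execute(candidate_sql).fetchall())
    except sqlite3.Error:
        return False  # candidate query doesn't even parse or run
    return ref == cand

# Hand-written mock data; the talk used GPT-4 to generate such tables.
ddl = "CREATE TABLE people (name TEXT, age INT)"
rows = "INSERT INTO people VALUES ('Ada', 36), ('Bob', 29);"
same = queries_equivalent(
    ddl, rows,
    "SELECT name FROM people WHERE age > 30",
    "SELECT name FROM people WHERE age >= 31",  # different text, same result
)
assert same
```

This captures why character-level or AST matching is insufficient: the two queries above differ textually but are behaviorally identical on the mock data.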
What tasks are particularly well-suited for fine-tuning open source language models?
-Tasks that involve following specific formats, such as natural language to SQL query generation or functional representation tasks, are well-suited for fine-tuning. These tasks do not necessarily require deep understanding of the world, but rather learning to map input formats to output formats.
What is parameter-efficient fine-tuning, and how does it differ from full parameter fine-tuning?
-Parameter-efficient fine-tuning, like LoRA (Low-Rank Adaptation), involves fine-tuning only a small subset of additional parameters instead of the entire model's parameters. This reduces memory footprint and checkpoint sizes compared to full parameter fine-tuning.
How does LoRA (Low-Rank Adaptation) work for parameter-efficient fine-tuning?
-In LoRA, the pre-trained weights are frozen, and two low-rank matrices A and B with far fewer parameters are added to the model during fine-tuning. This significantly reduces the number of trainable parameters while still allowing the model to adapt to the new task.
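The mechanics can be sketched in a few lines of NumPy; the dimensions, rank, and initialization scale below are illustrative, and the usual alpha/r scaling factor from the LoRA paper is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                # layer size and LoRA rank (illustrative)

W = rng.normal(size=(d, k))          # pre-trained weight, frozen during fine-tuning
A = rng.normal(size=(d, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, k))                 # second factor starts at zero, so the
                                     # update A @ B is initially a no-op

def lora_forward(x):
    # Frozen path plus the low-rank update; only A and B would receive gradients.
    return x @ W + x @ A @ B

x = rng.normal(size=(1, d))
assert np.allclose(lora_forward(x), x @ W)  # identical before any training

full_params = W.size           # 512 * 512 = 262144
lora_params = A.size + B.size  # 2 * 512 * 8 = 8192, about 3% of the layer
```

Because B is initialized to zero, fine-tuning starts exactly at the pre-trained model and only the small A and B matrices are updated, which is where the memory and checkpoint savings come from.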
What are some advantages of using LoRA for fine-tuning language models?
-LoRA allows for fine-tuning large language models on smaller hardware instances due to its reduced memory footprint. It also results in much smaller checkpoint sizes, making it more efficient for serving fine-tuned models in production.
What factors can affect the performance and stability of LoRA fine-tuning?
-The learning rate and prompting techniques used during training can impact the stability and performance of LoRA fine-tuning. Additionally, LoRA's performance may vary depending on the task complexity, with more challenging tasks like mathematical reasoning potentially seeing a larger quality gap compared to full parameter fine-tuning.
What is the trade-off between LoRA and full parameter fine-tuning in terms of model quality and efficiency?
-While full parameter fine-tuning may still have an edge in model quality (on the order of 1-3% relative accuracy in the experiments presented), LoRA offers significant advantages in terms of memory footprint and serving efficiency. The choice depends on whether model quality or serving efficiency is the higher priority for a given use case.
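A back-of-envelope calculation makes the storage side of this trade-off concrete. Every number below (7B parameters, fp16 weights, rank-8 adapters on four 4096x4096 attention projections across 32 layers) is an assumption chosen for illustration, not a figure from the talk:

```python
BYTES_FP16 = 2

# Full fine-tuning: every checkpoint stores all model weights.
full_params = 7_000_000_000                  # e.g. a 7B-parameter model
full_ckpt_gb = full_params * BYTES_FP16 / 1e9

# LoRA: a checkpoint stores only the adapter matrices.
d, r = 4096, 8                               # projection size and LoRA rank
n_adapted = 4 * 32                           # 4 attention projections x 32 layers
lora_params = n_adapted * 2 * d * r          # factors A and B per adapted matrix
lora_ckpt_mb = lora_params * BYTES_FP16 / 1e6

print(f"full: ~{full_ckpt_gb:.0f} GB per checkpoint")
print(f"LoRA: ~{lora_ckpt_mb:.0f} MB per adapter")
```

Under these assumptions a full checkpoint is around 14 GB while a LoRA adapter is under 20 MB, which is why serving many task-specific adapters on top of one shared base model becomes practical.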
Outlines
Introducing the Talk and Motivating Open Source LMs
The speaker, Kurosh, begins by welcoming the audience and introducing the talk's focus on lessons learned from fine-tuning LLaMa 2, an open-source language model. He highlights the promise of open-source LMs like LLaMa 2, which offer cost-effectiveness and data governance control compared to closed-source models like GPT-4. Kurosh emphasizes the recent progress in open-source LMs, with LLaMa 2 models nearing the performance of GPT-3.5. However, he notes two main challenges: factual grounding and format adherence. Fine-tuning is presented as a technique to address format issues, while retrieval-augmented generation tackles hallucination problems.
Benefits of Fine-tuning and Ray's Role in Distributed Training
Kurosh outlines several reasons to fine-tune language models. Few-shot prompting is limited by context window size, so fine-tuning can bake examples into the model's parameters. Fine-tuning also excels at handling tasks with specific formatting or tone requirements that are difficult to describe with prompts alone. It can save tokens and reduce serving costs compared to verbose prompts. Kurosh then introduces Ray and its train library, highlighting its advantages for distributed deep learning, such as its simple API, integration with Ray data, faster development tools, elegant job scheduling, and observability tools.
Setting up Fine-tuning Problems: Data and Evaluation
Kurosh emphasizes the importance of data collection, formatting, and evaluation when setting up fine-tuning problems for language models. Using the example of natural language to SQL query generation, he stresses the need for high-quality, curated datasets that capture the intended model behavior. Consistent formatting between training and inference data is crucial for optimal performance. Kurosh also highlights the importance of reliable evaluation pipelines, describing their approach of using GPT-4 to create mock tables and unit tests for evaluating SQL query outputs.
Experimental Results on Fine-tuning LLaMa 2
Kurosh presents experimental results of fine-tuning LLaMa 2 on various tasks, including functional representation, SQL generation, and math reasoning. The results show that while out-of-the-box language models perform poorly, fine-tuning can significantly boost performance, even outperforming GPT-4 on certain format-following tasks like SQL generation. However, for tasks requiring more reasoning and understanding, such as math problems, fine-tuning still lags behind GPT-4's performance. Kurosh suggests that fine-tuning excels in tasks where models need to learn input-output mappings without deeper understanding.
Parameter-Efficient Fine-tuning with LoRA
Arthur introduces parameter-efficient fine-tuning, specifically the LoRA (Low-Rank Adaptation) technique. LoRA freezes the pre-trained weights and adds low-rank matrices, reducing the number of trainable parameters. Experimental results show that LoRA performs almost as well as full fine-tuning on tasks like functional representation and SQL generation but lags slightly behind on math tasks, possibly due to the more complex optimization landscape. Arthur discusses LoRA's sensitivity to learning rates and the benefits of prompting for training stability. The main advantages of LoRA are reduced memory footprint during training and improved serving efficiency with smaller checkpoint sizes.
Lessons Learned and Closing Remarks
In the closing part, Kurosh and Arthur summarize the key lessons learned from their fine-tuning experiments. They emphasize the crucial importance of data set quality, consistent formatting between training and inference data, and the use of reliable evaluation pipelines (like GPT-4 in their case). They discuss LoRA's sensitivity to learning rates and prompting for training stability, as well as its advantages in memory footprint and serving efficiency compared to full fine-tuning. Finally, they highlight the potential of fine-tuning open-source models for niche, format-following tasks and invite the audience to another related talk.
Keywords
Open Source Language Models
Fine-tuning
Prompt Engineering
Parameter Efficient Fine-tuning
Data Curation
Evaluation Pipeline
Ray Train
Hallucination
Niche Tasks
Serving Efficiency
Highlights
Open source language models like LLaMA 2 are closing the gap compared to proprietary models like GPT-4 in terms of performance on various tasks, making them a promising alternative.
Fine-tuning language models can address the issue of models not following the desired format or intent, by baking the format or style into the model's internal knowledge.
Ray Train is a powerful framework for orchestrating multi-process training workloads, providing a simple API, distributed data ingestion, and observability tools for debugging.
Data curation and formatting are crucial for fine-tuning language models, ensuring high-quality and consistent data that captures the intended behavior.
Leveraging powerful models like GPT-4 can automate the setup of reliable evaluation pipelines for complex tasks where traditional evaluation methods may not work well.
Fine-tuning small language models can outperform larger models like GPT-4 on specific niche tasks that don't require extensive reasoning or world knowledge.
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning technique that adds a small number of trainable parameters, reducing memory footprint and enabling the use of smaller hardware.
LoRA can achieve comparable performance to full fine-tuning on certain tasks like functional representation and SQL generation, while falling slightly behind on more complex tasks like math reasoning.
LoRA is sensitive to hyperparameters like learning rate, and prompting can help improve training stability.
LoRA significantly reduces the checkpoint size and enables serving task-specific models efficiently, making it suitable for deploying fine-tuned models in production.
While full fine-tuning may still have a slight edge in model quality, LoRA offers substantial memory and serving efficiency advantages, enabling the deployment of fine-tuned models on smaller hardware.
The speakers emphasize the importance of consistent data formats between training and inference for language models to generalize effectively.
Fine-tuning can save tokens and reduce computational costs during deployment by baking the prompt or context into the model's internal knowledge.
Ray Train provides seamless integration with other libraries like Ray Data, enabling distributed data ingestion for large datasets.
The speakers highlight the benefits of open-source language models, such as lower costs, better data governance, and more control over the technology stack.
Transcripts
[Applause]
hello everyone can you guys hear me yeah
um Welcome to our talk my name is kurosh
I'm a tech lead in the AI team here at
any scale and together with Arthur we're
going to be talking about some of the
lessons we learned from fine-tuning
llama 2. I hope these insights that we
uncover in this talk will be of help to
you as well
so here's the outline of the talk
um I'm going to start by motivating the
promise behind open source LLMs and why
especially we need to fine-tune them I'm
going to briefly talk about how Ray Train
fits into the picture when it comes to LLM
distributed training and then we're
going to cover some learnings around
fine tuning problem set up and parameter
efficient fine tuning
so since the emergence of ChatGPT we've
seen two major separate trends
on one hand we have closed source
language models this includes models
like GPT-4 or Claude 2 from Anthropic
um these kind of serve as a very
powerful general purpose assistant model
that is capable of solving a wide
variety of tasks but
one of the kind of like things that are
on top of Mind of people is that they're
prohibitively expensive to run in
production and also more importantly
there's a lot of ambiguity around data
governance and how your data is get
getting used when you're using these
systems
um at the same time we have open
language models
this includes models like llama 2 from
meta or Falcon models or mosaic MPT
models
they kind of have promises on the other
side of this spectrum which is they're
often smaller and cheaper to run and
they more importantly they give you more
control to over your data and your
technology stack in serving them
what is more interesting is that in
recent months we've seen an immense
progress on the open language models
closing the Gap compared to proprietary
models like GPT-4 this is a leaderboard
from LMSYS kind of an organization at UC
Berkeley which kind of keeps track of
the the progress that is made on
language models by evaluating these
models on across a wide range of kind of
tasks and then puts them on this
leaderboard
llama 2 models have come very close to
kind of like GPT-3.5 and other proprietary
models
um but one of the kind of problems that
exists like in these language models you
can categorize them into two subsets
they're often like when these models
produce like completions what they
output is oftentimes not factually
grounded they often hallucinate and make
things up
and there's another category of problems
which is they often don't follow the
format that you have in your mind or
like in intent and to use these language
models for this figure kind of shows a
spectrum of techniques that kind of try
to address these two types of problems
on the bottom we've got prompt tuning or
prompt engineering and then few shot
prompting we have fine tuning which
addresses following a form problem
um and then we've got retrieval
augmented generation which explicitly
addresses the hallucination and on top
we've got reinforcement learning and
training from scratch which are kind of
like more complex and only available to
a few companies today we're going to
talk about fine-tuning and how it
addresses the form problems with these
language models
so why fine-tune language models in the
next few slides I'm going to cover a few
reasons that show that highlights that
shows the benefits of fine-tune language
models
first thing to point out is few shot
prompting is a technique that enables in
context learning meaning that we found
that like you can in language models you
can provide a few examples of desired
input outputs and fit them into the
context of these language models as
input and have them model generalize
that same pattern matching to unseen
data points
but there are often many times that your
data is huge and doesn't fit The Limited
context window that these language
models provide
so in this case in these scenarios what
you can do is instead of putting these
examples into the context bake them into
the neural network rates that
essentially present the internal
knowledge of these language models
um another reason to think about
fine-tuning is there are a lot of tasks
that are hard to describe in words some
of these like subtleties go around like
formatting of the output like a specific
output format that you have in mind or
having the model generate something in a
specific tone you may attempt to fix
these by prompting with phrases like
output this thing in this Json format or
like put something like the final answer
in this integer format that I want to
parse later in my software
but there are often many times that
language models don't respect these kind
of like phrases and you may need to
provide several examples to kind of
reinforce what you mean
um in the following a specific tone
another example is like you may say
something like hey write this in a
concise respectful or helpful manual
manner without being explicit what these
kind of words mean and you may need to
again provide some examples what these
words mean for the model
so with fine tuning we can actually
leverage a lot of illustrations and bake
that into the internal knowledge of the
model
it can also save you tokens
um there are many applications that you
can get away with prompt engineering but
oftentimes this prompt end up being too
wordy or verbose with many examples
but what what thing the thing that you
have to keep in mind
is if you want to run this in production
for every single request and every input
token output token that you want to
generate you have to feed in
the same context again and you're going
to have to perform computation on it so
if you have cases where this is too
verbose it's going to actually incur a
lot of cost during deployment with fine
tuning you can kind of implicitly bake
that this prompt again into the
knowledge of the network and get away
with like a cheaper serving cost
and last but not least as we show later
in the talk
with fine tuning you can oftentimes get
a faster cheaper model
at the same quality for some of the
niche tasks compared to let's say larger
models or even GPT-4 in some cases
um so this is a plot that I think you
guys have seen already in the Keynotes
and other talks here which kind of
demonstrates an example of what we mean
by niche task like SQL query generation
how we can
fine-tune these small models to kind of
outperform other powerful models for
this specific task
we're gonna cover more about like more
of the experimentation side later in the
talk
now I want to just highlight and briefly
talk about how Rey kind of fits into
this picture
and there's a great talk that was
presented by June Sean yesterday that
dives deeper into how Ray Train is a
production ready library for distributed
deep learning I'm not gonna cover
um as much details but I'm gonna just
highlight some of the features that
makes raytrain great for this type of
workload
um so what is Ray Train Ray Train in
my opinion is the best framework for
orchestrating multi-process training
workload and here is why
first of all it provides a very simple
API 100 pythonic that you can take
existing python code in your favorite
framework and just integrate it with
Ray Train to distribute it across your
cluster
um plus it has also seamless integration
with other libraries in the Ray ecosystem
like Ray data that provides distributed
data ingestion which can be very helpful
when you have when you're dealing with
large data sets
it provides Tools around faster
development
um for example it automatically sets up
distributed environments so that these
lower level libraries like CUDA and NCCL
these things can communicate to each
other and as an ml developer you have
you don't have to think about them and
just can focus on your model training
and you know loss curves and things like
that
um
another way to look at raytrain is that
it is a simple and elegant job
scheduling with features like Auto
scaling or support for heterogeneous
resources you can actually survive in
today's world where like gpus are very
scarce and there's like capacity issues
at reservation you can put together
heterogeneous clusters and get unblocked
when you're training something in
development
and last but not least there is a lot of
observability tools built around Ray
that helps us like easily debug
distributed applications and unblock
ourselves
um yeah so now that we talked about
um kind of the infrastructure side and
why we should do fine tuning let's talk
about what it takes to do fine tuning
how do we set up problems for
fine-tuning language models
so there are two main pillars that you
have to think about very carefully when
you want to set up a fine tuning problem
obviously there is data collection and
formatting and I want to really
highlight the importance of evaluation
so to concrete to crystallize these
things into concrete examples we're
going to use this natural language to
SQL query generation
so data set quality is crucial I think
you've heard it already from even Adobe
stock here
um in generative AI data set is kind of
the king and you have to invest a lot of
time in it to make sure you've got high
quality curated data that captures your
intention of how these language models
should behave
so in SQL generation
um we've the examples are formatted like
this you have like a natural language
statement that poses a question about a
data set and there is like a table
schema presented by a bunch of tables
and then
um like variable names and what data
type they have and then at the end a
desired query that you want these models
to generate
it's very important to make sure these
data sets are clean for this type of
study we did a lot of data curation
manually went through all these data
sets make sure kind of understood what
are the common errors in the data set
fixed them filter them to make sure for
example table names makes sense they
represent what the underlying data is
um data types match for example the
query that is generated so to get these
good results you gotta curate your data
and I can't emphasize it enough in just
one slide
next
thing that you have to think about the
data is
um and this is kind of an important one
is the way that you kind of format them
during training is going to impact how
you want to use them like ask the model
to do something so training and
inference data format should be very
consistent with each other
so I'm going to give you an example in
this SQL generation imagine my training
data set I structure all my examples
like this write a SQL query to answer
this question based on a table schema
followed by two newline symbols then the context
two newline symbols again and then
the question
and then have the model learn how to
Output the kind of corresponding query
I go ahead and train a model with this
but at inference time I come back and
ask the model the same question but in a
different format like here is a database
maybe I don't specify the schema
and then I ask it hey convert the
following to a SQL command like show
names blah blah and then when I see what
the model produces it's kind of like
wrong in subtle senses like it doesn't
it for example forgets the name of the
the schema or it doesn't do this order
by like descending
but
the reason behind like this thing is
that you have to think about how the
model has seen the data before it has
only seen the data in this particular
format and then you're throwing it at it
like a new kind of format of data which
kind of gets to converted to new symbols
that this model may not even recognize
and may generalize may not generalize
very well too
so it's very important
um to kind of have a consistent format
when you're actually running inference
on training on these models or if you
want to have variations in the type of
like data that goes into inference you
have to have the same type of variation
in your data as well so these models
learn to be robust to those type of
variations
and now I want to talk about a little
bit about setting up evaluation
Pipelines
this example is kind of very specific to
the SQL generation but it kind of
inspires other ways to think about it so
for see let's talk about SQL so SQL like
you your model output something like
select block and then you have a
reference output that you want to check
whether the model what model generated
is equivalent to
there are cases there are like this is a
contrived example but this kind of
greatly captures the nuances behind this
task it's very complicated to ensure
whether what the model outputted is
consistent or the same as the reference
output you cannot do character for
character matching you cannot do more
complex methods like abstract syntax
tree matching maybe you have like
expressions Math Expressions that are
equivalent but look different than this
an AST matching method would also not
work out well
uh what we did here was to use GPT-4
actually a powerful model
although it can become expensive
you're setting up an
evaluation pipeline so it's like a
one-time cost that you can pay up front
to set up evaluation pipelines that are
that you kind of keep consistent
throughout your experimentation
so what we did here was we asked
GPT-4 to create a bunch of mock tables
their like conditioned on for example
the the reference output and the table
schema where if we ran the reference
output against them we could check whether
what came out as a result of
running that query would match the same
thing that would come out of
running the model output
against the same mock table so by doing
so we kind of curated and handcrafted
maybe like 200 300 examples of such unit
tests where we could like run all of our
experimentations against and make sure
we've got like consistent evaluation
pipeline in a scalable way when we are
experimenting with this like fine tuning
tasks
so
um the takeaway from this is that there
are tasks out there that you may want to
apply fine-tuning to where
evaluation may be a hard
thing to do but you can leverage these
like more powerful models to kind of
automate that part and take some of the
human effort out of the the loop
now let's talk about some of the
learnings we had from running these
experiments on llama2 models
so
um this plot was kind of shown in the
keynote as well
um we have
applied fine tuning to kind of several
tasks that we thought might be relevant
to what other people might want to do
with these language models I already
talked about the SQL generation task in
the in details in the middle that's
that's what's shown in the middle
on the left side we have functional
representation which is just a task
where you have a
like an unstructured text
asking a question or like have a comment
about something and then your task is to
read that text and convert it to a
structured data this is a very common
tasks that in like Health space where
for example doctors write a lot of notes
and you have to kind of parse that and
extract it in a structured format
um and so that's that's basically what
is shown here and we've got another task
which is more more geared toward
um mathematical reasoning and logical
reasoning GSM8K is a data set of around
8 000 examples of basic math questions
followed by some answer and you want to
evaluate better language models can
solve this type of task
so what is shown here is that these
darker bars are
the performance and success rate of
these like models the
chat fine-tuned models right out of the
box so you don't do any specialized fine
tuning on them and compared to GPT-4
for example they do very poorly
they they're not even close to the
performance
but if you kind of use the training data
that is curated for these tasks and then
fine-tune these models and then do the
evaluation again you'll see that the
performance gets boosted so much that it
can actually beat GPT-4 in these two kind
of
tasks however
in some of the tasks that involve more
things than just following a format
right math involved requires
um more understanding of like reasoning
and logic piecing together different
piece things about the logic behind the
question and although fine-tuning can
help get you from let's say I don't know
40 to 50 it is still far behind
um kind of performance of more powerful
and general purpose models like GPT-4
what this kind of presents is the
opportunity for applying fine tuning on
these format-following tasks like
functional representation or SQL
generation are the kind of task that the
model does not have to really kind of
understand the world or how the world
functions they just have to learn how to
map like a certain format of input to
a certain other format in the output and
this is where like fine tuning can
really help
um now I'm gonna hand it off to Arthur
to talk about learnings from parameter
efficient fine tuning right thanks
Kurosh uh hello everyone all right so
now that we have seen the value of
these models let's talk about parameter
efficient fine tuning so first of all what is
parameter efficient fine tuning um in in
full parameter fine tuning what you do
is just a continuation of the training
but on Specialized data and parameter
efficient fine tuning is the same the
same thing but uh your only fine-tuning
a small number of parameters so this
could be a subset of the parameters of
the original model or it could be some
additional parameters
the point being that it has to be very
few parameters and there's a couple of
techniques that exist out there to do
this and one of these techniques is
LoRA
so LoRA means Low-Rank Adaptation of
LLMs and you see here on the left side a
kind of schematic of the internals of a
transformer and on the right side here
you see how LoRA works in principle
so for any given layer for from the
Transformer that is dense so for example
like a feed forward layer you can grab
that layer and you can apply LoRA to it
so what does that mean
um
well you have these pre-trained weights
and what you do during training with
LoRA is you freeze them and you set
them aside and this will become quite
important later so you set those pre you
freeze them and you set them aside and
then you add an additional Matrix a
times B that can be decomposed into two
low rank matrices A and B
and these two matrices combined have
very few parameters compared to the
original parameters the pre-trained
weights that you would normally be
fine-tuning so this is really where the
where the trick is here and this can
bring you two things
um first of all during training
obviously there's a a much smaller
Optimizer state to be kept in memory and
then second of all you're left with much
smaller checkpoints and we'll talk more
about this later but let's first talk a
little bit more about the quality of the
models that we gained out of fine tuning
with LoRA
right so this should look somewhat
familiar these are the same tasks that
Kurosh talked about earlier so we have
the functional representation task SQL
generation and the math task
and we added another shade here a medium
shade to the to the dark shade that
signifies the baseline and the light
shade that signifies how well full
parameter fine-tuning does so we added
the medium shade here to signify how
well Laura does and you can see for the
left two tests for functional
representation and SQL generation that
Laura did basically almost as well as
full parameter fine tuning so the
relative difference in accuracy here is
like one or two percent
and we can learn from this already that
with Laura we're able to solve some like
real world problems uh very well
actually better than uh what we got out
of gpt4
But on the right side you see the math task again, where LoRA is lagging a little bit behind. For the 13 and 70 billion parameter models we're seeing differences of like two or three percent, and for the seven billion parameter model the gap in quality was even greater. Our hypothesis around why this might be is that math is generally hard for LLMs to do, as we know, and LoRA is also a more difficult optimization task: since you have much fewer parameters to play with, the optimization landscape is a little more tricky, and this might just add up. So something we can maybe learn from this, and this has to be checked against future tasks that we look at, is that the performance of LoRA might depend a little bit on the type of task that you're looking at.
Right, so another thing that we learned with LoRA was that it's sensitive to the learning rate. With full-parameter fine-tuning, what you'll find generally is that it's very stable across a wide range of learning rates. When we used LoRA, we encountered some instabilities here. A learning rate that you'll see widely used on the internet is 1e-4, and we used that at first as well and ran into some of these instabilities. You can see here how, just by tweaking the learning rate a little bit, we got a much smoother learning curve, here in this purple one.
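The effect can be illustrated with a toy objective. This is not the talk's actual training run, just a made-up, badly conditioned quadratic standing in for LoRA's trickier optimization landscape:

```python
# Toy illustration (hypothetical numbers): gradient descent on a steep
# quadratic loss 0.5 * c * w^2. When lr * c approaches 2, the iterates
# oscillate in sign and the loss decays slowly; a smaller learning rate
# gives a smooth, fast decay, mirroring how a small learning-rate tweak
# smoothed the LoRA learning curve.
def loss_curve(lr, curvature=19.0, steps=50):
    w, curve = 1.0, []
    for _ in range(steps):
        w -= lr * curvature * w  # gradient of 0.5*curvature*w^2 is c*w
        curve.append(0.5 * curvature * w * w)
    return curve

unstable = loss_curve(lr=0.1)   # lr * c = 1.9: oscillating, slow decay
smooth = loss_curve(lr=0.05)    # lr * c = 0.95: smooth, fast decay
print(smooth[-1] < unstable[-1])  # True
```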
Yeah, and then another thing that we did to improve stability was, interestingly, prompting. What you can do during training (and obviously you have to do the same thing during evaluation, as Kurosh said) is apply some prompt engineering during fine-tuning. You create some helpful context for the model, for example "you're a helpful assistant, this is a SQL table and the query," and so on, and then you prepend that to what you're normally inputting to your model. With everything else fixed, like the seeding and the learning rate, what that left us with was an even smoother learning curve, here the orange one.
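As a concrete sketch, with a hypothetical context string and template (the talk doesn't show its exact wording), the key point is that the same context is prepended during both fine-tuning and evaluation:

```python
# Hypothetical context and template; the important part is using the
# identical formatting function for training and for inference.
CONTEXT = (
    "You are a helpful assistant. Given a SQL table schema and a "
    "question, respond with a SQL query.\n"
)

def format_example(schema: str, question: str) -> str:
    # Prepend the fixed context to the normal model input.
    return f"{CONTEXT}Schema: {schema}\nQuestion: {question}\nSQL: "

train_input = format_example("users(id, name)", "How many users are there?")
eval_input = format_example("users(id, name)", "List all user names.")
print(train_input.startswith(CONTEXT) and eval_input.startswith(CONTEXT))  # True
```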
Cool, so now that we've talked about how well LoRA does on these problems, and that we just might have to tweak it a little bit here and there, let's look at the upsides of LoRA. First of all, as I said in the beginning, the optimizer state is much smaller. For the 7 billion parameter model, for example, we were able to fine-tune it on a single AWS p4de.24xlarge instance, and we were simply not able to do the same thing with full-parameter fine-tuning. The other thing is, as you can see here, the checkpoint sizes are much smaller: with our LoRA settings we were left with checkpoints that are like 40 megabytes for the 7 billion parameter model, versus 12.6 gigabytes for full-parameter fine-tuning. Obviously, with full-parameter fine-tuning, every time you checkpoint you have to checkpoint the entire thing; with LoRA you're just checkpointing these two matrices A and B.
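A back-of-the-envelope check (assuming 16-bit weights and hypothetical LoRA settings; the exact numbers depend on the rank and on which layers are adapted) shows why the checkpoints differ by orders of magnitude:

```python
# Full checkpoint: every parameter of the 7B model, 2 bytes each in fp16.
full_gb = 7e9 * 2 / 1e9
print(full_gb)  # 14.0 GB, same order of magnitude as the reported 12.6 GB

# LoRA checkpoint (hypothetical settings): rank-8 adapters on two
# projection matrices in each of 32 transformer layers, hidden size 4096.
layers, adapted_per_layer, d, r = 32, 2, 4096, 8
lora_params = layers * adapted_per_layer * (d * r + r * d)  # A and B
lora_mb = lora_params * 2 / 1e6
print(round(lora_mb, 1))  # 8.4 MB, same order as the reported 40 MB
```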
Cool, so this brings us to our sixth learning. As I said in the beginning, during training you freeze these weights, set them aside, and add these two matrices A and B that are your LoRA weights. What this means during serving is that you take those frozen weights, the original model, and put it in memory, and then along with that you have an array of LoRA weights that are task-specific. This ties in very well with what Kurosh said initially: in order to beat these large, general-purpose, and very expensive models, we need to find small models that we fine-tune on niche, specific tasks. So you can imagine one set of LoRA weights per task here.
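A minimal sketch of that serving setup, with made-up shapes and task names: one frozen base weight stays resident in memory, and a tiny adapter is picked per request.

```python
import numpy as np

# One shared frozen base plus small per-task LoRA adapters (hypothetical
# sizes): selecting an adapter changes behavior without reloading the base.
d, r = 64, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))  # frozen base model weights, loaded once

adapters = {  # one set of LoRA weights per niche task
    "sql": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "func_repr": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

def serve(x, task):
    A, B = adapters[task]      # tiny, task-specific weights
    return x @ W + x @ A @ B   # shared base plus task-specific update

x = rng.standard_normal((1, d))
# Same base model, different adapters -> different task behavior.
print(np.allclose(serve(x, "sql"), serve(x, "func_repr")))  # False
```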
Right, so what have we learned about LoRA now, in terms of a trade-off? First of all, if your sole concern is model quality, there's no way around full-parameter fine-tuning: it will still have this edge of one, two, or three percent of relative accuracy. And the difference in training time between the two is really not there. Initially we thought LoRA must be much quicker, with fewer parameters, fewer things to checkpoint, and so on, but it turns out that if you look at the wall-clock time it takes the model to converge to a given perplexity, it's roughly the same between the two methods. What we really gained from LoRA is, first of all, the memory footprint, which can really unblock you on using smaller instance types in training, and second of all, the serving efficiency, which is just greatly enhanced.
Right, so here are all the learnings that we mentioned today. First of all, dataset quality is crucial. Training and inference data format consistency is crucial, and we used GPT-4 to set up a reliable evaluation pipeline. Then, LoRA is sensitive to the learning rate, and prompting helps with training stability. And lastly, the big advantage of LoRA is really the serving efficiency.
Yeah, so one more thing here: there's another talk about these LLMs in production by our chief scientist Walid, and that's going to be at 3:15 PM in Gate Ballroom B. Cool, thanks everyone for attending. Thank you.