Lessons From Fine-Tuning Llama-2

Anyscale
12 Oct 2023 · 28:57

Summary

TL;DR: This video distills the lessons learned from fine-tuning open-source language models like Llama 2. The speakers, Kurosh and Arthur, explain why fine-tuning matters for fixing format issues and improving performance on niche tasks. They emphasize the crucial role of data curation, consistent training and inference formats, and robust evaluation pipelines. Additionally, they highlight the advantages of parameter-efficient fine-tuning techniques like LoRA, which balance model quality against memory footprint and serving efficiency. The talk provides a comprehensive exploration of the challenges, learnings, and best practices for successfully fine-tuning large language models.

Takeaways

  • 😀 Open-source language models like Llama 2 offer cost-effectiveness and data control compared to proprietary models like GPT-4, and recent progress has narrowed the performance gap.
  • 🎯 Fine-tuning addresses the problem of models not following the desired output format or intent, enabling better control over their behavior.
  • 📂 Data curation and quality are crucial for fine-tuning: examples must be clean, representative, and capture the intended model behavior.
  • ⚖️ Consistency between training and inference data formats is essential for effective fine-tuning and model performance.
  • 🧪 Proper evaluation pipelines, potentially leveraging more powerful models like GPT-4, are vital for accurately assessing fine-tuned model performance.
  • 🚀 Ray Train provides a powerful and user-friendly framework for distributed training of language models, enabling efficient fine-tuning.
  • 💡 Fine-tuning excels at tasks like SQL generation and functional representation, where models learn to map input formats to desired output formats without deep reasoning.
  • ⚡ Parameter-efficient fine-tuning techniques like LoRA (Low-Rank Adaptation) offer memory and storage savings while maintaining good performance.
  • ⚙️ LoRA is sensitive to hyperparameters like the learning rate, and techniques like prompting improve its stability during training.
  • 🏆 While full-parameter fine-tuning may still have a slight edge in quality, LoRA offers significant advantages in serving efficiency and memory footprint.

Q & A

  • What is the motivation behind fine-tuning open source language models?

    -Open-source models are cheaper to run and give you control over your data, but out of the box they often fail to follow the intended output format. Fine-tuning bakes the desired format and behavior into the model for niche tasks, while hallucination is better addressed by complementary techniques such as retrieval-augmented generation.

  • Why is data curation and formatting important for fine-tuning language models?

    -High-quality curated data that captures the intended behavior is crucial. The way the data is formatted during training should be consistent with how the model will be used during inference, as inconsistencies can lead to incorrect or unexpected outputs.
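
To make the consistency point concrete, here is a minimal sketch of sharing one formatting function between the training-data builder and the inference client. The template text loosely mirrors the SQL example from the talk; all names and data are illustrative, not the speakers' actual code.

```python
# One template shared by training and inference, so the model never
# sees a prompt shape it wasn't trained on. All names are illustrative.
PROMPT_TEMPLATE = (
    "Write a SQL query to answer this question based on a table schema\n\n"
    "{context}\n\n"
    "{question}"
)

def format_example(context: str, question: str) -> str:
    """Render a prompt exactly as it appears in the training data."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

schema = "CREATE TABLE head (name VARCHAR, born_state VARCHAR, age INTEGER)"
question = "List the names of the heads ordered by age descending."

# Training: pair the formatted prompt with the reference completion.
train_pair = (format_example(schema, question),
              "SELECT name FROM head ORDER BY age DESC")

# Inference: reuse the identical formatter -- never a hand-typed variant.
prompt = format_example(schema, question)
```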

  • How does Ray Train assist in distributed fine-tuning of language models?

    -Ray Train provides a simple, Pythonic API for orchestrating multi-process training workloads. It seamlessly integrates with other Ray libraries like Ray Data for distributed data ingestion, and offers features like automatic distributed environment setup, job scheduling, and observability tools for debugging.
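
As a rough sketch of what that API looks like (based on the Ray 2.x `TorchTrainer`; the loop body is a placeholder, not the fine-tuning code from the talk):

```python
# Hedged sketch of Ray Train's Pythonic API (Ray 2.x). The loop body is a
# stand-in; a real run would build the LLM, dataloaders, and optimizer here.
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker():
    # Ray has already wired up the distributed environment (NCCL, ranks).
    model = torch.nn.Linear(8, 8)      # placeholder for the real model
    model = prepare_model(model)       # wraps it for distributed training
    # ... the usual PyTorch training loop goes here ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```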

  • What are the key factors to consider when setting up an evaluation pipeline for fine-tuned language models?

    -It is important to set up a reliable and scalable evaluation pipeline that accurately measures the model's performance. This may involve techniques like using more powerful models like GPT-4 to create mock test cases or automate parts of the evaluation process.
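
A hedged sketch of that execution-based check, using an in-memory SQLite table standing in for a GPT-4-generated mock table (the GPT-4 generation step is elided, and all names and data are illustrative):

```python
# Compare two SQL queries by executing both against the same mock table.
import sqlite3

def same_result(model_sql: str, reference_sql: str, setup_sql: str) -> bool:
    """Run both queries against the same mock table and compare row sets."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        got = conn.execute(model_sql).fetchall()
        want = conn.execute(reference_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    finally:
        conn.close()
    return sorted(got) == sorted(want)

setup = """
CREATE TABLE head (name TEXT, age INTEGER);
INSERT INTO head VALUES ('Ada', 61), ('Grace', 52);
"""
print(same_result("SELECT name FROM head WHERE age > 55",
                  "SELECT name FROM head WHERE age >= 56",
                  setup))  # True: both return ('Ada',)
```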

  • What tasks are particularly well-suited for fine-tuning open source language models?

    -Tasks that involve following specific formats, such as natural language to SQL query generation or functional representation tasks, are well-suited for fine-tuning. These tasks do not necessarily require deep understanding of the world, but rather learning to map input formats to output formats.

  • What is parameter-efficient fine-tuning, and how does it differ from full parameter fine-tuning?

    -Parameter-efficient fine-tuning, like LoRA (Low-Rank Adaptation of LMs), involves fine-tuning only a small subset of additional parameters instead of the entire model's parameters. This reduces memory footprint and checkpoint sizes compared to full parameter fine-tuning.

  • How does LoRA (Low-Rank Adaptation) work for parameter-efficient fine-tuning?

    -In LoRA, the pre-trained weights are frozen, and the weight update is learned as the product of two low-rank matrices, A and B, which together have far fewer parameters than the original weight matrix. This significantly reduces the number of trainable parameters (and the optimizer state held in memory) while still allowing the model to adapt to the new task.
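
Here is a minimal PyTorch sketch of the idea: the pre-trained layer is frozen, and a scaled low-rank product is added to its output path. The rank and scaling values are common illustrative defaults, not the talk's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank             # common LoRA scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable params vs ~16.8M in the full layer
```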

  • What are some advantages of using LoRA for fine-tuning language models?

    -LoRA allows for fine-tuning large language models on smaller hardware instances due to its reduced memory footprint. It also results in much smaller checkpoint sizes, making it more efficient for serving fine-tuned models in production.
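
A hypothetical sketch of why this helps serving: checkpoint only the trainable A/B parameters, keep one frozen base model in memory, and swap tiny per-task adapters on top of it. Function names are illustrative, and it assumes the LoRA matrices are the only trainable parameters, as in the sketch above.

```python
# Illustrative sketch: persist and swap only the LoRA A/B matrices.
import torch

def save_adapter(model: torch.nn.Module, path: str) -> None:
    """Checkpoint only the trainable (LoRA) parameters -- MBs, not GBs."""
    adapter = {name: p.detach().cpu()
               for name, p in model.named_parameters() if p.requires_grad}
    torch.save(adapter, path)

def load_adapter(model: torch.nn.Module, path: str) -> None:
    """Overlay a task-specific adapter onto the shared frozen base model."""
    model.load_state_dict(torch.load(path), strict=False)
```

Serving N niche tasks then means one base model plus N small adapter files, instead of N full checkpoints.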

  • What factors can affect the performance and stability of LoRA fine-tuning?

    -The learning rate and prompting techniques used during training can impact the stability and performance of LoRA fine-tuning. Additionally, LoRA's performance may vary depending on the task complexity, with more challenging tasks like mathematical reasoning potentially seeing a larger quality gap compared to full parameter fine-tuning.

  • What is the trade-off between LoRA and full parameter fine-tuning in terms of model quality and efficiency?

    -While full parameter fine-tuning may still have an edge in model quality (1-3% relative accuracy), LoRA offers significant advantages in terms of memory footprint and serving efficiency. The choice depends on whether model quality or serving efficiency is the higher priority for a given use case.

Outlines

00:00

🔊 Introducing the Talk and Motivating Open-Source LLMs

The speaker, Kurosh, begins by welcoming the audience and introducing the talk's focus on lessons learned from fine-tuning Llama 2, an open-source language model. He highlights the promise of open-source LLMs like Llama 2, which offer cost-effectiveness and control over data governance compared to closed-source models like GPT-4. Kurosh emphasizes the recent progress in open-source LLMs, with Llama 2 models nearing the performance of GPT-3.5. However, he notes two main challenges: lack of factual grounding (hallucination) and poor format adherence. Fine-tuning is presented as a technique to address format issues, while retrieval-augmented generation tackles hallucination.

05:01

🌍 Benefits of Fine-tuning and Ray's Role in Distributed Training

Kurosh outlines several reasons to fine-tune language models. Few-shot prompting is limited by context window size, so fine-tuning can instead bake examples into the model's parameters. Fine-tuning also excels at tasks with specific formatting or tone requirements that are difficult to describe with prompts alone, and it can save tokens and reduce serving costs compared to verbose prompts. Kurosh then introduces Ray and its Ray Train library, highlighting its advantages for distributed deep learning: a simple API, integration with Ray Data, faster development tooling, elegant job scheduling, and observability tools.

10:03

βš™οΈ Setting up Fine-tuning Problems: Data and Evaluation

Kurosh emphasizes the importance of data collection, formatting, and evaluation when setting up fine-tuning problems for language models. Using the example of natural language to SQL query generation, he stresses the need for high-quality, curated datasets that capture the intended model behavior. Consistent formatting between training and inference data is crucial for optimal performance. Kurosh also highlights the importance of reliable evaluation pipelines, describing their approach of using GPT-4 to create mock tables and unit tests for evaluating SQL query outputs.

15:06

🧪 Experimental Results on Fine-tuning Llama 2

Kurosh presents experimental results of fine-tuning Llama 2 on various tasks, including functional representation, SQL generation, and math reasoning. The results show that while out-of-the-box models perform poorly, fine-tuning can significantly boost performance, even outperforming GPT-4 on certain format-following tasks like SQL generation. However, for tasks requiring more reasoning and understanding, such as math problems, fine-tuned models still lag behind GPT-4. Kurosh suggests that fine-tuning excels on tasks where models need to learn input-output mappings without deeper understanding.

20:07

πŸ” Parameter-Efficient Fine-tuning with LoRA

Arthur introduces parameter-efficient fine-tuning, specifically the LoRA (Low-Rank Adaptation) technique. LoRA freezes the pre-trained weights and adds low-rank matrices, reducing the number of trainable parameters. Experimental results show that LoRA performs almost as well as full fine-tuning on tasks like functional representation and SQL generation but lags slightly behind on math tasks, possibly due to the more complex optimization landscape. Arthur discusses LoRA's sensitivity to learning rates and the benefits of prompting for training stability. The main advantages of LoRA are reduced memory footprint during training and improved serving efficiency with smaller checkpoint sizes.

25:09

🎓 Lessons Learned and Closing Remarks

In the closing part, Kurosh and Arthur summarize the key lessons learned from their fine-tuning experiments. They emphasize the crucial importance of data set quality, consistent formatting between training and inference data, and the use of reliable evaluation pipelines (like GPT-4 in their case). They discuss LoRA's sensitivity to learning rates and prompting for training stability, as well as its advantages in memory footprint and serving efficiency compared to full fine-tuning. Finally, they highlight the potential of fine-tuning open-source models for niche, format-following tasks and invite the audience to another related talk.

Keywords

💡 Open Source Language Models

Open-source language models are large language models like Llama 2, Falcon, and MPT that are publicly available and can be fine-tuned and customized by anyone. They promise lower costs and more data control compared to closed-source models like GPT-4. The video highlights the immense progress of open language models in closing the gap with proprietary models like GPT-3.5.

💡 Fine-tuning

Fine-tuning is the process of taking a pre-trained language model and continuing its training on a specific dataset to specialize it for a particular task or domain. The video discusses how fine-tuning can help language models follow the desired output format, tone, or structure better than prompt engineering alone. Fine-tuning is presented as a solution for tasks that are hard to describe with words.

💡 Prompt Engineering

Prompt engineering refers to the techniques used to provide context and examples to a language model through prompts to guide its outputs. The video contrasts prompt engineering with fine-tuning, explaining that while prompting can enable in-context learning, fine-tuning is necessary when the data is too large to fit in the model's context window or when the desired output format is difficult to specify through prompts alone.
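
For illustration, a few-shot prompt might look like the sketch below; every worked example consumes context-window tokens on every request, which is the per-request cost that fine-tuning removes. The examples here are made up.

```python
# Illustrative few-shot prompt: the two worked examples are paid for in
# context-window tokens on every single request.
FEW_SHOT_PROMPT = """\
Convert the question to SQL.

Q: How many department heads are older than 56?
SQL: SELECT COUNT(*) FROM head WHERE age > 56

Q: List all department names.
SQL: SELECT name FROM department

Q: {question}
SQL:"""

print(FEW_SHOT_PROMPT.format(question="List heads ordered by age descending."))
```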

💡 Parameter Efficient Fine-tuning

Parameter efficient fine-tuning, like the LoRA (Low-Rank Adaptation) technique, involves fine-tuning only a small subset of the model's parameters or adding a few additional parameters instead of tuning the entire model. This approach reduces memory requirements during training and results in smaller model checkpoints, enabling fine-tuning on smaller hardware while retaining most of the performance gains of full fine-tuning.

💡 Data Curation

Data curation refers to the process of carefully cleaning, formatting, and validating the training data used for fine-tuning language models. The video emphasizes the importance of high-quality, curated datasets that capture the intended behavior of the language model. For example, in SQL query generation, the presenters manually went through the data to fix errors and ensure consistency between table names, data types, and expected query outputs.

💡 Evaluation Pipeline

An evaluation pipeline is a systematic process for assessing the performance of fine-tuned language models on a specific task. The video discusses using GPT-4 to create a scalable and automated evaluation pipeline for SQL query generation, where mock tables were generated based on reference outputs to compare the model's outputs against the expected results.

💡 Ray Train

Ray Train is a distributed deep learning library that the presenters highlight as a valuable tool for fine-tuning large language models. It provides a simple API for integrating existing Python code, seamless integration with data ingestion libraries, and tools for faster development, distributed environment setup, and observability.

💡 Hallucination

Hallucination refers to the tendency of language models to generate outputs that are not factually grounded or consistent with the provided context. The video mentions techniques like retrieval-augmented generation and reinforcement learning as potential solutions for addressing hallucination, while fine-tuning is presented as a way to address issues with following the desired output format or intent.

💡 Niche Tasks

Niche tasks are specific, narrow tasks or domains where fine-tuned language models can outperform larger, more general-purpose models like GPT-4. The video showcases examples like SQL query generation and functional representation extraction as niche tasks where fine-tuning smaller models can achieve better performance than GPT-4.

💡 Serving Efficiency

Serving efficiency refers to the computational cost and resource requirements for deploying and serving a fine-tuned language model in production. The video highlights parameter efficient fine-tuning techniques like LoRA as a way to improve serving efficiency by reducing the model checkpoint sizes and memory footprint, enabling deployment on smaller hardware while retaining most of the performance gains from fine-tuning.

Highlights

Open-source language models like Llama 2 are closing the performance gap with proprietary models like GPT-4 on various tasks, making them a promising alternative.

Fine-tuning language models can address the issue of models not following the desired format or intent, by baking the format or style into the model's internal knowledge.

Ray Train is a powerful framework for orchestrating multi-process training workloads, providing a simple API, distributed data ingestion, and observability tools for debugging.

Data curation and formatting are crucial for fine-tuning language models, ensuring high-quality and consistent data that captures the intended behavior.

Leveraging powerful models like GPT-4 can automate the setup of reliable evaluation pipelines for complex tasks where traditional evaluation methods may not work well.

Fine-tuning small language models can outperform larger models like GPT-4 on specific niche tasks that don't require extensive reasoning or world knowledge.

Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning technique that adds a small number of trainable parameters, reducing memory footprint and enabling the use of smaller hardware.

LoRA can achieve comparable performance to full fine-tuning on certain tasks like functional representation and SQL generation, while falling slightly behind on more complex tasks like math reasoning.

LoRA is sensitive to hyperparameters like learning rate, and prompting can help improve training stability.

LoRA significantly reduces the checkpoint size and enables serving task-specific models efficiently, making it suitable for deploying fine-tuned models in production.

While full fine-tuning may still have a slight edge in model quality, LoRA offers substantial memory and serving efficiency advantages, enabling the deployment of fine-tuned models on smaller hardware.

The speakers emphasize the importance of consistent data formats between training and inference for language models to generalize effectively.

Fine-tuning can save tokens and reduce computational costs during deployment by baking the prompt or context into the model's internal knowledge.

Ray Train provides seamless integration with other libraries like Ray Data, enabling distributed data ingestion for large datasets.

The speakers highlight the benefits of open-source language models, such as lower costs, better data governance, and more control over the technology stack.

Transcripts

00:03
[Applause]

00:06
Hello everyone, can you guys hear me? Yeah. Welcome to our talk. My name is Kurosh, I'm a tech lead on the AI team here at Anyscale, and together with Arthur we're going to be talking about some of the lessons we learned from fine-tuning Llama 2. I hope the insights we uncover in this talk will be of help to you as well.

00:29
So here's the outline of the talk. I'm going to start by motivating the promise behind open-source LLMs and why, especially, we need to fine-tune them. I'm going to briefly talk about how Ray Train fits into the picture when it comes to LLM distributed training, and then we're going to cover some learnings around fine-tuning problem setup and parameter-efficient fine-tuning.

00:54
Since the emergence of ChatGPT, we've seen two major separate trends. On one hand we have closed-source language models; this includes models like GPT-4, or Claude 2 from Anthropic. These serve as very powerful general-purpose assistant models that are capable of solving a wide variety of tasks, but one of the things that is top of mind for people is that they're prohibitively expensive to run in production, and, more importantly, there's a lot of ambiguity around data governance and how your data is getting used when you're using these systems.

01:39
At the same time we have open language models; this includes models like Llama 2 from Meta, the Falcon models, or Mosaic's MPT models. Their promise is on the other side of this spectrum: they're often smaller and cheaper to run, and, more importantly, they give you more control over your data and over your technology stack in serving them.

02:07
What is more interesting is that in recent months we've seen immense progress in open language models closing the gap compared to proprietary models like GPT-4. This is a leaderboard from LMSYS, an organization at UC Berkeley, which keeps track of the progress made on language models by evaluating them across a wide range of tasks and putting them on this leaderboard. Llama 2 models have come very close to GPT-3.5 and other proprietary models.

02:49
But the problems that exist in these language models can be categorized into two subsets. When these models produce completions, what they output is oftentimes not factually grounded: they often hallucinate and make things up. And there's another category of problems, which is that they often don't follow the format that you have in mind, or your intent, when you use these language models.

03:21
This figure shows a spectrum of techniques that try to address these two types of problems. On the bottom we've got prompt tuning, or prompt engineering, and then few-shot prompting. We have fine-tuning, which addresses the format-following problem. Then we've got retrieval-augmented generation, which explicitly addresses hallucination, and on top we've got reinforcement learning and training from scratch, which are more complex and only available to a few companies. Today we're going to talk about fine-tuning and how it addresses the format problems with these language models.

04:03
So why fine-tune language models? In the next few slides I'm going to cover a few reasons that highlight the benefits of fine-tuned language models. The first thing to point out is that few-shot prompting is a technique that enables in-context learning, meaning that you can provide a few examples of desired inputs and outputs, fit them into the context of these language models as input, and have the model generalize that same pattern matching to unseen data points. But there are often many times when your data is huge and doesn't fit in the limited context window that these language models provide. In these scenarios, instead of putting these examples into the context, what you can do is bake them into the neural network weights that essentially represent the internal knowledge of these language models.

05:09
Another reason to think about fine-tuning is that there are a lot of tasks that are hard to describe in words. Some of these subtleties are around formatting: there is a specific output format that you have in mind, or you want the model to generate something in a specific tone. You may attempt to fix these by prompting with phrases like "output this in this JSON format" or "put the final answer in this integer format that I want to parse later in my software," but there are many times when language models don't respect these phrases, and you may need to provide several examples to reinforce what you mean. On following a specific tone, another example: you may say something like "hey, write this in a concise, respectful, or helpful manner" without being explicit about what these words mean, and you may again need to provide some examples of what these words mean to the model. So with fine-tuning we can leverage a lot of illustrations and bake them into the internal knowledge of the model.

06:24
It can also save you tokens. There are many applications where you can get away with prompt engineering, but oftentimes the prompt ends up being too wordy or verbose, with many examples. The thing you have to keep in mind is that if you want to run this in production, for every single request, and for every input and output token that you want to generate, you have to feed in the same context and perform computation on it. So in cases where the prompt is too verbose, it's going to incur a lot of cost during deployment. With fine-tuning you can implicitly bake that prompt into the knowledge of the network and get away with a cheaper serving cost.

07:11
And last but not least, as we show later in the talk, with fine-tuning you can oftentimes get a faster, cheaper model at the same quality for some niche tasks, compared to larger models, or even GPT-4 in some cases. This is a plot that I think you have seen already in the keynotes and other talks here, which demonstrates an example of what we mean by a niche task, like SQL query generation, and how we can fine-tune these small models to outperform other powerful models on this specific task. We're going to cover more of the experimentation side later in the talk.

07:57
Now I want to briefly talk about how Ray fits into this picture. There's a great talk that was presented by June Sean yesterday that dives deeper into how Ray Train is a production-ready library for distributed deep learning. I'm not going to cover as much detail, but I'm going to highlight some of the features that make Ray Train great for this type of workload.

08:27
So what is Ray Train? Ray Train, in my opinion, is the best framework for orchestrating multi-process training workloads, and here is why. First of all, it provides a very simple, 100% Pythonic API: you can take existing Python code in your favorite framework and just integrate it with Ray Train to distribute it across your cluster. Plus, it has seamless integration with other libraries in the Ray ecosystem, like Ray Data, which provides distributed data ingestion; that can be very helpful when you're dealing with large datasets.

09:06
It provides tools for faster development. For example, it automatically sets up distributed environments so that the lower-level libraries (CUDA, NCCL) can communicate with each other, and as an ML developer you don't have to think about them; you can just focus on your model training and your loss curves and things like that.

09:31
Another way to look at Ray Train is that it is simple and elegant job scheduling, with features like autoscaling and support for heterogeneous resources. You can actually survive in today's world, where GPUs are very scarce and there are capacity issues and reservations: you can put together heterogeneous clusters and get unblocked when you're training something in development. And last but not least, there are a lot of observability tools built around Ray that help us easily debug distributed applications and unblock ourselves.

10:11
Now that we've talked about the infrastructure side and why we should do fine-tuning, let's talk about what it takes to do fine-tuning: how do we set up problems for fine-tuning language models? There are two main pillars that you have to think about very carefully when you want to set up a fine-tuning problem. Obviously there is data collection and formatting, and I want to really highlight the importance of evaluation. To crystallize these into concrete examples, we're going to use this natural-language-to-SQL query generation task.

10:50
Dataset quality is crucial. I think you've heard it already in the Adobe talk here: in generative AI, the dataset is king, and you have to invest a lot of time in it to make sure you've got high-quality, curated data that captures your intention of how these language models should behave.

11:12
In SQL generation, the examples are formatted like this: you have a natural language statement that poses a question about a dataset; there is a table schema, represented by a bunch of tables with variable names and their data types; and then, at the end, the desired query that you want these models to generate.

11:37
It's very important to make sure these datasets are clean. For this type of study we did a lot of data curation: we manually went through all these datasets, made sure we understood what the common errors in the data were, fixed them, and filtered them, to make sure, for example, that table names make sense and represent the underlying data, and that data types match the query that is generated. To get these good results you've got to curate your data, and I can't emphasize that enough in just one slide.

12:15
The next thing that you have to think about with the data, and this is an important one, is that the way you format it during training is going to impact how you can use the model, how you ask it to do something. Training and inference data formats should be very consistent with each other. I'm going to give you an example with this SQL generation task. Imagine that in my training dataset I structure all my examples like this: "Write a SQL query to answer this question based on a table schema," followed by two newline symbols, the context, two newline symbols again, and then the question, and I have the model learn how to output the corresponding query. I go ahead and train a model with this, but at inference time I come back and ask the model the same question in a different format, like "here is a database" (maybe I don't specify the schema), and then I ask it, "hey, convert the following to a SQL command: show names, blah blah." When I look at what the model produces, it's wrong in subtle ways: for example, it forgets the name of the schema, or it doesn't do the ORDER BY descending.

13:33
The reason behind this is that you have to think about how the model has seen the data before. It has only seen the data in this particular format, and then you're throwing a new format of data at it, which gets converted into new symbols that the model may not recognize and may not generalize very well to. So it's very important to have a consistent format between training and inference on these models, or, if you want to have variations in the type of data that goes into inference, you have to have the same type of variation in your training data as well, so the model learns to be robust to those types of variations.

14:19
Now I want to talk a little bit about setting up evaluation pipelines. This example is very specific to the SQL generation task, but it can inspire other ways to think about the problem. So let's talk about SQL. Your model outputs something like "SELECT blah," and you have a reference output that you want to check the model's output against for equivalence. This is a contrived example, but it captures the nuances behind this task: it's very complicated to ensure that what the model output is consistent with, or the same as, the reference output. You cannot do character-for-character matching, and more complex methods like abstract syntax tree matching don't work either: maybe you have math expressions that are equivalent but look different, and then an AST-matching method would also not work out well.

15:18
What we did here was to use GPT-4, a powerful model. Although it can become expensive, you're building an evaluation pipeline, so it's a one-time cost that you can pay up front to set up an evaluation pipeline that you keep consistent throughout your experimentation.

15:39
We asked GPT-4 to create a bunch of mock tables, conditioned on, for example, the reference output and the table schema, such that if we ran the reference query against a mock table, we could check whether its result matched the result of running the model's output against the same mock table. By doing so, we curated and handcrafted maybe 200-300 examples of such unit tests that we could run all of our experimentation against, making sure we had a consistent evaluation pipeline, in a scalable way, while experimenting with these fine-tuning tasks.

16:31
So the takeaway is that there are tasks you may want to apply fine-tuning to where evaluation is hard, but you can leverage these more powerful models to automate that part and take some of the human effort out of the loop.

16:53
Now let's talk about some of the learnings we had from running these experiments on Llama 2 models.

17:03
This plot was shown in the keynote as well. We applied fine-tuning to several tasks that we thought might be relevant to what other people might want to do with these language models. I already talked about the SQL generation task in detail; that's what's shown in the middle. On the left side we have functional representation, which is a task where you have unstructured text asking a question or making a comment about something, and your task is to read that text and convert it to structured data. This is a very common task in, for example, the health space, where doctors write a lot of notes and you have to parse them and extract the information in a structured format. And we've got another task which is more geared toward mathematical and logical reasoning: GSM8K is a dataset of around 8,000 examples of basic math questions, followed by answers, and you want to evaluate how well language models can solve this type of task.

18:20
What is shown here is that the darker bars are the performance, the success rate, of the chat fine-tuned models right out of the box, without any specialized fine-tuning. Compared to GPT-4, for example, they do very poorly; they're not even close to its performance. But if you use the training data that is curated for these tasks, fine-tune these models, and then do the evaluation again, you'll see that the performance gets boosted so much that it can actually beat GPT-4 on these two kinds of tasks.

19:07
However, on tasks that involve more than just following a format (math requires more understanding, reasoning, and logic, piecing together the different things behind the question), although fine-tuning can help get you from, let's say, 40 to 50, it is still far behind the performance of more powerful, general-purpose models like GPT-4.

19:38
What this presents is the opportunity for applying fine-tuning to these format-following tasks. Tasks like this functional representation or SQL generation are the kind of task where the model does not have to really understand the world or how the world functions; it just has to learn how to map a certain format of input to a certain format of output, and this is where fine-tuning can really help. Now I'm going to hand it off to Arthur to talk about learnings from parameter-efficient fine-tuning.

20:15
Thanks, Kurosh. Hello everyone. All right, so now that we have seen the value of these models, let's talk about parameter-efficient fine-tuning. First of all, what is parameter-efficient fine-tuning? In full-parameter fine-tuning, what you do is just a continuation of the training, but on specialized data. Parameter-efficient fine-tuning is the same thing, but you're only fine-tuning a small number of parameters: this could be a subset of the parameters of the original model, or it could be some additional parameters, the point being that it has to be very few parameters. There are a couple of techniques out there to do this, and one of these techniques is LoRA.

20:57
LoRA means Low-Rank Adaptation of LLMs. You see here, on the left side, a schematic of the internals of a transformer, and on the right side you see how LoRA works in principle. For any given layer of the transformer that is dense, for example a feed-forward layer, you can grab that layer and apply LoRA to it. So what does that mean? Well, you have these pre-trained weights, and what you do during training with LoRA is you freeze them and set them aside; this will become quite important later. Then you add an additional matrix that can be decomposed into two low-rank matrices, A and B, and these two matrices combined have very few parameters compared to the original pre-trained weights that you would normally be fine-tuning. This is really where the trick is, and it can bring you two things: first of all, during training there's a much smaller optimizer state to be kept in memory, and second of all, you're left with much smaller checkpoints. We'll talk more about this later, but let's first talk a little bit more about the quality of the models that we got out of fine-tuning with LoRA.

22:18
Right, so this should look somewhat familiar: these are the same tasks that Kurosh talked about earlier. We have the functional representation task, SQL generation, and the math task. The dark shade signifies the baseline and the light shade signifies how well full-parameter fine-tuning does, and we added a medium shade here to signify how well LoRA does. You can see, for the left two tasks, functional representation and SQL generation, that LoRA did almost as well as full-parameter fine-tuning; the relative difference in accuracy is like one or two percent. We can learn from this that with LoRA we're able to solve some real-world problems very well, actually better than what we got out of GPT-4.

23:06
But on the right side you see the math task again, where LoRA is lagging a little bit behind. For the 13B and 70B parameter models we're seeing differences of like two or three percent, and for the 7-billion-parameter model the lag in quality was even greater. Our hypothesis about why this might be is that math is generally hard for LLMs, as we know, and LoRA is also a more difficult optimization task: since you have far fewer parameters to play with, the optimization landscape is a little more tricky, and these might just add up. So something we can maybe learn from this, and it has to be confirmed on future tasks that we look at, is that the performance of LoRA might depend a little bit on the type of task that you're looking at.

23:53
Another thing that we learned about LoRA is that it's sensitive to the learning rate. With full-parameter fine-tuning, what you'll generally find is that it's very stable across a wide range of learning rates. When we used LoRA, we encountered some instabilities here: a learning rate that you'll see widely used on the internet is 1e-4, and we used that at first as well and ran into some of these instabilities. You can see here how, just by tweaking the learning rate a little bit, we got a much smoother learning curve, here in this purple one.

24:26
Another thing that we did to improve stability was, interestingly, prompting. What you can do during training (and obviously you have to do the same thing during evaluation, as Kurosh said) is apply some prompt engineering during fine-tuning: you create some helpful context for the model, like, for example, "you're a helpful assistant, this is a SQL table and the query," and stuff like that, and you prepend that to what you're normally inputting to your model. With everything else fixed, like the seeding and the learning rate, what that left us with was an even smoother learning curve, here, the orange one.

25:17
Cool. So now that we've talked about how well LoRA does on these problems, and that we just might have to tweak it a little bit here and there, let's look at the upsides of LoRA. First of all, as I said in the beginning, the optimizer state is much smaller. For example, we were able to fine-tune the 7-billion-parameter model on a single AWS p4de.24xlarge instance, and we were simply not able to do the same thing with full-parameter fine-tuning. And the other thing is, as you can see here, the checkpoint sizes are much smaller: with our LoRA settings we were left with checkpoints that are about 40 megabytes for the 7-billion-parameter model, versus 12.6 gigabytes for full-parameter fine-tuning. Obviously, with full-parameter fine-tuning, every time you checkpoint you have to checkpoint the entire thing; with LoRA you're just checkpointing these two matrices, A and B.

26:19
Cool, so this brings us to our sixth learning. As I said in the beginning, during training you freeze these weights and set them aside, and you add these two matrices A and B that are your LoRA weights. What this means during serving is that you take those frozen weights, the original model, and put them in memory, and along with that you have an array of LoRA weights that are task-specific. This ties in very well with what Kurosh said initially: in order to beat these large, general-purpose, and very expensive models, we need to fine-tune small models on niche, specific tasks. So you can imagine one set of LoRA weights per task here.

27:12
Right, so what have we learned about LoRA in terms of a trade-off? First of all, if your sole concern is model quality, there's no way around full-parameter fine-tuning; it will still have this edge of one, two, or three percent of relative accuracy. And the difference in training time between the two is really not there: initially we thought LoRA must be much quicker, with fewer parameters, fewer things to checkpoint, and so on, but it turns out that if you look at the time it takes the model to converge, as in wall-clock time to a given perplexity, it's roughly the same between the two methods. What we really gained from LoRA is, first of all, the memory footprint, which can really unblock you to use smaller instance types in training, and second of all, the serving efficiency, which is just greatly enhanced.

28:05
Right, so here are all the learnings that we mentioned today. First of all, dataset quality is crucial. Training and inference data format consistency is crucial, and we used GPT-4 to set up a reliable evaluation pipeline. Then, LoRA is sensitive to the learning rate, and prompting the datasets helps with training stability. And lastly, LoRA's big advantage is really the serving efficiency.

28:34
One more thing here: there's another talk about these LLMs in production by our chief scientist Waleed, and that's going to be at 3:15 PM in Gate Ballroom B. Cool, thanks everyone for attending. Thank you.


Related Tags
AI, Language Models, Fine-tuning, Open Source, Ray Train, Machine Learning, Parameter Efficiency, Data Curation, Evaluation, Model Performance