EASIEST Way to Fine-Tune LLAMA-3.2 and Run it in Ollama
Summary
TL;DR: This video demonstrates how to fine-tune Meta's newly released Llama 3.2 models using the Unsloth library. It focuses on fine-tuning the 3-billion-parameter model and running it locally with Ollama. The tutorial walks through preparing datasets, adjusting parameters, and loading models for efficient on-device use. It also covers using LoRA adapters for fine-tuning and saving models for local deployment. The video emphasizes the ease of running smaller models locally and hints at future videos on the vision capabilities of the 11- and 90-billion-parameter models.
Takeaways
- 🚀 Meta released Llama 3.2 with four models, including lightweight and multimodal versions.
- 🧠 The lightweight models (1B and 3B) are ideal for on-device tasks, while the larger models (11B and 90B) focus on vision-related tasks.
- 🎯 Fine-tuning Llama 3.2 models can be done using the Unsloth library, which provides an efficient way to work with large language models.
- 💾 Llama Stack was introduced, offering a streamlined developer experience for deploying these models.
- 📊 The fine-tuning process in the video uses the FineTome dataset with 100,000 multi-turn conversation examples.
- ⚙️ Key hyperparameters include max sequence length (2048), precision (4-bit quantization), and batch size, all impacting memory usage and training performance.
- 🔧 LoRA adapters are used for efficient fine-tuning by training specific modules and merging them with the original model.
- 📜 The importance of using the correct prompt template for instruct and chat versions of Llama 3.2 is emphasized during fine-tuning.
- 💡 The trained model can be run locally using the Ollama tool, and fine-tuned models can be saved in GGUF format for local use.
- 💻 The example shows how fast the 3B model performs locally for tasks like generating Python code, highlighting the potential of running Llama models on-device.
Q & A
What is Llama 3.2, and what are its key features?
-Llama 3.2 is a new family of models released by Meta, consisting of four different models, including multimodal models designed for both language and vision tasks. The key features include lightweight 1- and 3-billion-parameter models, along with larger 11- and 90-billion-parameter models for advanced tasks. The smaller models can run on-device, while the larger ones are suited for complex tasks like vision.
Why are the 1 and 3 billion models significant?
-The 1 and 3 billion models are significant because they can run on-device, such as on smartphones. This makes them more accessible and practical for everyday use, providing high performance without requiring large computational resources.
What is Unsloth, and how is it used in the fine-tuning process?
-Unsloth is a framework used for fine-tuning language models like Llama 3.2. In this video, it is used to fine-tune a pre-trained model on a specific dataset. Unsloth simplifies the process by providing tools like the `FastLanguageModel` class for handling large language models efficiently.
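For reference, the install step pulls Unsloth's nightly build straight from GitHub; a minimal sketch (the exact pip extras tag varies between notebook revisions):

```bash
# Install Unsloth's nightly (latest) build straight from GitHub, as the
# video recommends; the exact pip extras tag varies between notebooks.
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"
```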
How does one prepare their dataset for fine-tuning a Llama model?
-To prepare a dataset for fine-tuning a Llama model, the dataset must be formatted to fit the model's prompt template. For Llama 3.1 and 3.2 instruct models, the template expects a role-based approach, with 'system', 'user', and 'assistant' roles. Any dataset used must be adjusted to match this structure.
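For illustration, here is roughly what that conversion looks like for a single record (field names follow the ShareGPT convention the video mentions; the conversation content echoes the video's Fibonacci example):

```python
# ShareGPT-style record, as found in datasets like FineTome:
sharegpt_example = {
    "conversations": [
        {"from": "human", "value": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8"},
        {"from": "gpt", "value": "13, 21, 34, 55, 89, ..."},
    ]
}

# Role-based format expected by the Llama 3.x instruct chat template:
role_based_example = {
    "conversations": [
        {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8"},
        {"role": "assistant", "content": "13, 21, 34, 55, 89, ..."},
    ]
}
```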
What is the role of LoRA adapters in fine-tuning, and why are they used?
-LoRA adapters are used to fine-tune small, targeted parts of the model instead of updating all of its parameters. This reduces memory usage and computational requirements, making fine-tuning more efficient, especially for large models. They allow targeted adjustments while keeping the original model weights intact.
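A sketch of attaching LoRA adapters with Unsloth, assuming a model already loaded via `FastLanguageModel`; the rank/alpha values mirror the 16-or-32 guidance from the video, and the target modules listed are the usual attention and MLP projections:

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters; only these small adapter matrices are trained,
# then merged back into the original weights after fine-tuning.
model = FastLanguageModel.get_peft_model(
    model,                       # base model loaded earlier
    r=16,                        # rank: more trainable params = more VRAM
    lora_alpha=16,               # scaling applied when merging adapters back
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```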
What parameters are important when loading the Llama 3.2 model for fine-tuning?
-Key parameters include the max sequence length, which should match the longest example in your dataset; the data type (float16, bfloat16, or automatic selection based on hardware); and quantization to reduce memory usage. For fine-tuning, 4-bit quantization is used to decrease the model's memory footprint.
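A minimal sketch of the loading step with those three parameters, using Unsloth's `FastLanguageModel` (the model name is the Unsloth-hosted variant mentioned in the video):

```python
from unsloth import FastLanguageModel

# Load the 3B instruct model with the three key parameters discussed.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # Unsloth-hosted variant
    max_seq_length=2048,  # match the longest example in your dataset
    dtype=None,           # None = auto-pick float16/bfloat16 for the GPU
    load_in_4bit=True,    # 4-bit quantization shrinks the memory footprint
)
```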
How is the supervised fine-tuning process handled using the TRL library?
-The TRL library from Hugging Face is used for supervised fine-tuning. It involves providing the model, tokenizer, and dataset, specifying the text column that holds the formatted conversations, and defining parameters like sequence length and batch size. The training loss is then computed from the model's outputs.
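A sketch of that setup, assuming the model, tokenizer, and formatted dataset from the earlier steps; the exact hyperparameters in the notebook may differ:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

# Supervised fine-tuning with TRL; model/tokenizer come from Unsloth,
# dataset is the chat-formatted FineTome split with a "text" column.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # column holding the formatted conversations
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,            # short demo run; raise for real training
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()
```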
What is the significance of 'max steps' and 'epochs' in the training process?
-Max steps and epochs control how long the model trains on the dataset. An epoch is one complete pass through the entire dataset, while max steps caps the total number of training steps regardless of epochs. Adjusting these balances training time against dataset size and desired output quality.
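The arithmetic, using the video's small example and then FineTome's numbers (the batch size of 2 is illustrative, not taken from the notebook):

```python
# Worked example from the video: steps per epoch = dataset size / batch size.
assert 100 // 2 == 50            # 100 examples, batch size 2 -> 50 steps

# Same arithmetic for FineTome:
dataset_size = 100_000
batch_size = 2
steps_per_epoch = dataset_size // batch_size   # 50,000 steps per epoch
max_steps = 60                                 # the demo's training budget
print(f"{max_steps / steps_per_epoch:.2%} of one epoch")  # 0.12%
```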
What are the benefits of running a fine-tuned Llama 3.2 model locally using Ollama?
-Running a fine-tuned Llama 3.2 model locally allows faster and more private inference without relying on external servers. This makes the model more accessible, especially for lightweight versions like the 3-billion-parameter model, which can run efficiently on local devices.
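As a rough sketch of that local workflow (the file and model names here are placeholders, not from the video):

```bash
# Modelfile contents (placeholder file name) pointing at the exported GGUF:
#   FROM ./model_3b.fp16.gguf
ollama create fine-llama -f ./Modelfile   # register the fine-tuned model
ollama run fine-llama                     # chat with it entirely on-device
```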
What are the next steps for fine-tuning larger models like the 11 and 90 billion versions?
-For larger models like the 11 and 90 billion versions, fine-tuning will involve handling their multimodal capabilities, particularly for vision tasks. These models require more resources and have additional complexities due to their vision component, but future videos will focus on these applications.
Outlines
🚀 Meta Releases Llama 3.2: Overview of New Models
Meta recently introduced Llama 3.2, a new family of four models, including multimodal ones that are impressive for both language and vision tasks. This video covers how to fine-tune Llama 3.2 models using Unsloth and then run the fine-tuned model locally using Ollama. Meta released both lightweight models (1 and 3 billion parameters) and larger multimodal models (11 and 90 billion), departing from the usual 7-or-8-billion sizes. The lightweight models are ideal for running on-device, while the 11 and 90 billion models are suited for vision tasks, which will be explored in a future video.
🎯 Fine-tuning Llama 3.2 with Unsloth
The video walks through how to fine-tune the smaller Llama 3.2 models using Unsloth. It starts by explaining the need for a dataset, such as the FineTome dataset with 100,000 examples of multi-turn conversations. After setting up the environment by installing the nightly version of Unsloth, the tutorial explains the model loading process, the use of LoRA adapters for more efficient fine-tuning, and how to adjust parameters like max sequence length and data type. The 3-billion-parameter model is used because it can run on-device; larger models may require more resources. The tutorial also emphasizes the importance of structuring the dataset according to the model's prompt format.
📝 Prompt Formatting and Data Preparation for Fine-Tuning
The importance of matching your dataset's prompt format with the format of Llama 3.2's instruct version is discussed. The data must follow a specific role-based template, converting conversations to a system-user-assistant format. Using functions from Unsloth, such as `get_chat_template`, ensures correct formatting. The video also explains how the Llama 3.1 and 3.2 models handle system messages, including adding the model's cutoff date, and how to mask unnecessary system prompts during training. The TRL library from Hugging Face is used for supervised fine-tuning, with parameters like sequence length, batch size, and learning rate highlighted as important tuning aspects.
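A minimal sketch of that conversion, using the helpers from Unsloth's chat_templates module as in the official notebooks (the `dataset` variable is assumed to be the loaded FineTome split):

```python
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Apply the Llama 3.1 chat template (shared by 3.2) to the tokenizer...
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# ...and convert ShareGPT-style records ("from"/"value") into the
# role-based ("role"/"content") format the template expects.
dataset = standardize_sharegpt(dataset)

# Render each conversation into a single "text" column for training.
def formatting_prompts_func(examples):
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False,
                                      add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
```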
📊 Fine-Tuning Parameters and Optimization Techniques
Fine-tuning parameters like the number of epochs, max steps, and batch size are key to controlling how long the model trains and how well it performs. Because only 60 training steps are run on a dataset as large as FineTome's 100,000 examples, the model won't achieve optimal results in this example. The video also covers computing the training loss on the model's outputs rather than its inputs, and how to use resources efficiently when fine-tuning on local devices. Adjustments to learning rates and batch sizes can significantly affect the speed and quality of training.
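For the output-only loss computation mentioned above, Unsloth's notebooks use a helper that masks everything except the assistant turns; a sketch, assuming a trainer built with TRL's SFTTrainer and the Llama 3 chat format:

```python
from unsloth.chat_templates import train_on_responses_only

# Compute the loss only on assistant tokens; user and system turns are
# masked out so the model learns from its outputs, not its inputs.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```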
💾 Saving and Running the Fine-Tuned Model Locally with Ollama
After training, the model can be saved locally in GGUF format for deployment with Ollama. The video walks through saving the model using the `save_pretrained_gguf` function and explains the setup required to run it locally. It highlights the fast performance of the 3-billion-parameter model when run locally, using Ollama commands to create and run models. Example outputs show the model quickly generating responses to user prompts, demonstrating the efficiency of running the fine-tuned model entirely on local hardware.
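A minimal sketch of the export step, assuming the trained model and tokenizer from the previous steps (the output name is a placeholder):

```python
# Export merged weights to GGUF at 16-bit precision, as in the video.
# This builds llama.cpp under the hood, so it can take quite a while.
model.save_pretrained_gguf("model_3b", tokenizer, quantization_method="f16")
```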
🎥 Conclusion and Upcoming Vision Models for Fine-Tuning
The video wraps up by noting that while the 1 and 3 billion models can be fine-tuned and run locally with the approach shown, fine-tuning the 11 and 90 billion models, which include a vision component, will require different techniques. Future videos will focus on fine-tuning these larger models and their applications, especially for vision-based retrieval-augmented generation (RAG) tasks. Viewers are encouraged to subscribe for more content on vision models and Llama 3.2's capabilities.
Keywords
💡Llama 3.2
💡Fine-tuning
💡Unsloth
💡Multimodal models
💡Quantization
💡Model parameters
💡LoRA adapters
💡Prompt template
💡Supervised fine-tuning
💡Hugging Face TRL
Highlights
Meta released Llama 3.2, a new family of four models, including multimodal ones, optimized for both language and vision tasks.
The Llama 3.2 family includes models of different sizes: 1, 3, 11, and 90 billion parameters, with Meta moving away from the traditional 7-8 billion models.
The smaller models (1 and 3 billion parameters) are notable because they can run on-device, making them accessible for local deployment.
The 11 billion and 90 billion models are multimodal, designed for vision tasks, though they will be covered in more depth in future videos.
Llama 3.2 comes with Llama Stack, Meta's opinionated developer experience for easier deployment of models.
Unsloth is used for fine-tuning Llama 3.2, which can be customized with your own dataset to make the model more task-specific.
A dataset with multi-turn conversations, such as the FineTome dataset (100,000 examples), is used for fine-tuning the model.
Unsloth lets you perform low-rank adaptation (LoRA) to efficiently fine-tune the model by training small adapter modules instead of doing full fine-tuning.
LoRA parameters, such as the 'rank' and 'LoRA alpha,' impact both the fine-tuning performance and the required memory resources.
It is essential to ensure that the prompt template of your dataset matches the format expected by Llama 3.2, especially when fine-tuning instruct models.
You can use Hugging Face's TRL library to perform supervised fine-tuning by providing the model, tokenizer, dataset, and a customized prompt format.
Key hyperparameters such as max sequence length, learning rate, and batch size significantly impact the fine-tuning process, influencing both model performance and training time.
Llama 3.2's system instruction includes details about the model's training cutoff date (December 2023), which may appear in responses.
After fine-tuning, you can save the model in GGUF format and run it locally using Ollama for on-device inference, offering fast, efficient execution.
The method demonstrated can fine-tune both 1 billion and 3 billion models for fast on-device use, while larger models (11B, 90B) require different handling due to their vision capabilities.
Transcripts
Last week Meta released Llama 3.2, which is a new family of four different models, including multimodal models, and they're pretty impressive both for language and vision tasks for their respective sizes. But you know what's better than that? Your custom fine-tuned Llama 3.2. That's exactly what we're going to learn in this video: we will use Unsloth for fine-tuning, then I'll show you how you can run that fine-tuned model locally using Ollama, because what's the point of a fine-tuned model if you can't run it locally? But before then, let's have a quick look at the release blog post. This new release has two sets of models: one is lightweight, with 1 and 3 billion parameters, and the other set is multimodal, with 11 and 90 billion. There is no 405B this time. Meta is moving away from the standard 7 or 8 billion models; now they have an 11 and a 90 billion model instead of an 8 or 70 billion model. But I think the most interesting ones are the 1 and 3 billion models, because you can run them on-device. We will look at the 11 and 90 billion models for vision tasks in another video. Apart from these models,
Meta has also released Llama Stack, which is their opinionated version of how the developer experience should look. It's great to see that these model providers are now building tech stacks for deployment. Let's talk about how you can fine-tune one of these smaller models on your own dataset, and then I'll show you how you can run it locally using Ollama.
To fine-tune Llama 3.2, we will use the official notebook from the Unsloth team. I have covered variations of this notebook in my earlier videos for fine-tuning other variants of Llama, so this is going to be a quick recap of those notebooks. First we need a dataset to fine-tune the model on. For this example we're using the FineTome dataset, which has 100,000 examples, so it's a relatively huge dataset, and it has multi-turn conversations. This dataset is collected from multiple different sources, so I think it's a very good candidate if you are fine-tuning an LLM in general, but if you're fine-tuning this model for your own specific task, you will just need to provide your own dataset, and I'll show you later on how you can structure it. First we need to install Unsloth; they recommend using the nightly version, which is basically the latest version. Unsloth uses a FastLanguageModel class for dealing with LLMs. We're going to load the Llama 3.2 3 billion instruct model. We're using this model because it's a relatively smaller model that you can potentially run on-device, such as on a smartphone. Another thing to highlight is that I'm using the Unsloth version; you can also use the Llama version directly, but you'll need to provide your Hugging Face token and accept their terms and conditions. The 11 billion and 90 billion models are not available in all regions, and that has to do with their vision capabilities, so you just need to be careful. Unsloth currently does not support vision models yet, but hopefully they will add support soon. When you're
loading the model, you need to define three different parameters. The first one is the max sequence length; in our case we are setting it to 2048. This number is dependent on your training dataset: look at your training examples, see the maximum sequence length in your dataset, and I recommend setting it to that, but setting it to a higher value will also need more GPU VRAM, so you need to be careful of that. For the data type, you can set float16 or bfloat16, but if you keep it None it will automatically select one depending on your hardware. We're going to be using 4-bit quantization to reduce the memory usage, or memory footprint. So here we're loading both the model as well as the tokenizer. Next I'm adding LoRA adapters. We are not using full fine-tuning even though the model is pretty small; we're adding LoRA adapters. These are the different modules that we are targeting: we train completely separate modules and then merge them with the original model weights. There are a couple of other things to keep in mind. One is the r, or rank; this determines how many parameters are going to be in your LoRA adapter. If you set it to a high number, the fine-tuning performance is usually going to be better, but you're also fine-tuning a larger number of parameters in your LoRA adapter, which means you will need more resources in terms of VRAM to train the LoRA adapters. Usually 16 or 32 provides a good compromise between the memory footprint and the performance. Another thing is the impact of this LoRA when merging it back into the original weights of the model; that is set through the LoRA alpha. Now, some points on the prompt
Alpha now some points on the prompt
template so here's the PR template that
the Lama 3.1 and 3.2 uses you need to
make sure that your data set that you're
providing in order to find T the model
actually follows this specific prompt
template because we're using the
instruct version of the models for fine
tuning if you're fine-tuning the base
model you can provide your own template
but if you're working with instructor
chat version then you have to follow the
template used by the model itself the
promt template
expects role and content but here you
can see that the data set we're using
actually uses another format which is
from human and then I think there is
from GP team right so it uses a
different promt templat so we need to
adjust this prompt template and for that
you can use the get chat template class
or function from unslot basically we
provide the token use the prompt
template from Lama 3.1 which is uh
similar to 3.2 and that will take all
the data sets and convert it to our
specific prompt template so here we're
loading the data set now we need to go
from this which is from system and then
you provide the value or from human or
from GPT to the role based approach
everything should be converted to RO
system Ro user androll assistant we do
that through the standardized share GPT
uh function that we just created now if
you look at um some example
conversations here you can see that we
went to the content so here's the
content then here's the role role is
user and that's the uh question asked by
the user then we have a role of
assistant and this is the response
generated by the assistant when you're
are formatting your own data set you
will have to follow this specific prompt
template in order to fine-tune a Lama
3.1 instruct version another thing is
that the Lama 3.1 instruct defaults chat
template adds this specific sentence in
the system instruction so it's actually
telling the model that this cut off
training date was in December 2023 and
it adds today's date to be uh July 26 so
if you see something like this in
responses from the model don't be
alarmed because that's just part of the
system instruction and later on they
actually masked this for now in order to
train the model, we are using the TRL library from Hugging Face, and we're going to be using the supervised fine-tuning trainer, because we are doing supervised fine-tuning in this case. So we provide the model and the tokenizer (these are coming from Unsloth), then we provide our dataset. We also tell it which column to use as our already-formatted prompt template (we added a text column to the data) and the maximum sequence length in the training dataset. Now here are some other specific parameters. A couple of things I want to highlight: if you set the number of epochs, for example to one, it will go through the whole dataset exactly once during training, but 100,000 examples is a pretty huge dataset, so that's going to take a while. That's why we set the max steps to 60; you can either set the max steps or you can set the number of epochs. Now, what's the relationship between the two? That is determined by the batch size. To get the total number of steps in an epoch, you divide the size of the dataset by the batch size; for example, if you have 100 examples and you divide by two, you will get a maximum of 50 steps in the epoch. We're just running it for 60 steps, which is a fraction of the total number of steps possible for 100,000 examples. The reason we do this is that we don't want to run it for a long time; I just want to show you an example, and that's why you probably are not going to see a greatly trained model. To get really good training output, you definitely want to run it for a lot longer. The learning rate determines the speed of convergence: if you set it to a high number, the training speed is going to be faster, but the training might not converge. You usually want to find a sweet spot where the learning rate is small enough that it converges, but that will also take much longer to train. Okay, one more thing: you want to train the model on the outputs, not on the inputs. That's why you want to calculate the loss of the model on the output from the assistant, not on the inputs from the user. The model should see the user input, generate a response, and then compare the output with the original or gold-standard output, the ground truth, and that's where you compute the loss. This section takes care of that: it forces the model to only use the output for computation of the training loss, or the test loss if you have a test dataset. Now you can look at how the tokenized version of the dataset looks. You can see that we have clearly added the system role, here is the well-formatted user input, and then we have the well-formatted assistant response; this is the dataset that we will use to train our model. If you want to get rid of this part, which is the system message part, you can mask it; here we're masking it, and now you can see that you don't really see the original system message, you only see the output the model is supposed to generate. Okay,
next we call the train function on the trainer that we created. You can see that the loss goes down, then comes up again; the reason is that we're running it for a very small number of steps. We can probably play around with the learning rate as well, which will control the speed of convergence. These are different parameters that you need to play around with. If you're using bigger batch sizes, you can set the learning rate to a relatively higher value, but bigger batch sizes also depend on the available GPU VRAM that you have, so there has to be a compromise between these hyperparameters that you're working with. Okay, so after this training, you can see that if we run this specific prompt on the trained model, here is the response that we get. Here we see the system message, but we'll have to mask that ourselves. In terms of the user input, here it is: continue the Fibonacci sequence. So we provide the Fibonacci sequence, and the response generated by the model is here. You can also stream this if you want; here's an output of the streamed response, which does the same thing but in a streaming fashion. Okay, once you train the model, you can either push it to the hub or store it locally. I'm mostly interested in how to store the GGUF version of the model, because I want to load it in Ollama and run it locally. For that to work, you just need to call save_pretrained_gguf, provide the model name (I'm calling it model 3 billion), provide the tokenizer, and the level of quantization. Since it's a relatively smaller model, I wanted to run it in 16-bit floating point precision. Keep in mind this step will take quite a long time, because it has to first download and build llama.cpp and then convert the model to GGUF format. So here's the model that I downloaded from Google Colab; if you run the training locally, you are going to see unsloth fp16 gguf. I downloaded the model from Google Colab. Now let me show you the rest of the process. Next, let me show you how to run that trained model locally
using Ollama. Ollama uses the concept of a Modelfile, which is basically a set of configurations that you need to provide for Ollama to use a model locally. There are a number of different things you can use: there's an instruction called FROM, where you tell it which model to use; you can set different parameters such as temperature, max context window, and so on; and you can also provide the full prompt template for the model. Here is a quick example: if you want to use Llama 3.2 with different configurations than the default, you're going to say FROM llama3.2; here they're changing the temperature to one, the max context window is changed to 4096, and you can also provide a simple system instruction. If you go to any model on Ollama, you can see this template; if you click on it, this is the Modelfile used by any model available on Ollama. I have downloaded the GGUF file that was created after fine-tuning the model in the Google Colab notebook. You just want to look at the file that is the GGUF, downloaded here, and then I created another file called Fine Llama. In here I'm saying FROM and then providing that model name with .gguf at the end, so this is basically the Modelfile that we're going to be using. We can also include a TEMPLATE that defines the prompt template, but since it's already in the tokenizer, I don't need to do that. You can also define the system prompt, but in this case we want to mask it, so I'm not going to add that either. Now you need to have Ollama up and running. After that, you need to provide some details to create this model in Ollama: we type the command ollama create, then what you want the model to be called, then you can use the -f parameter, and you need to provide the path of the Modelfile that we created; it's in the same directory. When you run it, this will start transferring the data, and if everything goes well, it will create the model for us. It's using the template from Llama 3 instruct; seems like everything is successful. Now we can run our model, but before that, let me show you that it shows up in the model list. Here we have the Fine Llama; this is basically the model that we just created. I have a whole bunch of other models that I have already downloaded. Now, in order to run this model, all we need to do is just type ollama run, and just like any other Ollama model, we just need to
provide the name. Now, it's a 3 billion model, so it's going to be extremely fast. If I say hi, you can see that it generates responses pretty well and pretty quickly. All right, so I'm going to ask it to write a program in Python to move files from S3 to a local directory, and you can see it's really fast, because it's just a 3 billion model that is running completely locally, and that's the model that we just fine-tuned. Okay, so this was a quick
video on how to fine-tune Llama 3.2 using Unsloth and then run it locally on your own machine using Ollama. I hope this was helpful. I'll put a link to the Google Colab in the video description. In this video I only focused on the 3 billion; the same approach will apply to the 1 billion model. For the 11 and 90 billion models, the approach is a little different, because there is an adapter for the vision component, so the same approach probably is not going to apply, but I'm going to be creating some videos specifically focused on the vision models, because I think there are some great applications there, specifically for vision-based RAG, which is a topic I'm personally interested in. If that interests you, make sure to subscribe to the channel. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
More related videos
Llama 3.2 is HERE and has VISION 👀
Lessons From Fine-Tuning Llama-2
Fine-tuning Gemini with Google AI Studio Tutorial - [Customize a model for your application]
EASIEST Way to Fine-Tune a LLM and Use It With Ollama
New Llama 3.1 is The Most Powerful Open AI Model Ever! (Beats GPT-4)
Using Ollama to Run Local LLMs on the Raspberry Pi 5