EASIEST Way to Fine-Tune LLAMA-3.2 and Run it in Ollama

Prompt Engineering
29 Sept 2024 · 17:35

Summary

TL;DR: This video demonstrates how to fine-tune Meta's newly released Llama 3.2 models using the Unsloth library, focusing on the 3 billion-parameter model and running it locally with Ollama. The tutorial walks through preparing datasets, adjusting parameters, and loading models for efficient on-device use. It also covers using LoRA adapters for fine-tuning and saving models for local deployment. The video emphasizes the ease of running smaller models locally and hints at future videos on the vision capabilities of the 11 and 90 billion-parameter models.

Takeaways

  • 🚀 Meta released Llama 3.2 with four models, including lightweight and multimodal versions.
  • 🧠 The lightweight models (1B and 3B) are ideal for on-device tasks, while the larger models (11B and 90B) focus on vision-related tasks.
  • 🎯 Fine-tuning Llama 3.2 models can be done with the Unsloth library, which provides an efficient way to work with large language models.
  • 💾 Llama Stack was introduced, offering a streamlined developer experience for deploying these models.
  • 📊 The fine-tuning process in the video uses the FineTome dataset with 100,000 multi-turn conversation examples.
  • ⚙️ Key settings include max sequence length (2048), 4-bit quantization, and batch size, all of which affect memory usage and training performance.
  • 🔧 LoRA adapters enable efficient fine-tuning by training specific modules that are later merged with the original model.
  • 📜 Using the correct prompt template for the instruct and chat versions of Llama 3.2 is emphasized during fine-tuning.
  • 💡 The trained model can be run locally using the Ollama tool, and fine-tuned models can be saved in GGUF format for local use.
  • 💻 The example shows how fast the 3B model performs locally on tasks like generating Python code, highlighting the potential of running Llama models on-device.

Q & A

  • What is Llama 3.2, and what are its key features?

    -Llama 3.2 is a new family of models released by Meta, consisting of four different models, including multimodal models designed for both language and vision tasks. The key features include lightweight 1 and 3 billion-parameter models, along with larger 11 and 90 billion-parameter models for advanced tasks. The smaller models can run on-device, while the larger ones are suited for complex tasks like vision.

  • Why are the 1 and 3 billion models significant?

    -The 1 and 3 billion models are significant because they can run on-device, such as on smartphones. This makes them more accessible and practical for everyday use, providing high performance without requiring large computational resources.

  • What is 'Unsloth' and how is it used in the fine-tuning process?

    -Unsloth is a framework used for fine-tuning language models like Llama 3.2. In this video, it is used to fine-tune a pre-trained model on a specific dataset. Unsloth simplifies the process by providing tools like its `FastLanguageModel` class for handling large language models efficiently.

  • How does one prepare a dataset for fine-tuning a Llama model?

    -To prepare a dataset for fine-tuning a Llama model, the dataset must be formatted to fit the model's prompt template. For the Llama 3.1 and 3.2 instruct models, the template expects a role-based structure with 'system', 'user', and 'assistant' roles. Any dataset used must be adjusted to match this structure, as in the sketch below.
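
For illustration, here is roughly what that conversion looks like; a minimal sketch in Python, with the "from"/"value" fields following the ShareGPT convention used by datasets like FineTome and the "role"/"content" fields following what the Llama chat template expects:

```python
# ShareGPT-style record, as stored in datasets like FineTome:
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8"},
        {"from": "gpt", "value": "The next numbers are 13, 21, 34."},
    ]
}

# The same conversation converted to the role-based structure
# expected by the Llama 3.1/3.2 instruct template:
role_based_record = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8"},
    {"role": "assistant", "content": "The next numbers are 13, 21, 34."},
]
```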

  • What is the role of LoRA adapters in fine-tuning, and why are they used?

    -LoRA adapters fine-tune small, targeted parts of the model instead of updating all of its parameters. This reduces memory usage and computational requirements, making fine-tuning more efficient, especially for large models. They allow targeted adjustments while keeping the original model weights intact, as in the sketch below.
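
A minimal sketch of attaching LoRA adapters with Unsloth (the rank, alpha, and target-module values mirror the video's guidance; exact defaults may differ by version):

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters to a model already loaded with Unsloth.
# Only these small adapter modules are trained; the base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,             # rank: higher = more trainable parameters, more VRAM
    lora_alpha=16,    # scales the adapters' contribution when merged back
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
)
```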

  • What parameters are important when loading the Llama 3.2 model for fine-tuning?

    -Key parameters include the max sequence length, which should match the longest examples in your dataset; the data type (FP16, FP8, or automatic selection based on hardware); and quantization to reduce memory usage. For fine-tuning, 4-bit quantization is used to decrease the model's memory footprint.
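
A minimal sketch of the loading step, assuming Unsloth's `FastLanguageModel` API (the model id is illustrative):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,  # match the longest example in your training data
    dtype=None,           # None = auto-select FP16/BF16 based on hardware
    load_in_4bit=True,    # 4-bit quantization to shrink the memory footprint
)
```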

  • How is the supervised fine-tuning process handled using the TRL library?

    -The TRL library from Hugging Face is used for supervised fine-tuning. It involves providing the model, tokenizer, and dataset, specifying which column holds the formatted prompts, and defining parameters like sequence length and batch size. The training loss is then calculated on the model's outputs.
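
A sketch of that setup, assuming the `SFTTrainer` signature used in Unsloth's notebooks (newer TRL releases move some of these arguments into `SFTConfig`); the hyperparameters are the demo values from the video, not tuned ones:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # the column holding the formatted conversations
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=60,            # short demo run, far less than one epoch
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()  # reports the training loss as it runs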

  • What is the significance of 'max steps' and 'epochs' in the training process?

    -Max steps and epochs control how long the model trains on the dataset. An epoch is one complete pass through the entire dataset, while max steps caps the total number of training steps regardless of epoch boundaries. Adjusting these balances training time against the size of the dataset and the desired output quality.
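
The relationship is simple arithmetic (a sketch, ignoring gradient accumulation):

```python
dataset_size = 100_000      # FineTome examples
batch_size = 2              # per-device batch size from the example
steps_per_epoch = dataset_size // batch_size     # 50,000 steps for one full pass
max_steps = 60              # the demo run
fraction_of_epoch = max_steps / steps_per_epoch  # 0.0012, a tiny fraction
```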

  • What are the benefits of running a fine-tuned Llama 3.2 model locally using Ollama?

    -Running a fine-tuned Llama 3.2 model locally allows faster and more private inference without relying on external servers. This makes the model more accessible, especially for lightweight versions like the 3 billion-parameter model, which can run efficiently on local devices. A sketch of the deployment step follows.
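
As a sketch of the deployment the video describes (the file and model names are illustrative; the FROM line must point at your exported GGUF file):

```
# Modelfile
FROM ./unsloth.F16.gguf
```

```
ollama create fine-llama -f ./Modelfile   # register the model with Ollama
ollama list                               # confirm it shows up
ollama run fine-llama                     # chat with the fine-tuned model
```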

  • What are the next steps for fine-tuning larger models like the 11 and 90 billion versions?

    -For larger models like the 11 and 90 billion versions, fine-tuning will involve handling their multimodal capabilities, particularly for vision tasks. These models require more resources and have additional complexities due to their vision component, but future videos will focus on these applications.

Outlines

00:00

🚀 Meta Releases Llama 3.2: Overview of New Models

Meta recently introduced Llama 3.2, a new family of four models, including multimodal ones that are impressive for both language and vision tasks. This video covers how to fine-tune Llama 3.2 models using Unsloth and then run the fine-tuned model locally using Ollama. Meta released both lightweight models (1 and 3 billion parameters) and larger multimodal models (11 and 90 billion), departing from the usual 7 or 8 billion. The lightweight models are ideal for running on devices, while the 11 and 90 billion models are suited for vision tasks, which will be explored in a future video.

05:03

🎯 Fine-tuning Llama 3.2 with Unsloth

The video walks through how to fine-tune the smaller Llama 3.2 models using Unsloth. It starts by explaining the need for a dataset, such as the FineTome dataset with 100,000 examples of multi-turn conversations. After setting up the environment by installing the nightly version of Unsloth, the tutorial explains the model-loading process, the use of LoRA adapters for more efficient fine-tuning, and how to adjust parameters like max sequence length and data type. The 3 billion-parameter model is used because it can run on-device; larger models require more resources. The tutorial also emphasizes the importance of structuring the dataset according to the model's prompt format.

10:06

πŸ“ Prompt Formatting and Data Preparation for Fine-Tuning

The importance of matching your dataset's prompt format to the instruct version of Llama 3.2's format is discussed. The data must follow a specific role-based template, converting conversations to a system-user-assistant format. Using functions from Unsloth, such as `get_chat_template`, ensures correct formatting. The video also explains how the Llama 3.1 and 3.2 models handle system messages, including the template's training-cutoff-date sentence and masking unnecessary system prompts during training. The TRL library from Hugging Face is used for supervised fine-tuning, with parameters like sequence length, batch size, and learning rate highlighted as important tuning aspects.
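
A sketch of that formatting step, assuming the helpers from `unsloth.chat_templates` used in Unsloth's Llama 3 notebooks:

```python
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Apply the Llama 3.1 chat template (shared by 3.2) to the tokenizer.
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# Convert ShareGPT-style "from"/"value" records to "role"/"content".
dataset = standardize_sharegpt(dataset)

def formatting_prompts_func(examples):
    # Render each conversation into a single templated string.
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
```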

15:07

📊 Fine-Tuning Parameters and Optimization Techniques

Fine-tuning parameters like the number of epochs, max steps, and batch size control how long the model trains and how well it performs. Because this example runs only 60 training steps on a dataset as large as FineTome's 100,000 examples, the model won't achieve optimal results here. The video also covers computing the training loss on the model's outputs rather than its inputs, and ensuring efficient resource usage when fine-tuning on local devices. Adjustments to learning rate and batch size can significantly affect the speed and quality of training.
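
A sketch of the loss-masking step, assuming Unsloth's `train_on_responses_only` helper: it masks the labels for everything except the assistant turns, so the loss is computed only on the model's outputs.

```python
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    # Llama 3.x header tokens that open the user and assistant turns;
    # everything outside the assistant responses is excluded from the loss.
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```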

💾 Saving and Running the Fine-Tuned Model Locally with Ollama

After training, the model can be saved locally in GGUF format for deployment with Ollama. The video walks through saving the model using the `save_pretrained_gguf` function and explains the setup required to run it locally. It highlights the fast performance of the 3 billion-parameter model when run locally, using Ollama commands to create and run models. Example outputs show the model quickly generating responses to user prompts, demonstrating the efficiency of running the fine-tuned model entirely on local hardware.
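
A sketch of the export step, assuming Unsloth's `save_pretrained_gguf` helper (the output name is illustrative; the first run also downloads and builds llama.cpp, which takes a while):

```python
# Export the merged model as GGUF in 16-bit floats, as in the video.
model.save_pretrained_gguf(
    "model_3b",                 # output directory / model name
    tokenizer,
    quantization_method="f16",  # keep FP16; e.g. "q4_k_m" yields smaller files
)
```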

🎥 Conclusion and Upcoming Vision Models for Fine-Tuning

The video wraps up by noting that while the 1 and 3 billion models can be fine-tuned and run locally with the approach shown, fine-tuning the 11 and 90 billion models, which include a vision component, will require different techniques. Future videos will focus on fine-tuning these larger models and their applications, especially for vision-based retrieval-augmented generation (RAG) tasks. Viewers are encouraged to subscribe for more content on vision models and Llama 3.2's capabilities.

Keywords

💡Llama 3.2

Llama 3.2 is Meta's newly released family of AI models, which includes both lightweight and multimodal versions. These models are optimized for language and vision tasks, offering notable improvements in efficiency and performance. In the video, the presenter focuses on fine-tuning the 3 billion-parameter Llama 3.2 model, highlighting its ability to run locally on smaller devices.

💡Fine-tuning

Fine-tuning refers to the process of adapting a pre-trained model to a specific task or dataset. In the video, fine-tuning is done with the Unsloth library and demonstrated on the 3 billion-parameter Llama 3.2 model. The purpose of fine-tuning is to enhance the model's performance for custom use cases, like adapting a general AI model to perform better in specific contexts or on unique datasets.

💡Unsloth

Unsloth is a library used for fine-tuning AI models such as Llama 3.2, as discussed in the video. It provides an efficient environment for adapting large language models, allowing developers to train on their own datasets. The presenter walks through how Unsloth facilitates this process with examples from the Llama 3.2 model.

💡Multimodal models

Multimodal models refer to AI models capable of processing and understanding multiple types of data, such as text, images, or video. In the context of Llama 3.2, Meta has released multimodal models with 11 and 90 billion parameters. These models are particularly useful for complex tasks that require both language and visual comprehension, though the video focuses primarily on language-based fine-tuning.

💡Quantization

Quantization is a technique used to reduce the computational and memory demands of AI models by compressing the model weights, typically by reducing the precision of numbers used in calculations. In the video, the presenter discusses the 4-bit quantization technique used for the Llama 3.2 model, which helps reduce memory usage while maintaining model performance, making it feasible to run on devices with limited hardware.

💡Model parameters

Model parameters are the internal variables that the model learns during training, determining its behavior and outputs. Llama 3.2 is available in several sizes, including models with 1, 3, 11, and 90 billion parameters. The video discusses how smaller models like the 3 billion-parameter version can be run locally on devices like smartphones, making them accessible for practical use in limited-resource environments.

💡LoRA adapters

LoRA adapters are small additional modules attached to pre-trained models to enable efficient fine-tuning without modifying all of the model's weights. In the video, the presenter explains that LoRA adapters allow modular, memory-efficient training by targeting specific parts of the model. This technique adjusts the model for a particular task without requiring extensive hardware resources.

💡Prompt template

A prompt template is the structure or format in which input data is presented to the model during fine-tuning or inference. In the case of Llama 3.2, the video discusses the importance of following a specific prompt template, especially when working with instruct models, to ensure that the model understands the inputs correctly. It also highlights the need to adapt datasets to fit these templates.
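
For reference, the Llama 3.x instruct template wraps each turn in special header tokens, roughly like this (a simplified sketch of one exchange):

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The next numbers are 13, 21, 34.<|eot_id|>
```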

💡Supervised fine-tuning

Supervised fine-tuning involves training a model on labeled data where both inputs and correct outputs are known, enabling the model to learn the correct associations. The video demonstrates supervised fine-tuning of Llama 3.2 using Hugging Face's TRL library on the FineTome dataset, which contains 100,000 multi-turn conversations.

💡Hugging Face TRL

Hugging Face TRL (Transformer Reinforcement Learning) is a library for training and fine-tuning large language models. In the video, the presenter uses TRL's supervised fine-tuning trainer to train the Llama 3.2 model on the formatted dataset, demonstrating how the library simplifies the training process. TRL supports several fine-tuning techniques, including supervised learning.

Highlights

Meta released Llama 3.2, a new family of four models, including multimodal ones, optimized for both language and vision tasks.

The Llama 3.2 family includes models of different sizes: 1, 3, 11, and 90 billion parameters, with Meta moving away from the traditional 7-8 billion models.

The smaller models (1 and 3 billion parameters) are notable because they can run on-device, making them accessible for local deployment.

The 11 billion and 90 billion models are multimodal, designed for vision tasks, though they will be covered in more depth in future videos.

Llama 3.2 comes with Llama Stack, Meta's opinionated developer experience for easier deployment of models.

Unsloth is used for fine-tuning Llama 3.2 and can be pointed at your own dataset to make the model more task-specific.

A dataset with multi-turn conversations, such as the FineTome dataset (100,000 examples), is used for fine-tuning the model.

Unsloth lets you apply low-rank adaptation (LoRA) to fine-tune the model efficiently by training added adapter modules instead of doing full fine-tuning.

LoRA parameters, such as the 'rank' and 'LoRA alpha,' impact both the fine-tuning performance and the required memory resources.

It is essential to ensure that the prompt template of your dataset matches the format expected by Llama 3.2, especially when fine-tuning instruct models.

You can use Hugging Face's TRL library to perform supervised fine-tuning by providing the model, tokenizer, dataset, and a customized prompt format.

Key hyperparameters such as max sequence length, learning rate, and batch size significantly impact the fine-tuning process, influencing both model performance and training time.

Llama 3.2's system instruction includes details about the model's training cutoff date (December 2023), which may appear in responses.

After fine-tuning, you can save the model in GGUF format and run it locally using Ollama for on-device inference, offering fast, efficient execution.

The method demonstrated can fine-tune both 1 billion and 3 billion models for fast on-device use, while larger models (11B, 90B) require different handling due to their vision capabilities.

Transcripts

00:00

Last week, Meta released Llama 3.2, a new family of four different models, including multimodal models, and they're pretty impressive for both language and vision tasks at their respective sizes. But you know what's better than that? Your custom fine-tuned Llama 3.2. That's exactly what we're going to learn in this video: we will use Unsloth for fine-tuning, then I'll show you how you can run that fine-tuned model locally using Ollama, because what's the point of a fine-tuned model if you can't run it locally? But before that, let's have a quick look at the release blog post.

00:40

This new release has two sets of models: one is lightweight, with 1 and 3 billion parameters, and the other set is multimodal, with 11 and 90 billion. There is no 405B this time. Meta is moving away from the standard 7 or 8 billion models; now they have an 11 and a 90 billion instead of an 8 or a 70 billion model. But I think the most interesting ones are the 1 and 3 billion models, because you can run them on-device. We will look at the 11 and 90 billion models for vision tasks in another video. Apart from these models, Meta has also released Llama Stack, which is their opinionated version of what the developer experience should look like. It's great to see that these model providers are now building tech stacks for deployment.

01:33

Let's talk about how you can fine-tune one of these smaller models on your own dataset, and then I'll show you how you can run it locally using Ollama. To fine-tune Llama 3.2, we will use the official notebook from the Unsloth team. I have covered variations of this notebook in my earlier videos on fine-tuning other variants of Llama, so this is going to be a quick recap of those notebooks. First, we need a dataset to fine-tune the model on. For this example, we're using the FineTome dataset, which has 100,000 examples, so it's a relatively huge dataset, and it has multi-turn conversations. This dataset is collected from multiple different sources, so I think it's a very good candidate if you're fine-tuning an LLM in general, but if you're fine-tuning this model for your own specific task, you will just need to provide your own dataset, and I'll show you later how you can structure it.

play02:33

data set first we need to install unslot

play02:37

they recommend to use the nightly

play02:39

version which is basically the the

play02:41

latest version unot uses a fast language

play02:45

model class for dealing with llms we're

play02:48

going to load the Lama 3.23 billion

play02:52

instruct model we're using this model

play02:54

because it's a relatively smaller model

play02:57

that you can potentially run on uh on

play02:59

device device such as on a smartphone uh

play03:02

another thing to highlight is I'm using

play03:04

the unslot version you can also use the

play03:07

Llama version directly you'll need to

play03:09

provide your hugging face um token ID

play03:12

and accept their terms and conditions

play03:14

the 11 billion and 90 billion models are

play03:17

not available in all regions and that

play03:20

has to do with its uh Vision

play03:22

capabilities so you need just need to be

play03:24

careful Uno currently does not support

play03:27

Vision models yet but hopefully they

play03:30

will um add support soon when you're

play03:32

loading the model you need to Define

play03:34

three different parameters the first one

play03:36

is the max sequence length in our case

play03:38

we are setting it to

play03:40

2048 this number is dependent on your

play03:43

training data set look at your training

play03:45

examples and see the maximum sequence

play03:47

length available in your data set and

play03:50

I'll recommend to set it to that but

play03:53

setting it to a higher value will also

play03:55

need more TPU vram so you need to be

play03:57

careful of that data types you can set

play03:59

to floating Point 16 or 8 but if you

play04:02

keep it none it will automatically

play04:04

select depending on your Hardware we're

play04:07

going to be using the 4bit quantization

play04:08

to reduce the U memory usage or memory

play04:12

footprint so here we're loading both the

play04:14

model as well as the tokenizer next I'm

04:17

Next, I'm adding LoRA adapters. We are not doing full fine-tuning, even though the model is pretty small; instead, we add LoRA adapters. These are separate modules that we target: we train completely separate modules and then merge them with the original model weights. There are a couple of other things to keep in mind. One is `r`, or rank: this determines how many parameters will be in your LoRA adapter. If you set it to a high number, the fine-tuning performance is usually better, but you are also training a larger number of parameters in your LoRA adapter, which means you will need more VRAM to train the adapters. Usually, 16 or 32 gives you a good compromise between memory footprint and performance. Another thing is the impact of the LoRA weights when they are merged back into the original weights of the model; that is set through the LoRA alpha.

05:28

Now, some points on the prompt template. Here's the prompt template that Llama 3.1 and 3.2 use. You need to make sure that the dataset you provide for fine-tuning actually follows this specific prompt template, because we're using the instruct version of the models. If you're fine-tuning the base model, you can provide your own template, but if you're working with an instruct or chat version, then you have to follow the template used by the model itself. The prompt template expects role and content, but you can see that the dataset we're using follows another format, with "from human" and "from gpt" fields, so it uses a different prompt template. We need to adjust it, and for that you can use the `get_chat_template` function from Unsloth. Basically, we provide the tokenizer and use the prompt template from Llama 3.1, which is similar to 3.2, and that will convert the whole dataset to our specific prompt template. So here we're loading the dataset. We need to go from the original format, where it's "from system", "from human", or "from gpt" along with a value, to the role-based approach: everything should be converted to role system, role user, and role assistant. We do that through the standardize-ShareGPT function.

07:01

Now, if you look at some example conversations, you can see the content and the role: the role is user, and the content is the question asked by the user; then we have a role of assistant, and the content is the response generated by the assistant. When you're formatting your own dataset, you will have to follow this specific prompt template in order to fine-tune a Llama 3.1 or 3.2 instruct version.

07:30

Another thing: the Llama 3.1 instruct default chat template adds a specific sentence to the system instruction. It actually tells the model that its training cutoff date was December 2023, and it adds today's date as July 26. So if you see something like this in responses from the model, don't be alarmed; it's just part of the system instruction, and later on we actually mask it.

08:00

In order to train the model, we are using the TRL library from Hugging Face, specifically the supervised fine-tuning trainer, because we are doing supervised fine-tuning in this case. We provide the model and the tokenizer, which come from Unsloth, and then we provide our dataset. We also tell it which column to use as the formatted prompt; we added a `text` column to the data. And we set the maximum sequence length for the training dataset. Now, here are some other specific parameters. A couple of things I want to highlight: if you set the number of epochs to one, for example, it will go through the whole dataset exactly once during training, but 100,000 examples is a pretty huge dataset, so that's going to take a while. That's why we set max steps to 60; you can set either the max steps or the number of epochs. What's the relationship between the two? It's determined by the batch size. To get the total number of steps in an epoch, you divide the size of the dataset by the batch size: for example, with 100 examples and a batch size of two, you get a maximum of 50 steps in the epoch. We're just running for 60 steps, which is a small fraction of the total number of steps possible for 100,000 examples. The reason is that we don't want to run it for a long time; I just want to show you an example, and that's why you're probably not going to see a greatly trained model. To get really good training output, you definitely want to run it for a lot longer.

09:47

The learning rate determines the speed of convergence. If you set it to a high number, training is going to be faster, but it might not converge. You usually want to find a sweet spot where the learning rate is small enough that training converges, but that will also take much longer to train.

10:08

Okay, one more thing: you want to train the model on the outputs, not on the inputs. That's why you want to calculate the model's loss on the output from the assistant, not on the inputs from the user: the model should see the user input, generate a response, and then compare that output with the original, gold-standard output, or ground truth, and that's where you compute the loss. This section takes care of that: it forces the trainer to use only the output when computing the training loss (or the test loss, if you have a test dataset). Now you can look at what the tokenized version of the dataset looks like. You can see that we have the system role, then the well-formatted user input, and then the well-formatted assistant response; this is the dataset we will use to train our model. If you want to get rid of the system message part, you can mask it. Here we're masking it, and now you don't see the original system message anymore; you only see the output the model is supposed to generate.

11:19

Next, we call the train function on the trainer that we created. You can see that the loss goes down, then comes up again; the reason is that we're running for a very small number of steps. We could also play around with the learning rate, which controls the speed of convergence. These are parameters you need to experiment with: if you're using bigger batch sizes, you can set the learning rate to a relatively higher value, but bigger batch sizes also depend on the GPU VRAM you have available, so there has to be a compromise between these hyperparameters.

11:55

Okay, after this training, if we run a specific prompt through the trained model, here is the response that we get. We see the system message, but we'd have to mask that ourselves. As for the user input, here it is: continue the Fibonacci sequence. So we provide the Fibonacci sequence, and then the response generated by the model follows. You can also stream the output if you want; here's an example of the streamed response, which does the same thing, but in a streaming fashion.

12:32

Once you've trained the model, you can either push it to the Hugging Face Hub or store it locally. I'm mostly interested in how to store the GGUF version of the model, because I want to load it in Ollama and run it locally. For that to work, you just need to call `save_pretrained_gguf`, provide the model name (I'm calling it model-3-billion), provide the token, and set the level of quantization. Since it's a relatively smaller model, I wanted to run it in 16-bit floating-point precision. Keep in mind that this step will take quite a long time, because it first has to download and build llama.cpp and then convert the model to GGUF format. So here's the model that I downloaded from Google Colab; if you run the training locally, you are going to see the unsloth fp16 GGUF file. I downloaded the model from Google Colab; now let me show you the rest of the process.

13:32

Next, let me show you how to run that trained file locally using Ollama. Ollama uses the concept of a model file (Modelfile), which is basically a set of configurations you provide so Ollama can use a model locally. There are a number of different instructions you can use. There's an instruction called FROM, where you tell it which model to use; you can set different parameters, such as the temperature and the max context window; and you can also provide the full prompt template for the model. Here is a quick example: if you want to use Llama 3.2 with a configuration different from the default, you write FROM llama3.2; here they're changing the temperature to one, the max context window is changed to 4096, and a simple system instruction is provided. If you go to any model on Ollama, you can see this template; if you click on it, you'll see the Modelfile used by that model.

14:31

I have downloaded the GGUF file that was created after fine-tuning the model in the Google Colab notebook; you just want to look for the file ending in .gguf. I downloaded it here and then created another file, called FineLlama, in which I write FROM and then provide that model name ending in .gguf; this is the Modelfile we're going to be using. We could also include a TEMPLATE instruction to define the prompt template, but since it's already in the tokenizer, I don't need to do that. You can also define the system prompt, but in this case we want to mask it, so I'm not going to add that either.

15:10

Now you need to have Ollama up and running. After that, you provide some details to create this model in Ollama: we type the command `ollama create`, then what you want the model to be called, then the `-f` parameter with the path to the Modelfile we created; it's in the same directory. When you run it, this will start transferring the data, and if everything goes well, it will create the model for us; it's using the template from Llama 3 instruct. It seems like everything was successful, so now we can run our model. But before that, let me show you that it shows up in the model list: here we have FineLlama, the model we just created, along with a whole bunch of other models I have already downloaded. Now, in order to run this model, all we need to do is type `ollama run` and, just like with any other Ollama model, provide the name. It's a 3 billion model, so it's going to be extremely fast. If I say hi, you can see that it generates responses pretty well and pretty quickly. All right, I'm going to ask it to write a program in Python to move files from S3 to a local directory, and you can see it's really fast, because it's just a 3 billion model running completely locally, and it's the model that we just fine-tuned.

16:40

Okay, so this was a quick video on how to fine-tune Llama 3.2 using Unsloth and then run it locally on your own machine using Ollama. I hope this was helpful. I'll put a link to the Google Colab in the video description. In this video, I only focused on the 3 billion model; the same approach will apply to the 1 billion model. For the 11 and 90 billion models, the approach is a little different, because they have an adapter for the vision component, so the same approach probably is not going to apply. But I'm going to be creating some videos specifically focused on the vision models, because I think there are some great applications there, specifically for vision-based RAG, which is a topic I'm personally interested in. If that interests you, make sure to subscribe to the channel. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.

Related Tags
Llama 3.2, Fine-tuning, Unsloth, Local AI, Language models, Vision tasks, Ollama, AI training, Model deployment, Hugging Face