EASIEST Way to Fine-Tune a LLM and Use It With Ollama

warpdotdev
12 Sept 2024 · 05:17

Summary

TLDR: This tutorial video guides viewers on fine-tuning a large language model (LLM) for local use with Ollama. It emphasizes selecting the right dataset, such as synthetic text-to-SQL, for task-specific performance. The presenter demonstrates setting up the environment with Unsloth and Llama 3.1, adjusting model parameters for efficiency, and using tools like Jupyter Notebook. The script covers data formatting, model training with Hugging Face's Trainer, and converting the model for local deployment. Finally, it shows how to run the fine-tuned LLM using Ollama's Docker-like Modelfile configuration.

Takeaways

  • πŸ” Finding the right dataset is crucial for training a language model that can outperform larger models when the data is relevant to the task.
  • πŸ’» The video demonstrates creating a small, fast language model tailored for generating SQL data based on table data.
  • πŸš€ Utilizing a Nvidia 4090 GPU on Ubuntu or Google Colab for training without needing complex hardware.
  • πŸ“š Unsloth is highlighted as a tool for efficient fine-tuning of open-source models with significantly reduced memory usage.
  • πŸ¦™ Llama 3.1 is chosen as the language model for its high performance in English for commercial and research purposes.
  • πŸ› οΈ Anaconda and Cuda libraries are prerequisites, with Cuda 12.1 and Python 3.10 specified for the project.
  • πŸ“¦ Dependencies for Unsloth are installed, setting up a new environment with PyTorch and Cuda libraries.
  • πŸ”„ The script guides through setting up a Jupyter notebook, importing the fast language model, and configuring it with a max sequence length.
  • πŸ”— The use of LoRA adapters is explained, which allows updating only a small portion of model parameters, saving resources.
  • πŸ“ˆ The script details the process of formatting the dataset and setting up the training module with parameters like max steps and seed.
  • πŸ”§ Post-training, the model needs to be converted into a compatible file type to run locally using Ollama.
  • πŸ“ A step-by-step guide is provided to run the fine-tuned model locally with Ollama, including creating a model file and running a command.

Q & A

  • What is the main focus of the video?

    -The main focus of the video is to guide viewers on fine-tuning a large language model (LLM) and running it locally with Ollama, specifically for generating SQL queries from table data.

  • Why is finding the right dataset important for training an LLM?

    -Finding the right dataset is crucial because training a large language model with a relevant dataset can lead to better performance than larger models that lack task-specific data.

  • What is the synthetic text to SQL dataset mentioned in the video?

    -The synthetic text-to-SQL dataset is a large dataset with over 105,000 records, split into columns such as the prompt, SQL content, complexity, and more, which makes it ideal for training an LLM to generate SQL queries.
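
As a rough sketch, the dataset can be pulled from the Hugging Face Hub with the datasets library. The hub ID gretelai/synthetic_text_to_sql and the column names in the comments are assumptions based on the dataset's public card, not something the video confirms on screen:

```python
# Minimal sketch: load and inspect the synthetic text-to-SQL dataset.
# The hub ID is an assumption; adjust if your copy lives elsewhere.
from datasets import load_dataset

dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(dataset.num_rows)       # roughly 100k training records
print(dataset.column_names)   # e.g. sql_prompt, sql_context, sql, sql_explanation
print(dataset[0])             # peek at a single record
```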

  • What hardware does the presenter use for fine-tuning the model?

    -The presenter uses an Nvidia 4090 GPU for fine-tuning the model on their machine running Ubuntu.

  • Can the process be done without a GPU?

    -Yes, the process can be done without a GPU by using Google Colab, which allows running training code in the cloud.

  • What is Unsloth and how does it help in fine-tuning LLMs?

    -Unsloth is a tool that allows efficient fine-tuning of open-source models with about 80% less memory usage, making it suitable for fine-tuning LLMs without extensive hardware requirements.

  • Which LLM version does the presenter use and why?

    -The presenter uses Llama 3.1, an LLM designed for commercial and research purposes, especially in English, known for its high performance.

  • What software prerequisites are needed for this project?

    -The prerequisites include Anaconda, the CUDA libraries (CUDA 12.1 in this project), Python 3.10, and the dependencies required by Unsloth.

  • What does the presenter mean by setting 'load in four bit to true'?

    -Setting 'load_in_4bit' to true means representing the model's weights with 4 bits instead of the typical 16 or 32, which reduces memory usage and the load on the machine.
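
A minimal sketch of what this looks like with Unsloth's loader; the exact model identifier is an assumption (the video uses a Llama 3.1 variant):

```python
# Sketch: load a 4-bit quantized Llama 3.1 with Unsloth.
# The model name is an assumption; check Unsloth's README for current IDs.
from unsloth import FastLanguageModel

max_seq_length = 2048  # the token window mentioned in the video

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,          # auto-detect float16/bfloat16
    load_in_4bit=True,   # 4-bit weights instead of 16- or 32-bit
)
```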

  • What are LORA adapters and why are they used in this context?

    -LoRA adapters are a method to update only a small portion (1 to 10%) of the model's parameters during fine-tuning, avoiding the need to retrain the entire model, which would be time-consuming and resource-intensive.
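
Continuing the sketch above, attaching the adapters with Unsloth looks roughly like this; the specific values mirror Unsloth's example notebooks and are assumptions rather than the video's exact settings:

```python
# Sketch: wrap the loaded model with LoRA adapters so only a small
# fraction of the parameters is trained. Values follow Unsloth's examples.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # rank of the low-rank update matrices
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # trades compute for memory
    random_state=3407,
)
```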

  • How does the presenter format the data for training?

    -The presenter formats the data to focus specifically on the SQL aspects of the dataset, updating the code to keep only the fields of interest: the prompt, the generated SQL, and its explanation.
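
A hedged sketch of that formatting step, mapping each record into an Alpaca-style prompt; the column names (sql_prompt, sql, sql_explanation) are assumptions about the dataset's schema:

```python
# Sketch: turn each record into one Alpaca-style training string.
# Column names are assumptions; adjust to the dataset's actual schema.
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # so the model learns when to stop generating

def formatting_prompts_func(examples):
    texts = []
    for prompt, sql, explanation in zip(
        examples["sql_prompt"], examples["sql"], examples["sql_explanation"]
    ):
        response = sql + "\n\n" + explanation
        texts.append(alpaca_prompt.format(prompt, response) + EOS_TOKEN)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
```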

  • What is the role of the Trainer by Hugging Face in the training process?

    -Hugging Face's Trainer is used to set up the supervised fine-tuning run, with parameters such as max steps, seed, and warmup steps that control the training process.
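
Unsloth's example notebooks typically use the SFTTrainer from Hugging Face's trl library for this step; the sketch below follows that pattern with illustrative hyperparameters, not the video's exact values (argument names vary slightly across trl versions):

```python
# Sketch: supervised fine-tuning in the style of Unsloth's notebooks.
# Hyperparameters are illustrative, not the video's exact settings.
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,      # ramps the learning rate up gradually
        max_steps=60,        # how many training steps to perform
        learning_rate=2e-4,
        fp16=True,           # use bf16=True instead on GPUs that support it
        logging_steps=1,
        seed=3407,           # fixed seed so runs are reproducible
        output_dir="outputs",
    ),
)
trainer.train()
```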

  • How does one run the fine-tuned model locally using Ollama?

    -To run the fine-tuned model locally with Ollama, one needs to convert the model into the right file type, create a Modelfile with the desired parameters, and then run a command that uses Ollama's Docker-like configuration to load and execute the model.
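
Under the same assumptions, the final step looks roughly like this: Unsloth's GGUF export one-liner, then a minimal Modelfile written from Python. The quantization method, file names, and system prompt are illustrative:

```python
# Sketch: export to GGUF for Ollama, then write a minimal Modelfile.
# Output file name and quantization method are assumptions.
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

modelfile = """FROM ./unsloth.Q4_K_M.gguf
SYSTEM You are an SQL generator that takes a user's query and gives them helpful SQL to use.
"""
with open("model/Modelfile", "w") as f:
    f.write(modelfile)

# Then, in a terminal with Ollama running:
#   ollama create sql-generator -f model/Modelfile
#   ollama run sql-generator
```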

Outlines

00:00

πŸ’» Fine-Tuning LLM with Ollama

The video script introduces a tutorial on fine-tuning a large language model (LLM) locally and running it with a tool called Ollama. The presenter emphasizes the importance of finding the right dataset for training an LLM, mentioning that a relevant dataset can lead to better performance than larger models. The goal is to create a small, fast LLM that generates SQL data from table data. The presenter plans to use the 'synthetic text to SQL' dataset, which contains over 105,000 records. The setup involves using an Nvidia 4090 GPU on Ubuntu, but alternatives like Google Colab are suggested for those without a GPU. The script mentions Unsloth for efficient fine-tuning with less memory usage and Llama 3.1 as the LLM. The presenter also covers the installation of Anaconda, the CUDA libraries, and Unsloth's dependencies. The process includes setting up a Jupyter notebook, importing the LLM, configuring the model with a max sequence length, and using LoRA adapters for efficient training. The script concludes with instructions on setting up the training module and running the training process.

05:02

πŸ”— Wrapping Up with Ollama

The second paragraph of the video script provides a brief outro, mentioning a two-minute video for further information about Ollama. The presenter thanks the viewers for watching and hints at a future video, indicating the end of the tutorial. This part of the script serves as a conclusion and a teaser for additional content related to Ollama.

Keywords

πŸ’‘Fine-tuning

Fine-tuning refers to the process of training a pre-existing machine learning model on a specific task with a new, often smaller, dataset. In the context of the video, fine-tuning is used to adapt a large language model (LLM) to generate SQL data based on table data. The script mentions fine-tuning a model using a specific dataset called 'synthetic text to SQL' to improve its performance on the task.

πŸ’‘Dataset

A dataset is a collection of data that is used to train or test machine learning models. The video emphasizes the importance of finding the right dataset because it can significantly impact the model's performance. The 'synthetic text to SQL' dataset is highlighted as a large dataset with over 105,000 records, which is crucial for training the LLM to generate SQL queries.

πŸ’‘LLM (Large Language Model)

An LLM is a type of artificial neural network designed to predict and understand human language based on vast amounts of text data. The video discusses creating a small, fast LLM that can generate SQL queries, indicating that even a smaller LLM can outperform larger models when trained with a relevant dataset.

πŸ’‘Nvidia 4090 GPU

The Nvidia 4090 GPU is a high-performance graphics processing unit mentioned in the video as the hardware used for fine-tuning the model on the presenter's machine. GPUs are essential for machine learning tasks as they can perform parallel computations much faster than CPUs, accelerating the training process.

πŸ’‘Google Colab

Google Colab is a cloud-based platform that allows users to run Jupyter notebooks with various computing resources, including GPUs, for free. The video suggests using Google Colab as an alternative to local GPU hardware for those who don't have access to a suitable machine.

πŸ’‘Unsloth

Unsloth is a tool mentioned in the video for fine-tuning open-source models efficiently with reduced memory usage. It's used in the script to fine-tune the LLM with about 80% less memory, showcasing how it can help in managing system resources during the training process.

πŸ’‘Llama 3.1

Llama 3.1 is an LLM variant used for commercial and research purposes, particularly in English, noted for its high performance. In the video, it's specified as the model to be used for fine-tuning, indicating its suitability for tasks involving the English language.

πŸ’‘Anaconda

Anaconda is a distribution of Python and R for scientific computing that aims to simplify package management and deployment. The video instructs installing Anaconda as part of the setup process for the project, highlighting its importance in managing the Python environment and dependencies.

πŸ’‘Cuda

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface model created by Nvidia. The video specifies using CUDA 12.1, which is necessary for leveraging the power of Nvidia GPUs in the training process.

πŸ’‘Jupyter Notebook

A Jupyter Notebook is an open-source web application that allows creation and sharing of documents containing live code, equations, visualizations, and narrative text. The video mentions running a Jupyter notebook as part of the setup process, indicating its use for interactive computing in the project.

πŸ’‘Max Sequence Length

Max sequence length is a parameter that defines the maximum length of input sequences the model will process. In the video, a max sequence length of 2048 tokens is set, which means the model will only consider up to 2048 tokens for each input, affecting how the model processes and generates text.

πŸ’‘LoRA Adapters

LoRA (Low-Rank Adaptation) adapters are a technique for efficiently fine-tuning pretrained language models by only updating a small percentage of the model's parameters. The video mentions using LoRA adapters to update only 1 to 10% of the model's parameters, reducing the computational cost of training.

πŸ’‘Hugging Face Trainer

Hugging Face Trainer is a tool used for training and fine-tuning machine learning models. In the video, it's used to set up the training module for fine-tuning the LLM, with various parameters like max steps, seed, and warmup steps that control the training process.

πŸ’‘Ollama

Ollama is a tool for running large language models locally. The video describes the process of converting the trained model into a format compatible with Ollama and running it locally, which allows the use of the fine-tuned model without relying on cloud services.
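
Since the outro mentions using the model through an OpenAI-compatible API, here is a hedged sketch of querying a locally running Ollama model that way; the model name sql-generator is an assumption carried over from the Modelfile sketch earlier:

```python
# Sketch: query a local Ollama model via its OpenAI-compatible endpoint.
# Assumes Ollama is serving on its default port 11434 and a model named
# "sql-generator" was created from the Modelfile.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="sql-generator",
    messages=[{"role": "user", "content":
               "List all customers who placed an order in the last 30 days."}],
)
print(response.choices[0].message.content)
```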

Highlights

Fine-tuning a large language model locally using Ollama

Importance of finding the right dataset for task-specific performance

Creating a small, fast LLM for generating SQL data

Utilizing the synthetic text to SQL dataset with over 105,000 records

Running fine-tuning on an Nvidia 4090 GPU with Ubuntu

Option to use Google Colab for cloud-based training without a GPU

Using Unsloth for efficient fine-tuning with reduced memory usage

Llama 3.1 as the LLM, chosen for its high performance in English

Requirements for Anaconda and the CUDA libraries

Installation of dependencies for Unsloth

Setting up a new environment with PyTorch and the CUDA libraries

Installing Jupyter for running the Jupyter notebook

Loading the fast language model with a max sequence length of 2048 tokens

Using 4-bit loading to reduce memory usage and machine load

Loading the PEFT model with LoRA adapters for efficient parameter updates

Formatting the dataset for the LLM to understand

Setting up the training module for supervised fine-tuning

Running the training and converting the model for local use

Using Ollama's Docker-like file configuration for local model running

Running the fine-tuned LLM locally with Ollama

Using the fine-tuned LLM with an OpenAI-compatible API

Transcripts

play00:00

You want to fine-tune your large language model and run it locally on your machine

play00:03

using Ollama.

play00:07

Well, in today's video, we're going to do exactly that.

play00:09

So let's go.

play00:12

First, for the fun part: finding the right dataset.

play00:14

The reason why finding the right dataset is so important is that when you train

play00:18

a small large language model with a dataset

play00:21

that is relevant to the task you're trying to do,

play00:23

it can actually outperform large models.

play00:26

What I'm going to be doing today is creating a small, fast LLM

play00:29

that will generate SQL data based off of table data

play00:32

I provide.

play00:33

One of the biggest datasets to do this with is called synthetic text to SQL,

play00:37

which has over 105,000 records split

play00:41

into columns of prompt, SQL content, complexity, and more.

play00:45

I'm running an Nvidia 4090 GPU,

play00:47

so I'm going to be fine-tuning this on my machine using Ubuntu.

play00:50

If you don't have a GPU, feel free to do this using Google Colab,

play00:53

which allows you to run training code in the cloud.

play00:56

The great news is that this project

play00:57

does not require a lot of complex hardware to get it up and running.

play01:01

We're going to be using Unsloth

play01:02

which allows you to fine tune a lot of open source models

play01:06

really efficiently, with about 80% less memory usage.

play01:09

And we're going to be using Llama 3.1 which is an LLM used for commercial and

play01:13

research purposes, especially in English, and has really high performance.

play01:18

Make sure that you have Anaconda

play01:19

installed on your machine as well as the CUDA libraries.

play01:22

I will be using CUDA 12.1 and Python 3.10 for this project.

play01:26

You want to install

play01:27

the dependencies required by Unsloth which you can find in the Readme.

play01:31

But for simplicity, here it is.

play01:33

This creates a new environment for us

play01:35

and installs the PyTorch and CUDA libraries as well as the latest Unsloth.

play01:39

You'll also want to install Jupyter if it isn't there already,

play01:42

and then run your Jupyter notebook.

play01:44

And now you're done with the setup.

play01:46

So let's go into the Jupyter notebook and get started.

play01:48

First, we want to make sure that all the installed requirements are actually there.

play01:52

If you're using Google Colab, this command will install the packages.

play01:55

Next we're going to import the fast language model by Unsloth.

play01:58

Here we're specifying that we want to use the Llama 3.1 8B model.

play02:02

We also want to set up a max sequence length of 2048 tokens.

play02:06

This means that the model will only consider up to 2048 tokens,

play02:10

where a token can be a word, subword, character, or even punctuation,

play02:15

when processing or generating text. And we'll set load_in_4bit to true,

play02:19

which essentially means we're using fewer bits, as opposed to the typical 16

play02:23

or 32 bits to represent the information in the model.

play02:26

Doing this is going to help you

play02:28

reduce memory usage and also reduce the load on your machine.

play02:31

After running this,

play02:32

you're going to get a cute ASCII image, and that means that your model is loaded.

play02:35

After this,

play02:36

we're going to load in the PEFT model, which is basically LoRA adapters.

play02:40

If you don't know what these terms mean, that's totally fine.

play02:43

Basically, the LoRA adapters mean that

play02:45

we only have to update 1 to 10% of the parameters in this model.

play02:49

Without them, it means that

play02:50

we would have to retrain the whole model, not just a small portion,

play02:54

which takes a lot of time, energy, and even money.

play02:57

Unsloth provides this here with the recommended settings.

play03:00

I trust them, so feel free to read each comment.

play03:02

Now this is where things can get a little bit tricky depending on what data

play03:06

set you're using.

play03:06

Each dataset is different from the others,

play03:09

but they're each formatted in the same way such that the large language model

play03:12

can understand it.

play03:13

Llama 3 uses Alpaca prompts, which look like this.

play03:17

Now, if you remember our data set, it is not as easy as just plugging it in

play03:21

and letting it go off to the races.

play03:22

I have to format my response first before plugging in the data.

play03:26

I'm only interested in the SQL of my database.

play03:28

The prompt I will be asking for, as well as the generated code and explanation.

play03:32

So I'm going to update my code to reflect this.

play03:34

Now we set up the training module to do supervised fine-tuning.

play03:37

Trainer by Hugging Face is what I used.

play03:39

There are a lot of parameters to use, all of which could be described in their own video.

play03:43

So for example, we have max steps, which tells us how many training steps to perform.

play03:47

Seed is a random number generator seed

play03:48

we use to be able to reproduce results, and warmup steps gradually increases the

play03:53

learning rate over time.

play03:54

So now that we have everything set up, let's run it.

play03:59

And that's it.

play04:00

Your model has been trained.

play04:01

Now before we move on,

play04:02

we actually need

play04:03

to convert this into the right file type so that we can run this locally

play04:06

using Ollama.

play04:07

Luckily, Unsloth has a one-liner we can use to do this.

play04:10

After this is done, we only need to do one step to be able to run this with Ollama.

play04:15

First, open up your terminal.

play04:16

I'm using the Warp terminal here.

play04:18

Go to the path of where the file is saved.

play04:21

Then create a file called Modelfile and open it up in the code editor.

play04:25

This is Ollama's Docker-like file configuration

play04:28

where we can create new models with specific parameters.

play04:30

In our model file we're going to put a prompt.

play04:32

So something like you're an SQL generator that takes a user's query

play04:36

and gives them helpful SQL to use.

play04:38

Finally, make sure Ollama is running.

play04:40

And then we're just going to run this command.

play04:42

This command will then read all the items in the model file you just created,

play04:46

and start using llama.cpp under the hood to make sure that the model runs

play04:50

on your machine. And congrats!

play04:51

You can now use your fine-tuned LLM locally,

play04:54

all with an OpenAI-compatible API and more in your applications.

play05:01

If you're

play05:01

curious to know more about Ollama, we do have a two-minute video

play05:05

out about everything you need to know about Ollama here.

play05:08

Otherwise, thank you for watching and I'll see you next time.


Related Tags
LLM Tuning · SQL Generator · Data Science · Machine Learning · Nvidia 4090 · Google Colab · Unsloth · Llama 3.1 · Anaconda · CUDA 12.1