Fine-tuning Tiny LLM on Your Data | Sentiment Analysis with TinyLlama and LoRA on a Single GPU

Venelin Valkov
29 Jan 2024 · 31:41

Summary

TLDR: In this tutorial video, Venelin shows how to fine-tune the TinyLlama language model on a custom cryptocurrency news dataset. He covers preparing the data, setting the correct parameters for the tokenizer and model, training the model efficiently with LoRA in a Google Colab notebook, evaluating model performance, and running inference with the fine-tuned model. The goal is to predict the subject and sentiment of new crypto articles. With only about 40 minutes of training, the fine-tuned TinyLlama model achieves promising results: around 79% subject accuracy and over 90% sentiment accuracy.

Takeaways

  • 📚 Venelin explains the process of fine-tuning the TinyLlama model on a custom dataset, beginning with dataset preparation and proceeding through training to evaluation.
  • 🔧 Key steps include setting up tokenizer and model parameters, using a Google Colab notebook, and evaluating the fine-tuned model on a test set.
  • 🌐 The tutorial includes a complete text guide and a Google Colab notebook link, available in the MLExpert bootcamp section for Pro subscribers.
  • 🤖 TinyLlama is preferred over larger 7B-parameter models because of its smaller size, faster inference and training, and suitability for older GPUs.
  • 📈 Fine-tuning is essential for improving model performance, especially when prompt engineering alone doesn't suffice, and for adapting the model to specific data or privacy needs.
  • 📊 For dataset preparation, a minimum of 1,000 high-quality examples is recommended, and consideration of task type and token count is crucial.
  • 🔍 The tutorial uses the 'Crypto News+' dataset from Kaggle, focusing on sentiment and subject classification of cryptocurrency news.
  • ⚙️ Venelin demonstrates using Hugging Face's datasets library and the tokenizer configuration, emphasizing the importance of a correct padding setup for avoiding repetition.
  • 🚀 The training process uses LoRA (Low-Rank Adaptation) to train a small adapter on top of the frozen base TinyLlama model.
  • 📝 Evaluation results show high accuracy in predicting subjects and sentiments from the news dataset, validating the effectiveness of the fine-tuning process.

Q & A

  • What model is used for fine-tuning in the video?

    -The TinyLlama model, a 1.1-billion-parameter model trained on about 3 trillion tokens.

  • What techniques can be used to improve model performance before fine-tuning?

    -Prompt engineering can be used before fine-tuning to try to improve model performance. This involves crafting the prompts fed into the model more carefully without changing the model itself.

  • How can LoRA be used during fine-tuning?

    -LoRA trains only a small set of weights, called an adapter, on top of a base model like TinyLlama. This reduces memory requirements during fine-tuning.

  • What data set is used for fine-tuning in the video?

    -A cryptocurrency news data set containing titles, text, sentiment analysis labels, and subjects for articles is used.

  • How can the data set be preprocessed?

    -The data can be split into train, validation, and test sets. The distributions of labels can be analyzed to check for imbalances. A template can be designed for formatting the inputs.

  • What accuracy is achieved on the test set?

    -An accuracy of 78.6% is achieved on subject prediction on the test set. An accuracy of 90% is achieved on sentiment analysis on the test set.

  • How can the fine-tuned model be deployed?

    -The adapter can be merged into the original TinyLlama model and pushed to the Hugging Face Hub. The merged model can then be deployed behind an API for inference in production.

  • What batch size is used during training?

    -A batch size of 4 is used with gradient accumulation over 4 iterations to simulate an effective batch size of 16.

  • How are only the model completions used to calculate loss?

    -A special collator sets the labels for all tokens before the completion template to -100 so they are ignored in the loss calculation (see the sketch after this Q&A list).

  • How can the model repetitions be reduced?

    -The repeated subject and sentiment lines could be removed from the completion template to improve quality.
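
Below is a minimal sketch of how the completion-only loss masking might be wired up with trl's DataCollatorForCompletionOnlyLM. The response marker string and model id are assumptions, not the exact values from the notebook.

```python
# Sketch: mask everything before the completion so only the prediction tokens
# contribute to the loss. The "prediction:" marker is an assumed template string.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
tokenizer.add_special_tokens({"pad_token": "<PAD>"})

response_template = "prediction:"  # assumed marker that precedes the completion
# Encoding the marker (without special tokens) and passing token ids avoids
# tokenizer edge cases when the collator searches for the marker in each example.
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)

collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
# For each batch the collator copies input_ids into labels and sets every token up to
# and including the marker to -100, so those positions are ignored by the loss.
```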

Outlines

00:00

🚀 Fine-tuning a Tiny Language Model on Custom Data

Venelin introduces a tutorial on fine-tuning a tiny language model (LLM) on a custom dataset, covering everything from data preparation through training to evaluation in a Google Colab notebook. He highlights the advantages of smaller models like TinyLlama over larger models for faster inference and training, and the significance of fine-tuning for improved performance on specific tasks. The tutorial promises a step-by-step guide for MLExpert Pro subscribers, emphasizing the need for high-quality data and the process of selecting and preparing the dataset for fine-tuning.

05:00

📊 Preparing and Understanding Your Dataset for Fine-tuning

This section delves into dataset preparation, focusing on selecting tasks and ensuring data quality. Venelin uses a cryptocurrency news dataset from Kaggle, detailing the process of creating training, validation, and test splits. He emphasizes the importance of stratified sampling to keep the label distribution representative across splits and discusses handling class imbalance. The dataset includes sentiment and subject labels for news articles, serving as the basis for training TinyLlama to predict news sentiment and subject accurately.
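
As a rough illustration of the stratified split described here (the file name, column names, and split sizes are assumptions, not the exact ones from the notebook):

```python
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

df = pd.read_csv("cryptonews.csv")  # hypothetical file name for the Kaggle CSV

# Carve out a test set first, then split the rest into train/validation,
# stratifying on the subject so every split keeps the original class frequencies.
train_val_df, test_df = train_test_split(df, test_size=0.1, stratify=df["subject"], random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.1, stratify=train_val_df["subject"], random_state=42)

print(train_df["subject"].value_counts(normalize=True))  # should match the full dataset

# Wrap the splits in a Hugging Face DatasetDict for training later on.
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df.reset_index(drop=True)),
    "validation": Dataset.from_pandas(val_df.reset_index(drop=True)),
    "test": Dataset.from_pandas(test_df.reset_index(drop=True)),
})
```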

10:01

🔧 Setting Up Tokenizer and Model Configuration

Venelin explains the setup for the tokenizer and model configuration, including adding a padding token and resizing the token embeddings of the TinyLlama model. He discusses the importance of correct padding to avoid repetition and the use of GPU capabilities such as flash attention for training. The section also covers how to fit data within the model's context window using a specific template, and the preparation steps for using the model with LoRA, highlighting the benefits of training a small adapter for efficiency.
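
A minimal sketch of that setup, assuming the non-chat 3T TinyLlama checkpoint; the added pad-token string is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens({"pad_token": "<PAD>"})  # TinyLlama ships without a pad token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    # attn_implementation="flash_attention_2",  # enable on GPUs that support it (not the T4)
)

# The vocabulary grew by one token, so grow the embedding table with it,
# padded to a multiple of 8 for efficiency.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
model.config.pad_token_id = tokenizer.pad_token_id
```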

15:02

⚙️ Applying LoRA and Training the Model

This part focuses on applying LoRA to fine-tune the TinyLlama model, targeting specific model layers for adaptation and discussing the configuration for efficient training. Venelin shares insights on optimizing training parameters, like batch size and learning rate, and introduces training on completions only to improve model performance. He provides a detailed walkthrough of setting up the training arguments and using a data collator that focuses the loss calculation on specific parts of the model output.
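
A sketch of a matching LoRA configuration with rank and alpha of 128 and the attention plus MLP projections as targets; the module names follow the Llama architecture that TinyLlama uses, and the dropout value is an assumption:

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=128,              # higher rank than usual so roughly 8% of parameters are trainable
    lora_alpha=128,     # alpha equal to the rank keeps the effective learning-rate scaling at 1
    lora_dropout=0.05,  # the "small dropout" mentioned in the video (exact value assumed)
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # self-attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # expect roughly 100M trainable parameters (~8.4%)
```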

20:03

📝 Training Insights and Evaluation Techniques

Venelin shares his training insights, noting the effectiveness of a smaller batch size with gradient accumulation for better training dynamics. He outlines the training process, including the optimizer choice and the rationale behind the training setup. The section also covers model evaluation strategies, demonstrating how to test the fine-tuned model's performance on the dataset and analyze results for both subject and sentiment prediction accuracy using confusion matrices and accuracy calculations.
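
The training setup might look roughly like the following sketch; the output folder, learning rate, and logging cadence are assumptions, and `dataset`, `peft_model`, `tokenizer`, and `collator` refer to the earlier sketches:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="tinyllama-crypto-news",   # hypothetical output folder
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # 4 x 4 = effective batch size of 16
    num_train_epochs=1,
    learning_rate=2e-4,                   # assumed; the video does not state the value
    fp16=True,                            # fp16 training, no quantized optimizer needed
    optim="adamw_torch",
    lr_scheduler_type="constant",
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    save_strategy="epoch",
    report_to="tensorboard",
)

def format_prompts(examples):
    # Batched formatting function: build the full training text for each row
    # using an assumed title/text/prediction template.
    return [
        f"title: {t}\ntext: {x}\nprediction:\nsubject: {s}\nsentiment: {m}"
        for t, x, s, m in zip(examples["title"], examples["text"],
                              examples["subject"], examples["sentiment"])
    ]

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    max_seq_length=512,                   # assumed; the formatted examples are far shorter
    formatting_func=format_prompts,
    data_collator=collator,               # the completion-only collator sketched earlier
)
trainer.train()

trainer.model.save_pretrained("tinyllama-crypto-news")
tokenizer.save_pretrained("tinyllama-crypto-news")
```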

25:05

🎯 Achieving Accurate Predictions and Model Deployment

The final section showcases the fine-tuned model's ability to accurately predict news subjects and sentiments, with examples demonstrating its performance. Venelin discusses the potential for discrepancies between model predictions and dataset labels, suggesting the model's predictions are sometimes more accurate than the labels. He concludes by outlining plans for deploying the model in production, emphasizing the significance of fine-tuning in achieving high accuracy and announcing an upcoming tutorial on model deployment and API integration.
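
Inference with the fine-tuned model could then look roughly like this sketch; the template helper and column names are illustrative, and `model`, `tokenizer`, and `test_df` refer to the earlier sketches:

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model=model,             # the fine-tuned (or merged) model loaded back in
    tokenizer=tokenizer,
    max_new_tokens=16,       # the subject and sentiment fit comfortably in 16 tokens
)

def format_for_prediction(example):
    # Same assumed template as training, but without the answer part.
    return f"title: {example['title']}\ntext: {example['text']}\nprediction:\n"

row = test_df.iloc[0]        # one held-out article from the test split
output = generator(format_for_prediction(row))
print(output[0]["generated_text"])
```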

Keywords

💡fine-tuning

Fine-tuning refers to the process of taking a pre-trained language model like TinyLlama and customizing it by training the model further on your own dataset. This improves the model's performance on the specific tasks and data that are relevant for your needs. The video discusses how to properly prepare a dataset and configure parameters before fine-tuning TinyLlama within a Google Colab notebook.

💡Tiny LLM

Tiny LLMs are a class of relatively small transformer-based language models with roughly 1-10 billion parameters. Compared to models with hundreds of billions of parameters, tiny LLMs enable faster inference and training. However, fine-tuning is often needed to boost their performance on specialized tasks. The video focuses specifically on fine-tuning the TinyLlama model on a cryptocurrency news dataset.

💡dataset preparation

Properly preparing the dataset is key before fine-tuning a model. The video recommends having over 1,000 high-quality, human-reviewed examples. It also advises thinking about the task types, the input/output format, and the context window when structuring the data.
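
For the token-count check against the context window, a sketch along these lines works; the template layout is an assumption, and `tokenizer` and `train_df` refer to the earlier sketches:

```python
def format_example(row):
    # Assumed layout: title and text as the prompt, subject and sentiment as the completion.
    return (
        f"title: {row['title']}\n"
        f"text: {row['text']}\n"
        "prediction:\n"
        f"subject: {row['subject']}\n"
        f"sentiment: {row['sentiment']}"
    )

token_counts = [
    len(tokenizer(format_example(row))["input_ids"])
    for _, row in train_df.iterrows()
]
print(max(token_counts))  # should stay well below TinyLlama's 2048-token context window
```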

💡tokenization

Tokenization refers to splitting text into tokens that serve as the input representation fed into language models. The video discusses adding a special padding token to the TinyLlama tokenizer so that batches can be padded to a uniform length during fine-tuning.

💡LoRA adapters

LoRA trains only a small percentage of TinyLlama's parameters during fine-tuning, which reduces memory requirements. This makes it possible to fit TinyLlama on a single GPU for training by restricting the trainable weights to the adapter layers.

💡sentiment analysis

One of the key tasks framed in the video is predicting the sentiment (positive/neutral/negative) of cryptocurrency news articles. Fine-tuning Tiny L to make better sentiment predictions is one of the end goals.

💡subject classification

Besides sentiment analysis, the other main task is classifying crypto news into subjects like Bitcoin, blockchain, DeFi, etc. Fine-tuning to boost TinyLlama's performance on subject classification is thus another goal.

💡deployment

The final step hinted at is deploying the fine-tuned TinyLlama behind an API for production inference. This could allow querying the model to analyze the sentiment, subject, etc. of new crypto news articles.
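
A sketch of the merge-and-push step hinted at here, with placeholder repository and folder names:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

ADAPTER_DIR = "tinyllama-crypto-news"  # folder where the adapter and tokenizer were saved
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

base = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", torch_dtype=torch.float16
)
# The base vocabulary must match the training-time vocabulary (pad token added).
base.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)

# Load the trained adapter on top of the base weights and fold it in.
merged_model = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

merged_model.push_to_hub("your-username/tinyllama-crypto-news")  # placeholder repo id
tokenizer.push_to_hub("your-username/tinyllama-crypto-news")
```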

💡model evaluation

Evaluating model performance after fine-tuning is important. The video illustrates how to check accuracy metrics on a held-out test set to quantify how much fine-tuning has improved TinyLlama for the desired tasks.
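
The accuracy check might be sketched like this, assuming a dataframe of test-set predictions (`pred_df` and its column names are illustrative):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# pred_df is an assumed dataframe with one row per test article.
subject_acc = accuracy_score(pred_df["true_subject"], pred_df["predicted_subject"])
sentiment_acc = accuracy_score(pred_df["true_sentiment"], pred_df["predicted_sentiment"])
print(f"subject accuracy: {subject_acc:.1%}, sentiment accuracy: {sentiment_acc:.1%}")

labels = sorted(pred_df["true_subject"].unique())
cm = confusion_matrix(pred_df["true_subject"], pred_df["predicted_subject"], labels=labels)
print(cm)  # rows are the true subjects, columns the predicted ones
```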

💡Google Colab

Google Colab provides free cloud GPUs for running Jupyter notebooks. The presenter explains how the entire TinyLlama fine-tuning pipeline can be executed in a Colab notebook using the free-tier GPU.

Highlights

Introduction to fine-tuning TinyLlama on custom datasets using the Google Colab free tier.

Advantages of choosing TinyLlama over larger language models for faster inference and training.

Importance of fine-tuning for improving model performance on specific tasks.

Guidance on dataset preparation and the need for high-quality examples.

Using TinyLlama for multiple tasks, showcasing versatility in application.

Crypto News+ dataset example to demonstrate fine-tuning on real-world data.

Detailed process of tokenizer and model preparation for training.

Utilizing LoRA (Low-Rank Adaptation) for efficient fine-tuning.

Strategies for managing GPU memory limitations during model training.

Fine-tuning model performance by adjusting the LoRA configuration parameters.

Introduction to training with completion collators for focused learning.

Techniques for achieving lower training loss and effective model evaluation.

Saving and reloading fine-tuned models for inference.

Demonstrating the fine-tuned model's accuracy in predicting news subjects and sentiments.

Future directions on deploying the fine-tuned model for production and API integration.

Transcripts

play00:00

hey everyone my name is Vin and in this

play00:02

video we're going to have a look at how

play00:03

you can fine-tune a tiny LLM on your own data

play00:07

set we're going to start with preparing

play00:09

the data set for training then we're

play00:12

going to have a look at what parameters

play00:13

you need to set in order to get your

play00:16

tokenizer and model prepared for

play00:18

training along with the LoRA setup then we're

play00:21

going to train the model within a Google

play00:23

Colab notebook free tier finally we are

play00:26

going to load the trained model and do

play00:29

evaluation on a test set to see whether

play00:31

or not the fine tuned model is doing a

play00:34

good job let's get started if you want

play00:37

to follow along there will be a complete

play00:39

text tutorial along with the link to a

play00:41

Google Colab notebook for this video and

play00:44

this will be available within the

play00:45

bootcamp section of ML expert. and then

play00:48

fine-tuning TinyLlama on a custom data set

play00:51

this is available for MLExpert Pro

play00:53

subscribers so if you want to support my

play00:56

work and get access to this please go

play00:59

and subscribe to mxer Pro thanks so what

play01:02

do you need in order to fine-tune a tiny

play01:05

LLM first we're going to go through why

play01:08

you might want to choose a tiny LLM

play01:10

over something like Llama 7 billion parameter

play01:14

models then we're going to have a look

play01:16

at why you would need to do some

play01:19

fine-tuning then we're going to have a

play01:21

look at some of the checkpoints that you

play01:23

need to cover in order to choose and

play01:26

prepare your data set and finally I'm

play01:29

going to give you some tips in order to

play01:32

fine-tune a tiny LLM using LoRA so why tiny

play01:37

LLMs first and most importantly those types

play01:40

of models are relatively small or

play01:43

smaller compared to regular watch

play01:45

language models such as 7 billion

play01:48

parameter models such as Mistral or Llama 2

play01:51

and tiny LLMs are usually something like

play01:54

TinyLlama the one that we're going to

play01:56

use in this video and others like Phi and Phi

play01:59

two which is on the let's say limits of

play02:03

what I would call a tiny LLM another

play02:06

important thing for tiny LLMs is that you

play02:09

can do much faster inference with those

play02:12

and uh the training itself can be a lot

play02:15

faster compared to what you might get

play02:17

with a relatively larger a and you can

play02:21

even use like older gpus in order to

play02:25

train those types of models and finally

play02:28

even though those models are tiny some

play02:31

of those are still trained with very

play02:33

high quality data such as Phi and Phi-2 and

play02:37

trained on a lot of tokens in the data

play02:39

set such as Tiny Lama which has uh more

play02:42

than 3 trillion tokens in the training

play02:45

data set why would you want to do some

play02:48

fine-tuning well first you can try to

play02:51

start with some prompt engineering and

play02:53

if that works for you and The Benchmark

play02:57

or the performance of your model is

play03:00

relatively good then try to stick with

play03:03

just prompt engineering but if you want

play03:05

to increase the performance of your

play03:06

model and if you have enough data in

play03:09

order to do that fine tuning is a very

play03:12

good approach in order to get much

play03:14

better performance of your tiny a and in

play03:17

the general case tiny LLMs are not as

play03:20

powerful as 7 billion parameter models

play03:22

plus like uh for example Llama 2 or Mistral or

play03:27

other models and not even close to CH

play03:30

GPT and GPT 4 and GPT 4 Turbo so in that

play03:34

case if you want to have some much

play03:36

smaller model that is performing

play03:38

relatively well on your benchmark on

play03:40

your tasks you would likely need to do

play03:43

some fine-tuning in order to provide

play03:46

much better performance for your tiny LLM

play03:49

another good thing about fine tuning is

play03:52

that you're going to reduce essentially

play03:54

the number of tokens that you need in

play03:57

order to pass into the input with with

play03:59

the prompt so you might just pass in

play04:02

your data and you might just want to

play04:04

think of a much smaller template that

play04:07

will be good for your prompts and you

play04:10

can essentially just use that instead of

play04:13

some larger prompts and this will make

play04:16

your inference time even faster of

play04:19

course you might want to have a data or

play04:23

might have data that is private to you

play04:25

or your company so when you're fine

play04:28

tuning your own models you don't have to

play04:31

expose the data to the outside world so

play04:33

this is another uh let's say positive of

play04:36

the fine tuning

play04:37

approach and how would you prepare your

play04:40

data as a general rule of thumb I would

play04:42

suggest more than a thousand examples

play04:45

dat of high quality so uh preferably you

play04:49

might want to have a humans that were

play04:52

looking through the data and they would

play04:54

essentially get a feel of where the data

play04:57

quality is and when you get get a good

play05:00

quality data your fine tuning L are

play05:03

going to be much much better compared to

play05:04

if you have some let's say Shady data

play05:08

points and you would have to think about

play05:11

what type of tasks you're solving in

play05:13

this video I'm going to show you that we

play05:16

are going to use the a for two different

play05:19

tasks which is very good uh in the past

play05:22

if you had to solve for multiple tasks

play05:25

essentially you have to train multiple

play05:26

models or have a single model that have

play05:29

multiple heads for each prediction in

play05:32

the era of the LLM uh we are going to just

play05:35

say that we want two outputs one will be

play05:38

the sentiment of uh news and then is

play05:41

going to be the subject of the news or

play05:43

cryptocurrency news this is the DAT set

play05:45

that we're going to use and you would

play05:47

have to have a look at how much tokens

play05:50

do you need in the input and the output

play05:52

and uh have a look at your model maximum

play05:55

context WID and choose whether or not

play05:57

you're going to be able to fit the

play06:00

inputs and outputs within the context

play06:02

window and then you would have to think

play06:05

of a template that is going to be

play06:07

essentially good in order to prepare

play06:10

your own data the data set that we're

play06:13

going to use is uh this Crypto News

play06:16

Plus that is available on Kaggle and it

play06:19

says that there are crypto news

play06:21

articles containing title text and

play06:24

sentiment analysis of course the

play06:26

sentiment analysis is going to be Essen

play06:29

probably predicted from some model so

play06:32

the labels might not be perfect but

play06:34

still this is a real world example of

play06:37

what you might have and uh here is the

play06:40

Crypton news data for year over a year

play06:44

21 to

play06:45

23 structured format including title

play06:48

text Source subject and sentiment

play06:50

analysis and this is the example of data

play06:53

that you get you have a class for the

play06:56

sentiment polarity and subjectivity and

play06:59

of course you have this subject and all

play07:01

of those are going to be accompanied

play07:04

with the text and a title from the news

play07:06

and this is just the first paragraph of

play07:09

the article and this is the title of the

play07:11

article I have a Google Colab notebook

play07:13

that I've have wed the Crypton news data

play07:17

and I essentially took the CSV or the

play07:20

original CSV file and created this

play07:23

stratified split between train

play07:25

validation and test sets and here is the

play07:28

data frame for the training headit

play07:31

training data frame and you'll see the

play07:33

split between the training the

play07:35

validation and tests examples we still

play07:38

have a lot of data and uh you'll see

play07:41

that I've got the subject here and I

play07:44

essentially split the sentiment within a

play07:46

couple of columns so this will be a bit

play07:48

easier to work with compared to what we

play07:50

had into the original U data set uh

play07:54

other than that I'm going to show you

play07:55

the splits between the train test and

play07:59

valid ations so you can see that the

play08:01

stratified sampling has worked wonders

play08:04

for us you see that the trend the

play08:06

validation and the test set for each uh

play08:09

subject which is Bitcoin altcoin

play08:11

blockchain ethereum nft and defi all of

play08:14

those are split um pretty much as the

play08:16

way that the training set has the

play08:19

frequency for those and you see that

play08:22

essentially we have a very large bias

play08:24

towards Bitcoin outcome and blockchain

play08:27

examples which is again something that

play08:29

you might not want in your data set but

play08:33

this is uh the real world in here you

play08:36

can of course use some techniques such

play08:38

as oversampling under sampling Etc in

play08:40

order to fight this but just for this

play08:42

fine tuning example I'm going to stick

play08:44

with the original

play08:45

distributions uh this is the subject

play08:48

that we're going to try to predict and

play08:50

then we have the sentiment again the

play08:52

distribution is uh essentially kept as

play08:56

in the way that the training set has

play08:58

this so again with the stratified

play09:00

sampling and you see that we have the

play09:02

positive neutral and negative sentiments

play09:05

and you might see again that we have

play09:07

somewhat of a skew data towards neutral

play09:10

and positive news while the negative

play09:12

news are much much less compared to the

play09:16

neutral and positive so keep that in

play09:18

mind as well and this is the

play09:20

subjectivity score something that we are

play09:22

not going to predict but I've shown this

play09:25

in order to get a few of this uh

play09:28

category this distribution so the first

play09:30

thing that I'm doing here with the data

play09:32

set in order to pre-pro is to

play09:34

essentially get the data set from pandas

play09:37

and I'm going to use the Hugging Face

play09:40

data sets Library I'm going to just

play09:42

create this dictionary with the train

play09:43

validation and test subsets and then I'm

play09:47

going to essentially W the tokenizer for

play09:50

the model that we're going to use in our

play09:52

case this is going to be the TinyLlama

play09:55

model and I'm going to get the latest

play09:58

model that is not a chat model and this

play10:00

was trained on 3 trillion parameter

play10:03

tokens and I'm going to set a padding

play10:06

token or pad token for the tokenizer and

play10:10

uh here you see that I'm getting the

play10:12

tokenizer for the model then I'm adding

play10:14

this special token for the pad token and

play10:17

then I'm setting a padding side to right

play10:19

and after wading the model itself I'm

play10:22

going to resize the token embeddings in

play10:24

order to get the new uh token embeddings

play10:27

count since I'm loading or adding this

play10:30

tokenizer and I'm expanding this to a p

play10:33

of multiple of eight and you'll see that

play10:35

we've added this token the padding token

play10:38

that is and you see that now the

play10:40

tokenizer has all the available tokens

play10:43

and this is the new token that we've

play10:44

added to the tokenizer so this is very

play10:47

important because if you don't have some

play10:50

padding or correct padding within the

play10:52

training sets your model is tending to

play10:55

essentially repeat the last couple of

play10:58

words or tokens that is going to

play11:00

generate so this really helps with the

play11:02

repetition of the model and then another

play11:05

thing right here is that if you're using

play11:08

a GPU that is capable of using flash

play11:11

attention to I would strongly suggest

play11:13

you that you turn on this one but since

play11:17

I'm using the T4 GPU which is available

play11:20

on the free tier of Google Colab I'm

play11:24

essentially commenting out this one so

play11:27

essentially this is how you're going to

play11:29

to what the model and the tokenizer

play11:32

itself next we are going to make sure

play11:35

that the number of tokens are going to

play11:37

be fitted right within the context

play11:39

window of our TinyLlama model which has 2048

play11:43

tokens of context width and in our example

play11:46

I'm going to create this format or

play11:49

template which is something that I've

play11:50

chose to use this is not something

play11:53

standard so I chose to set the title the

play11:56

text and then the prediction in this

play11:58

format for the article of or the news

play12:01

article and then you see that within the

play12:03

prediction I have this subject and then

play12:05

sentiment and in order to have a look at

play12:08

how many tokens we are going to need I'm

play12:10

essentially counting the number of

play12:12

tokens in each example after formatting

play12:15

it into uh using this template and you

play12:18

see that the number of tokens is much

play12:21

much more L compared to the maximum

play12:24

limit of 2048 so we are going to

play12:28

essentially need at most 200 tokens for

play12:31

the input so the problem with the

play12:34

context window should not be uh anything

play12:36

errow our examples are very tiny

play12:39

compared to what the TinyLlama model can

play12:41

handle while you can fine-tune a TinyLlama

play12:45

in its fullest still 1.1 billion

play12:48

parameter models are not small by any

play12:51

means even though the name is TinyLlama

play12:54

so if you have a single GPU for example

play12:56

a T4 that we're going to use within the

play12:58

the Google Colab notebook you might have a

play13:01

hard time fitting this model into the

play13:03

GPU and fine-tuning it in on its own so

play13:06

in our case I'm going to have a look at

play13:09

how you can use LoRA in order to fine

play13:12

tune the TinyLlama and this will allow us

play13:15

to even increase the batch size that we

play13:17

are going to use in order to train this

play13:19

model so one important thing to note is

play13:23

that with LoRA when you're

play13:26

training such models you are going to

play13:28

essentially train just a small model

play13:29

called adapter on top of the original

play13:32

model so you have to essentially W the

play13:34

original model within the memory and

play13:36

then create a smaller model or a set of

play13:39

or a matrix of parameters in order to

play13:41

find youe just those and even though

play13:44

when you're training models such as Llama

play13:47

7B you might just train roughly or even

play13:52

lower than 1% of the parameters if you

play13:55

do that with tiny L you're going to get

play13:58

like something like maybe 1 or 10

play14:00

million parameters in order to train

play14:03

your model so in the general case this

play14:06

wouldn't be enough of course this

play14:08

depends on the task at hand so as a

play14:11

general start I would recommend

play14:13

something like 100 million

play14:16

parameters which is a great start and

play14:18

you can tweak that in order to get

play14:20

something like this for the tiny wama

play14:23

we're going to increase the rank of the

play14:26

wama or sorry the water conf to about

play14:31

128 so this will give us roughly

play14:34

8.5% of the parameters for training of

play14:38

the original model and then I'm going to

play14:40

increase also the LoRA alpha in order to

play14:42

scale the learning rate and not change

play14:44

its value and again I'm going to set

play14:47

this number to

play14:48

128 to start with the training I'm going

play14:51

to set the pad token ID on the model and

play14:54

then on the model config pad token ID to

play14:56

the tokenizer Token IDs then I'm going

play14:59

to have a look at model config in order

play15:01

to double check that the padding token or pad

play15:04

token ID has been properly set which is

play15:07

and then we are going to have a look at

play15:09

the model architecture which is going to

play15:11

tell us where do we need to apply the

play15:14

LoRA scaling or the LoRA target modules so

play15:18

in this case you're going to see within

play15:20

my config right here that I'm targeting

play15:22

the self attention one and then the MLP

play15:25

ones so these are the linear layers and

play15:27

these are the self attention layers as

play15:29

you can see right here and I'm

play15:31

essentially targeting all of those and

play15:34

for the rank of the Matrix and shout out

play15:36

to Tris research YouTube channel from

play15:39

which I've seen that he's actually

play15:41

targeting tiny LLMs with much higher

play15:44

number of parameters so thank uh thanks

play15:48

to you I've seen that you can actually

play15:51

you need to actually increase the number

play15:53

of parameters or the rank of the LoRA

play15:56

matrix in order to fine-tune much better

play15:59

with tiny LLMs and here I'm going to set

play16:03

the rank of the matrix and the LoRA alpha

play16:05

in order to scale the learning rate

play16:07

within 128 bolt and I'm going to apply a

play16:11

small dropout to the LoRA uh so this is

play16:14

the new adapter model and then I'm going

play16:17

to say that this is a causal language

play16:19

modeling task from the task type right

play16:22

here and then I'm going to get the P

play16:24

model on top of the original TinyLlama

play16:26

model with the water config application

play16:29

right here and you see that we are

play16:31

actually targeting roughly um 100

play16:35

million or 101 million parameters for

play16:38

training

play16:40

8.4% on the training front with the

play16:43

LoRA so next I'm going to show you how

play16:47

you can train just on the completions

play16:50

and this is uh something that my

play16:51

colleague called wo have shown me thank

play16:54

you w for that so instead of training

play16:58

the the whole text or using the whole

play17:01

text for the training you essentially

play17:03

what you want to get is to use for

play17:06

example from this example uh you want to

play17:10

calculate the loss only on this so

play17:15

essentially I'm going to ignore this

play17:19

which is the changing part within the

play17:21

data set and to calculate the loss I'm going

play17:23

to essentially take only those tokens in

play17:25

order to have a look at how well the

play17:27

model is performing and this will

play17:29

drastically reduce the was that you have

play17:32

but keep that in mind that if you're

play17:34

training for a task such as are on right

play17:37

here for some completion just so some

play17:40

for some completions then this type of

play17:43

collator is doing a great job but if

play17:45

you're training for something like um

play17:48

assistant and chats Etc this might might

play17:51

not be a good use case of the data

play17:54

collator so keep that in

play17:56

mind and uh in our case I'm going to use

play18:00

the prediction as a template I'm going

play18:03

to encode this and then I'm going to

play18:05

pass in the template IDs or the response

play18:08

template IDs to the collator and then

play18:11

I'm going to pass in a tokenizer to that

play18:13

so essentially what we are going to do

play18:15

here is to um tokenize the template

play18:19

since without that uh discolor appears

play18:22

to be failing at least for me and I'm

play18:24

going to essentially get a single

play18:26

example and tokenize it in order to show

play18:29

you what the labels uh this collator is

play18:32

going to add so you'll see here that

play18:35

when I create this data collator and I get

play18:38

the next batch from it you see that now

play18:40

we have input IDs attention mask and

play18:44

then a new field called

play18:45

labels if you look through the batch

play18:48

labels you'll see that everything uh

play18:50

before the template essentially has been

play18:54

given an ID of minus 100 so this is

play18:57

essentially ignore these tokens and for

play19:00

the loss itself only these tokens are

play19:03

going to be used for the calculation of

play19:05

the loss since we get a bit of

play19:07

repetition with the subject and

play19:09

sentiment you can essentially prove this

play19:12

to be either better so essentially what

play19:15

you might want to get is to get rid of

play19:18

this and get rid of this and just um

play19:21

print those two lines this would

play19:23

probably be much better compared to what

play19:26

we have right now and your wor is going

play19:28

to be performing even better but yeah

play19:31

this is an exercise that if you want to

play19:33

do this and then for the training

play19:35

arguments I am going to essentially use

play19:38

a batch size of four but I'm going to

play19:41

multiply that by four in order to get an

play19:43

effective batch size of 16 using gradient

play19:46

accumulation so what this will do is

play19:49

going to be passing only four examples

play19:51

through the GPU but then uh the results

play19:54

are going to be

play19:55

accumulated within a four uh iterations

play19:59

of those four batches and then the

play20:01

accumulation or the gradient is going to

play20:03

be calculated on top of that this

play20:05

appears to be he to pink with the

play20:06

training and I've seen that during the

play20:09

training on this single GPU this gave me

play20:12

a much lower uh loss or sorry much lower losses

play20:17

so it appears to be helping uh then I'm

play20:20

going to be using a regular Adam with uh

play20:24

wdk fix from torch Optimizer we are not

play20:27

using any um

play20:30

quantized um any quantized Optimizer

play20:33

since we are going to be using uh fp16

play20:37

or floating Point 16 training for this

play20:40

one we don't need Q for those tiny uh

play20:43

language models this appears to be

play20:45

training very fast and it appears to be

play20:47

very stable with very good results so no

play20:50

quantization on this part right here and

play20:52

I'm going to essentially use a constant

play20:55

schedu type uh yeah this is is a bit

play20:58

redundant since we're not going to be

play21:00

using any warm up right here uh and then

play21:04

another important thing is that I'm

play21:06

going to train just for one Epoch of

play21:08

course you might want to train for

play21:10

multiple epochs that depends on the DAT

play21:14

set size that you have I've trained this

play21:15

for roughly 40 minutes I believe and if

play21:19

you train for longer you might actually

play21:21

get better results with those tiny l so

play21:24

uh it it might be worth to experiment

play21:26

with that and those are essentially the

play21:29

training arguments that we

play21:30

have then I'm going to get this format

play21:34

prompts which is going to be passing

play21:36

essentially a

play21:38

example and within this example I'm

play21:41

going to our examples and within that

play21:44

I'm going to essentially use the format

play21:46

of the template that we've seen thus far

play21:49

and this is going to essentially create

play21:51

our batch for us so this is the trainer

play21:53

that I'm going to use um I'm going to

play21:56

pass in the model the training arguments

play21:58

then the training and the validation uh

play22:01

sets a tokenizer Max sequence length

play22:04

which can be increased but in our case

play22:06

that's not needed then the formatting

play22:08

function which is this one and then the

play22:10

data calator which is going to be

play22:12

training only on the completions so uh

play22:15

this is essentially the output of the

play22:18

training uh and you see that the model

play22:20

is actually performing very well this is

play22:23

the evaluation was from the the tensor

play22:26

board training uh you see right here

play22:29

that we start with a relatively high

play22:31

value of

play22:33

0.15 then uh after 600 steps this is

play22:38

0.11 uh below

play22:40

0.10 and yeah you can see that uh in

play22:44

relatively let's see that again in about

play22:48

26 minutes of training we get this far

play22:51

below so this is really

play22:54

good uh and uh you can check the

play22:56

tutorial for the full outline of this

play22:59

but this is my training course without

play23:01

any smoothing and you see that is again

play23:04

generally decreasing uh you might U

play23:07

argue that we are going to hit a plateau

play23:09

right here but I would say that the

play23:12

training went really well and these are

play23:15

again the results from this one uh you

play23:18

see here in this table that we have the

play23:21

training loss and the validation loss

play23:23

and you can see that that they're fairly

play23:26

similar uh and the validation loss is

play23:28

actually a bit better in the later

play23:31

iterations which is surprising but um it

play23:35

is within the realm of what you might

play23:37

get since the training set is much

play23:39

larger compared to validation set so it

play23:42

might be just uh Randomness right here

play23:44

and then in order to get this model to

play23:47

be saved I'm going to use the trainer

play23:51

model save pre-trained and within the

play23:54

same folder I'm going to essentially get

play23:56

the tokenizer to save itself as well

play23:59

with the proper

play24:01

configuration so in order to try out our

play24:05

model I'm going to um essentially I've

play24:08

at this point I've restarted the Google

play24:11

Colab notebook and what I did here was to

play24:15

get the base model wed into for 16 and

play24:19

then apply the PEFT model on top of that

play24:21

this is again the same folder and train

play24:23

folder and then I'm going to essentially

play24:26

merge the PEFT model on top of the original

play24:28

model and this is going to get again the

play24:33

tokenizer which was correctly formatted

play24:36

you can see right here that we have a

play24:37

padding token and we have a correct

play24:39

padding site and a correct P token ID

play24:42

and after that I just again setting the

play24:45

P token ID and p uh config P token ID

play24:49

just in case and now we can use our

play24:52

function model as a regular Hing face

play24:54

Transformers model I'm I'm going to

play24:56

create a pipeline I it for text

play24:59

generation I'm going to pass in the

play25:00

model the tokenizer the maximum number

play25:03

of new tokens this is going to be only

play25:04

16 since uh we already know that our

play25:07

model is going to be producing a very

play25:09

small number of tokens for the

play25:13

completion and I'm going to essentially

play25:15

format the example for completion or for

play25:19

prediction I'm going to just take from

play25:21

the example the title the text and then

play25:23

I'm going to pass in the prediction

play25:24

without the prediction itself I'm going

play25:27

to to um reduce the verbosity of the

play25:30

Ling and then I'm going to have a look

play25:33

at 10 examples note here that this is

play25:36

the text so this is the complete text

play25:38

from the example and then I'm calling

play25:41

format for prediction right here with

play25:43

the example itself and I'm going to

play25:45

essentially output the prediction so um

play25:48

the original subject or sentiment is not

play25:51

passed into the

play25:53

model so this is the first example

play25:57

binance research report reviews Etc and

play26:00

the subject here is from the original

play26:03

data point is nft the sentiment is

play26:05

positive and this is now the prediction

play26:07

you see that we have a um duplicate of

play26:11

the sentiment line by the model this is

play26:14

relatively common and we're going to

play26:16

address that in a bit but the subject

play26:19

appears to be correct right here and the

play26:21

sentiment appears to be positive as well

play26:23

let's look at the another one subject

play26:25

altcoin sentiment positive outco

play26:28

positive again uh this is essentially

play26:31

what we have right here it is correct uh

play26:34

then subject etherium sentiment positive

play26:38

again those appear to be uh exactly

play26:41

correct altcoin positive but the

play26:44

prediction was negative let's have a

play26:46

look at the title coinbase coinbase coo

play26:49

calls for regulation of centralized

play26:51

crypto entities the demise of FTX has

play26:54

set back crypto by years and This

play26:56

Disaster is likely to steer Regulators

play26:59

Regulators into action so the sentiment

play27:02

is positive but I wouldn't exactly agree

play27:05

with this label right here uh you can

play27:06

decide on your own and I think that our

play27:09

model is actually predicting a better

play27:12

sentiment than the one in the

play27:14

labels something that is very

play27:16

interesting let's have a look at another

play27:19

one altcoin positive again uh

play27:24

correct now this subject here here is

play27:28

altcoin but the model is saying Bitcoin

play27:30

let's have a look at and there again

play27:33

positive sentiments for both so

play27:35

bitcoin's PR prediction as BTC breaks

play27:38

through Etc Bitcoin the world swes

play27:40

currency and the label is altcoin yeah

play27:44

our model is uh performing very well

play27:46

indeed so uh this looks to be the case

play27:51

that the labels are not exactly perfect

play27:53

but our model seems to be doing a good

play27:56

job even though the the data set is not

play27:58

of that high quality uh and yeah you can

play28:02

go through a lot of examples and see for

play28:04

yourself so next I'm going to do

play28:07

something a bit different I'm going to

play28:08

extract the prediction for the complete

play28:11

test set again this is 1, uh200 uh yeah

play28:18

1,24 242 examples this took about 10

play28:22

minutes and this these are the

play28:25

predictions uh the title the text true

play28:28

subject true sentiment predicted subject

play28:30

predicted sentiment this is essentially

play28:32

the data frame that we're going to get

play28:34

and I'm going to essentially calculate a

play28:36

very rough accuracy for the subject

play28:39

which is according to this calculation

play28:43

78.6% accuracy of course you might want

play28:46

to go through some examples and see for

play28:49

yourself if the model is actually better

play28:51

compared to the labels uh this is

play28:54

essentially a heat map or a confusion

play28:56

Matrix

play28:58

of

play28:59

different predictions for the subject

play29:01

and the real values uh you can see that

play29:03

we have some overlap between blockchain

play29:05

and altcoin right here but uh nothing

play29:09

really

play29:10

major and again for the TR subject and

play29:13

the predicted subject uh let's see let's

play29:16

get an example from right

play29:19

here AI optimizing crypto exchange

play29:22

functions artificial intelligence tools

play29:24

are providing so the TR subject is

play29:25

Bitcoin but the predicted subject is

play29:28

blockchain yeah at least from the first

play29:31

couple of words it appears that again

play29:33

our model is performing better than the

play29:35

labels but I might be wrong I I mean go

play29:39

over the title and the text for some

play29:42

examples on your own next for the

play29:44

sentiment uh we have exactly the same

play29:47

calculation and you see that this time

play29:50

we have a just a tiny bit over 90%

play29:54

accuracy on the test set which is really

play29:56

impressive if with such a small data set

play30:00

again this is the confusion

play30:02

Matrix uh yeah and again we are going to

play30:05

have a look at some

play30:07

examples bad news is good news Bitcoin

play30:09

plays with USD Bitcoin reaches its

play30:12

highest Target in nearly seven Etc and

play30:15

here the sentiment is positive while our

play30:19

prediction is neutral I would agree that

play30:21

the labels here is better compared to

play30:23

what we have in the model

play30:25

itself uh arst stroke promised me 100 in

play30:29

Bitcoin is it possible that coinbase CEO

play30:32

Etc neutral and our prediction is

play30:35

negative I'm not sure I have to see the

play30:38

title and the text for this one but yeah

play30:41

even if this is correct 90% is very good

play30:44

for such a small training so this is it

play30:47

for this video you now know how to fine-

play30:50

tune a tiny LLM on your own data set and

play30:53

you know how to set up correctly the LoRA

play30:56

configuration for for that and also you

play30:58

know how to save the model after

play31:02

training and then get the final model on

play31:05

top of the original model and do some

play31:08

inference with it in the next video I'm

play31:10

going to show you how you can use the

play31:12

adapted model and fuse it or merge it

play31:15

within the original model push that to a

play31:17

Hugging Face Hub repository and then from

play31:20

there we're going to deploy the model in

play31:22

production behind an API and we're going

play31:25

to start to get some inference on top of

play31:28

a real world example thanks for watching

play31:31

guys please like share and subscribe

play31:34

also join the Discord channel that I'm

play31:36

going to link down into the description

play31:37

below and I'll see you in the next one

play31:40

bye