Fine-tuning Tiny LLM on Your Data | Sentiment Analysis with TinyLlama and LoRA on a Single GPU

Venelin Valkov
29 Jan 2024 · 31:41

Summary

TLDR: In this tutorial video, Venelin shows how to fine-tune the TinyLlama language model on a custom cryptocurrency news dataset. He covers preparing the data, setting the correct parameters for the tokenizer and model, training the model efficiently with LoRA in a Google Colab notebook, evaluating model performance, and running inference with the fine-tuned model. The goal is to predict the subject and sentiment of new crypto articles. With only about 40 minutes of training, the fine-tuned TinyLlama model achieves promising results: around 79% subject accuracy and over 90% sentiment accuracy.

Takeaways

  • 📚 Venelin explains the process of fine-tuning the TinyLlama model on a custom dataset, beginning with dataset preparation and proceeding through training to evaluation.
  • 🔧 Key steps include setting up tokenizer and model parameters, using a Google Colab notebook, and evaluating the fine-tuned model on a test set.
  • 🌐 The tutorial includes a complete text guide and a Google Colab notebook link, available in the MLExpert bootcamp section for Pro subscribers.
  • 🤖 TinyLlama is preferred over larger 7B-parameter models because of its smaller size, faster inference and training, and suitability for older GPUs.
  • 📈 Fine-tuning is essential for improving model performance, especially when prompt engineering alone doesn't suffice, and for adapting the model to specific data or privacy needs.
  • 📊 For dataset preparation, a minimum of 1,000 high-quality examples is recommended, and consideration of task type and token count is crucial.
  • 🔍 The tutorial uses the 'Crypto News+' dataset from Kaggle, focusing on sentiment and subject classification of cryptocurrency news.
  • ⚙️ Venelin demonstrates using Hugging Face's datasets library and the tokenizer configuration, emphasizing the importance of a correct padding setup for avoiding repetition.
  • 🚀 The training process uses LoRA (Low-Rank Adaptation) to train a small adapter on top of the frozen base TinyLlama model.
  • 📝 Evaluation results show high accuracy in predicting subjects and sentiments from the news dataset, validating the effectiveness of the fine-tuning process.

Q & A

  • What model is used for fine-tuning in the video?

    -The TinyLlama model, a 1.1-billion-parameter model trained on about 3 trillion tokens.

  • What techniques can be used to improve model performance before fine-tuning?

    -Prompt engineering can be used before fine-tuning to try to improve model performance. This involves crafting the prompts fed into the model more carefully without changing the model itself.

  • How can LoRA be used during fine-tuning?

    -LoRA trains only a small set of weights, called an adapter, on top of a base model like TinyLlama. This reduces memory requirements during fine-tuning.

  • What data set is used for fine-tuning in the video?

    -A cryptocurrency news data set containing titles, text, sentiment analysis labels, and subjects for articles is used.

  • How can the data set be preprocessed?

    -The data can be split into train, validation, and test sets. The distributions of labels can be analyzed to check for imbalances. A template can be designed for formatting the inputs.

  • What accuracy is achieved on the test set?

    -An accuracy of 78.6% is achieved on subject prediction on the test set. An accuracy of 90% is achieved on sentiment analysis on the test set.

  • How can the fine-tuned model be deployed?

    -The adapter can be merged into the original TinyLlama model and pushed to the Hugging Face Hub. The merged model can then be deployed behind an API for inference in production.

  • What batch size is used during training?

    -A batch size of 4 is used with gradient accumulation over 4 iterations to simulate an effective batch size of 16.

  • How are only the model completions used to calculate loss?

    -A special collator sets the labels for all tokens before the completion template to -100 so they are ignored in the loss calculation (see the sketch after this Q&A list).

  • How can the model repetitions be reduced?

    -The repeated subject and sentiment lines could be removed from the completion template to improve quality.
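
Below is a minimal sketch of how the completion-only loss masking might be wired up with trl's DataCollatorForCompletionOnlyLM. The response marker string and model id are assumptions, not the exact values from the notebook.

```python
# Sketch: mask everything before the completion so only the prediction tokens
# contribute to the loss. The "prediction:" marker is an assumed template string.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
tokenizer.add_special_tokens({"pad_token": "<PAD>"})

response_template = "prediction:"  # assumed marker that precedes the completion
# Encoding the marker (without special tokens) and passing token ids avoids
# tokenizer edge cases when the collator searches for the marker in each example.
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)

collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
# For each batch the collator copies input_ids into labels and sets every token up to
# and including the marker to -100, so those positions are ignored by the loss.
```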

Outlines

00:00

🚀 Fine-tuning a Tiny Language Model on Custom Data

Venelin introduces a tutorial on fine-tuning a tiny language model (LLM) on a custom dataset, covering everything from data preparation through training to evaluation in a Google Colab notebook. He highlights the advantages of smaller models like TinyLlama over larger models for faster inference and training, and the significance of fine-tuning for improved performance on specific tasks. The tutorial promises a step-by-step guide for MLExpert Pro subscribers, emphasizing the need for high-quality data and the process of selecting and preparing the dataset for fine-tuning.

05:00

📊 Preparing and Understanding Your Dataset for Fine-tuning

This section delves into dataset preparation, focusing on selecting tasks and ensuring data quality. Venelin uses a cryptocurrency news dataset from Kaggle, detailing the process of creating training, validation, and test splits. He emphasizes the importance of stratified sampling to keep the label distribution representative across splits and discusses handling class imbalance. The dataset includes sentiment and subject labels for news articles, serving as the basis for training TinyLlama to predict news sentiment and subject accurately.
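
As a rough illustration of the stratified split described here (the file name, column names, and split sizes are assumptions, not the exact ones from the notebook):

```python
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

df = pd.read_csv("cryptonews.csv")  # hypothetical file name for the Kaggle CSV

# Carve out a test set first, then split the rest into train/validation,
# stratifying on the subject so every split keeps the original class frequencies.
train_val_df, test_df = train_test_split(df, test_size=0.1, stratify=df["subject"], random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.1, stratify=train_val_df["subject"], random_state=42)

print(train_df["subject"].value_counts(normalize=True))  # should match the full dataset

# Wrap the splits in a Hugging Face DatasetDict for training later on.
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df.reset_index(drop=True)),
    "validation": Dataset.from_pandas(val_df.reset_index(drop=True)),
    "test": Dataset.from_pandas(test_df.reset_index(drop=True)),
})
```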

10:01

🔧 Setting Up Tokenizer and Model Configuration

Venelin explains the setup for the tokenizer and model configuration, including adding a padding token and resizing the token embeddings of the TinyLlama model. He discusses the importance of correct padding to avoid repetition and the use of GPU capabilities such as flash attention for training. The section also covers how to fit data within the model's context window using a specific template, and the preparation steps for using the model with LoRA, highlighting the benefits of training a small adapter for efficiency.
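
A minimal sketch of that setup, assuming the non-chat 3T TinyLlama checkpoint; the added pad-token string is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens({"pad_token": "<PAD>"})  # TinyLlama ships without a pad token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    # attn_implementation="flash_attention_2",  # enable on GPUs that support it (not the T4)
)

# The vocabulary grew by one token, so grow the embedding table with it,
# padded to a multiple of 8 for efficiency.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
model.config.pad_token_id = tokenizer.pad_token_id
```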

15:02

⚙️ Applying LoRA and Training the Model

This part focuses on applying LoRA to fine-tune the TinyLlama model, targeting specific model layers for adaptation and discussing the configuration for efficient training. Venelin shares insights on optimizing training parameters, like batch size and learning rate, and introduces training on completions only to improve model performance. He provides a detailed walkthrough of setting up the training arguments and using a data collator that focuses the loss calculation on specific parts of the model output.
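
A sketch of a matching LoRA configuration with rank and alpha of 128 and the attention plus MLP projections as targets; the module names follow the Llama architecture that TinyLlama uses, and the dropout value is an assumption:

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=128,              # higher rank than usual so roughly 8% of parameters are trainable
    lora_alpha=128,     # alpha equal to the rank keeps the effective learning-rate scaling at 1
    lora_dropout=0.05,  # the "small dropout" mentioned in the video (exact value assumed)
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # self-attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # expect roughly 100M trainable parameters (~8.4%)
```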

20:03

📝 Training Insights and Evaluation Techniques

Venelin shares his training insights, noting the effectiveness of a smaller batch size with gradient accumulation for better training dynamics. He outlines the training process, including the optimizer choice and the rationale behind the training setup. The section also covers model evaluation strategies, demonstrating how to test the fine-tuned model's performance on the dataset and analyze results for both subject and sentiment prediction accuracy using confusion matrices and accuracy calculations.
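
The training setup might look roughly like the following sketch; the output folder, learning rate, and logging cadence are assumptions, and `dataset`, `peft_model`, `tokenizer`, and `collator` refer to the earlier sketches:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="tinyllama-crypto-news",   # hypothetical output folder
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # 4 x 4 = effective batch size of 16
    num_train_epochs=1,
    learning_rate=2e-4,                   # assumed; the video does not state the value
    fp16=True,                            # fp16 training, no quantized optimizer needed
    optim="adamw_torch",
    lr_scheduler_type="constant",
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    save_strategy="epoch",
    report_to="tensorboard",
)

def format_prompts(examples):
    # Batched formatting function: build the full training text for each row
    # using an assumed title/text/prediction template.
    return [
        f"title: {t}\ntext: {x}\nprediction:\nsubject: {s}\nsentiment: {m}"
        for t, x, s, m in zip(examples["title"], examples["text"],
                              examples["subject"], examples["sentiment"])
    ]

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    max_seq_length=512,                   # assumed; the formatted examples are far shorter
    formatting_func=format_prompts,
    data_collator=collator,               # the completion-only collator sketched earlier
)
trainer.train()

trainer.model.save_pretrained("tinyllama-crypto-news")
tokenizer.save_pretrained("tinyllama-crypto-news")
```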

25:05

🎯 Achieving Accurate Predictions and Model Deployment

The final section showcases the fine-tuned model's ability to accurately predict news subjects and sentiments, with examples demonstrating its performance. Venelin discusses the potential for discrepancies between model predictions and dataset labels, suggesting the model's predictions are sometimes more accurate than the labels. He concludes by outlining plans for deploying the model in production, emphasizing the significance of fine-tuning in achieving high accuracy and announcing an upcoming tutorial on model deployment and API integration.
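
Inference with the fine-tuned model could then look roughly like this sketch; the template helper and column names are illustrative, and `model`, `tokenizer`, and `test_df` refer to the earlier sketches:

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model=model,             # the fine-tuned (or merged) model loaded back in
    tokenizer=tokenizer,
    max_new_tokens=16,       # the subject and sentiment fit comfortably in 16 tokens
)

def format_for_prediction(example):
    # Same assumed template as training, but without the answer part.
    return f"title: {example['title']}\ntext: {example['text']}\nprediction:\n"

row = test_df.iloc[0]        # one held-out article from the test split
output = generator(format_for_prediction(row))
print(output[0]["generated_text"])
```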

Keywords

💡fine-tuning

Fine-tuning refers to the process of taking a pre-trained language model like TinyLlama and customizing it by training the model further on your own dataset. This improves the model's performance on the specific tasks and data that are relevant for your needs. The video discusses how to properly prepare a dataset and configure parameters before fine-tuning TinyLlama within a Google Colab notebook.

💡Tiny LLM

Tiny LLMs are a class of relatively small transformer-based language models with roughly 1-10 billion parameters. Compared to models with hundreds of billions of parameters, tiny LLMs enable faster inference and training. However, fine-tuning is often needed to boost their performance on specialized tasks. The video focuses specifically on fine-tuning the TinyLlama model on a cryptocurrency news dataset.

💡dataset preparation

Properly preparing the dataset is key before fine-tuning a model. The video recommends having over 1,000 high-quality, human-reviewed examples. It also advises thinking about the task types, the input/output format, and the context window when structuring the data.
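
For the token-count check against the context window, a sketch along these lines works; the template layout is an assumption, and `tokenizer` and `train_df` refer to the earlier sketches:

```python
def format_example(row):
    # Assumed layout: title and text as the prompt, subject and sentiment as the completion.
    return (
        f"title: {row['title']}\n"
        f"text: {row['text']}\n"
        "prediction:\n"
        f"subject: {row['subject']}\n"
        f"sentiment: {row['sentiment']}"
    )

token_counts = [
    len(tokenizer(format_example(row))["input_ids"])
    for _, row in train_df.iterrows()
]
print(max(token_counts))  # should stay well below TinyLlama's 2048-token context window
```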

💡tokenization

Tokenization refers to splitting text into tokens that serve as the input representation fed into language models. The video discusses adding a special padding token to the TinyLlama tokenizer so that batches can be padded to a uniform length during fine-tuning.

💡LoRA adapters

LoRA trains only a small percentage of TinyLlama's parameters during fine-tuning, which reduces memory requirements. This makes it possible to fit TinyLlama on a single GPU for training by restricting the trainable weights to the adapter layers.

💡sentiment analysis

One of the key tasks framed in the video is predicting the sentiment (positive/neutral/negative) of cryptocurrency news articles. Fine-tuning Tiny L to make better sentiment predictions is one of the end goals.

💡subject classification

Besides sentiment analysis, the other main task is classifying crypto news into subjects like Bitcoin, blockchain, DeFi, etc. Fine-tuning to boost TinyLlama's performance on subject classification is thus another goal.

💡deployment

The final step hinted at is deploying the fine-tuned TinyLlama behind an API for production inference. This could allow querying the model to analyze the sentiment, subject, etc. of new crypto news articles.
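
A sketch of the merge-and-push step hinted at here, with placeholder repository and folder names:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

ADAPTER_DIR = "tinyllama-crypto-news"  # folder where the adapter and tokenizer were saved
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

base = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", torch_dtype=torch.float16
)
# The base vocabulary must match the training-time vocabulary (pad token added).
base.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)

# Load the trained adapter on top of the base weights and fold it in.
merged_model = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

merged_model.push_to_hub("your-username/tinyllama-crypto-news")  # placeholder repo id
tokenizer.push_to_hub("your-username/tinyllama-crypto-news")
```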

💡model evaluation

Evaluating model performance after fine-tuning is important. The video illustrates how to check accuracy metrics on a held-out test set to quantify how much fine-tuning has improved TinyLlama for the desired tasks.
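
The accuracy check might be sketched like this, assuming a dataframe of test-set predictions (`pred_df` and its column names are illustrative):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# pred_df is an assumed dataframe with one row per test article.
subject_acc = accuracy_score(pred_df["true_subject"], pred_df["predicted_subject"])
sentiment_acc = accuracy_score(pred_df["true_sentiment"], pred_df["predicted_sentiment"])
print(f"subject accuracy: {subject_acc:.1%}, sentiment accuracy: {sentiment_acc:.1%}")

labels = sorted(pred_df["true_subject"].unique())
cm = confusion_matrix(pred_df["true_subject"], pred_df["predicted_subject"], labels=labels)
print(cm)  # rows are the true subjects, columns the predicted ones
```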

💡Google Colab

Google Colab provides free cloud GPUs for running Jupyter notebooks. The presenter explains how the entire TinyLlama fine-tuning pipeline can be executed in a Colab notebook using the free-tier GPU.

Highlights

Introduction to fine-tuning TinyLlama on custom datasets using the Google Colab free tier.

Advantages of choosing TinyLlama over larger language models for faster inference and training.

Importance of fine-tuning for improving model performance on specific tasks.

Guidance on dataset preparation and the need for high-quality examples.

Using TinyLlama for multiple tasks, showcasing versatility in application.

Crypto News+ dataset example to demonstrate fine-tuning on real-world data.

Detailed process of tokenizer and model preparation for training.

Utilizing LoRA (Low-Rank Adaptation) for efficient fine-tuning.

Strategies for managing GPU memory limitations during model training.

Fine-tuning model performance by adjusting the LoRA configuration parameters.

Introduction to training with completion collators for focused learning.

Techniques for achieving lower training loss and effective model evaluation.

Saving and reloading fine-tuned models for inference.

Demonstrating the fine-tuned model's accuracy in predicting news subjects and sentiments.

Future directions on deploying the fine-tuned model for production and API integration.

Transcripts

play00:00

hey everyone my name is Vin and in this

play00:02

video we're going to have a look at how

play00:03

you can fine-tune a tiny LLM on your own data

play00:07

set we're going to start with preparing

play00:09

the data set for training then we're

play00:12

going to have a look at what parameters

play00:13

you need to set in order to get your

play00:16

tokenizer and model prepared for

play00:18

training along with the LoRA setup then we're

play00:21

going to train the model within a Google

play00:23

Colab notebook free tier finally we are

play00:26

going to load the trained model and do

play00:29

evaluation on a test set to see whether

play00:31

or not the fine tuned model is doing a

play00:34

good job let's get started if you want

play00:37

to follow along there will be a complete

play00:39

text tutorial along with the link to a

play00:41

Google Colab notebook for this video and

play00:44

this will be available within the

play00:45

bootcamp section of ML expert. and then

play00:48

fine-tuning TinyLlama on a custom data set

play00:51

this is available for MLExpert Pro

play00:53

subscribers so if you want to support my

play00:56

work and get access to this please go

play00:59

and subscribe to mxer Pro thanks so what

play01:02

do you need in order to fine-tune a tiny

play01:05

LLM first we're going to go through why

play01:08

you might want to choose a tiny LLM

play01:10

over something like Llama 7 billion parameter

play01:14

models then we're going to have a look

play01:16

at why you would need to do some

play01:19

fine-tuning then we're going to have a

play01:21

look at some of the checkpoints that you

play01:23

need to cover in order to choose and

play01:26

prepare your data set and finally I'm

play01:29

going to give you some tips in order to

play01:32

fine-tune a tiny LLM using LoRA so why tiny

play01:37

LLMs first and most importantly those types

play01:40

of models are relatively small or

play01:43

smaller compared to regular watch

play01:45

language models such as 7 billion

play01:48

parameter models such as Mistral or Llama 2

play01:51

and tiny LLMs are usually something like

play01:54

TinyLlama the one that we're going to

play01:56

use in this video and others like Phi and Phi

play01:59

two which is on the let's say limits of

play02:03

what I would call a tiny LLM another

play02:06

important thing for tiny LLMs is that you

play02:09

can do much faster inference with those

play02:12

and uh the training itself can be a lot

play02:15

faster compared to what you might get

play02:17

with a relatively larger a and you can

play02:21

even use like older gpus in order to

play02:25

train those types of models and finally

play02:28

even though those models are tiny some

play02:31

of those are still trained with very

play02:33

high quality data such as Phi and Phi-2 and

play02:37

trained on a lot of tokens in the data

play02:39

set such as Tiny Lama which has uh more

play02:42

than 3 trillion tokens in the training

play02:45

data set why would you want to do some

play02:48

fine-tuning well first you can try to

play02:51

start with some prompt engineering and

play02:53

if that works for you and The Benchmark

play02:57

or the performance of your model is

play03:00

relatively good then try to stick with

play03:03

just prompt engineering but if you want

play03:05

to increase the performance of your

play03:06

model and if you have enough data in

play03:09

order to do that fine tuning is a very

play03:12

good approach in order to get much

play03:14

better performance of your tiny a and in

play03:17

the general case tiny LLMs are not as

play03:20

powerful as 7 billion parameter models

play03:22

plus like uh for example Llama 2 or Mistral or

play03:27

other models and not even close to CH

play03:30

GPT and GPT 4 and GPT 4 Turbo so in that

play03:34

case if you want to have some much

play03:36

smaller model that is performing

play03:38

relatively well on your benchmark on

play03:40

your tasks you would likely need to do

play03:43

some fine-tuning in order to provide

play03:46

much better performance for your tiny LLM

play03:49

another good thing about fine tuning is

play03:52

that you're going to reduce essentially

play03:54

the number of tokens that you need in

play03:57

order to pass into the input with with

play03:59

the prompt so you might just pass in

play04:02

your data and you might just want to

play04:04

think of a much smaller template that

play04:07

will be good for your prompts and you

play04:10

can essentially just use that instead of

play04:13

some larger prompts and this will make

play04:16

your inference time even faster of

play04:19

course you might want to have a data or

play04:23

might have data that is private to you

play04:25

or your company so when you're fine

play04:28

tuning your own models you don't have to

play04:31

expose the data to the outside world so

play04:33

this is another uh let's say positive of

play04:36

the fine tuning

play04:37

approach and how would you prepare your

play04:40

data as a general rule of thumb I would

play04:42

suggest more than a thousand examples

play04:45

dat of high quality so uh preferably you

play04:49

might want to have a humans that were

play04:52

looking through the data and they would

play04:54

essentially get a feel of where the data

play04:57

quality is and when you get get a good

play05:00

quality data your fine tuning L are

play05:03

going to be much much better compared to

play05:04

if you have some let's say Shady data

play05:08

points and you would have to think about

play05:11

what type of tasks you're solving in

play05:13

this video I'm going to show you that we

play05:16

are going to use the a for two different

play05:19

tasks which is very good uh in the past

play05:22

if you had to solve for multiple tasks

play05:25

essentially you have to train multiple

play05:26

models or have a single model that have

play05:29

multiple heads for each prediction in

play05:32

the era of the LLM uh we are going to just

play05:35

say that we want two outputs one will be

play05:38

the sentiment of uh news and then is

play05:41

going to be the subject of the news or

play05:43

cryptocurrency news this is the DAT set

play05:45

that we're going to use and you would

play05:47

have to have a look at how much tokens

play05:50

do you need in the input and the output

play05:52

and uh have a look at your model maximum

play05:55

context WID and choose whether or not

play05:57

you're going to be able to fit the

play06:00

inputs and outputs within the context

play06:02

window and then you would have to think

play06:05

of a template that is going to be

play06:07

essentially good in order to prepare

play06:10

your own data the data set that we're

play06:13

going to use is uh this Crypto News

play06:16

Plus that is available on Kaggle and it

play06:19

says that there are crypto news

play06:21

articles containing title text and

play06:24

sentiment analysis of course the

play06:26

sentiment analysis is going to be Essen

play06:29

probably predicted from some model so

play06:32

the labels might not be perfect but

play06:34

still this is a real world example of

play06:37

what you might have and uh here is the

play06:40

Crypton news data for year over a year

play06:44

21 to

play06:45

23 structured format including title

play06:48

text Source subject and sentiment

play06:50

analysis and this is the example of data

play06:53

that you get you have a class for the

play06:56

sentiment polarity and subjectivity and

play06:59

of course you have this subject and all

play07:01

of those are going to be accompanied

play07:04

with the text and a title from the news

play07:06

and this is just the first paragraph of

play07:09

the article and this is the title of the

play07:11

article I have a Google Colab notebook

play07:13

that I've have wed the Crypton news data

play07:17

and I essentially took the CSV or the

play07:20

original CSV file and created this

play07:23

stratified split between train

play07:25

validation and test sets and here is the

play07:28

data frame for the training headit

play07:31

training data frame and you'll see the

play07:33

split between the training the

play07:35

validation and tests examples we still

play07:38

have a lot of data and uh you'll see

play07:41

that I've got the subject here and I

play07:44

essentially split the sentiment within a

play07:46

couple of columns so this will be a bit

play07:48

easier to work with compared to what we

play07:50

had into the original U data set uh

play07:54

other than that I'm going to show you

play07:55

the splits between the train test and

play07:59

valid ations so you can see that the

play08:01

stratified sampling has worked wonders

play08:04

for us you see that the trend the

play08:06

validation and the test set for each uh

play08:09

subject which is Bitcoin altcoin

play08:11

blockchain ethereum nft and defi all of

play08:14

those are split um pretty much as the

play08:16

way that the training set has the

play08:19

frequency for those and you see that

play08:22

essentially we have a very large bias

play08:24

towards Bitcoin outcome and blockchain

play08:27

examples which is again something that

play08:29

you might not want in your data set but

play08:33

this is uh the real world in here you

play08:36

can of course use some techniques such

play08:38

as oversampling under sampling Etc in

play08:40

order to fight this but just for this

play08:42

fine tuning example I'm going to stick

play08:44

with the original

play08:45

distributions uh this is the subject

play08:48

that we're going to try to predict and

play08:50

then we have the sentiment again the

play08:52

distribution is uh essentially kept as

play08:56

in the way that the training set has

play08:58

this so again with the stratified

play09:00

sampling and you see that we have the

play09:02

positive neutral and negative sentiments

play09:05

and you might see again that we have

play09:07

somewhat of a skew data towards neutral

play09:10

and positive news while the negative

play09:12

news are much much less compared to the

play09:16

neutral and positive so keep that in

play09:18

mind as well and this is the

play09:20

subjectivity score something that we are

play09:22

not going to predict but I've shown this

play09:25

in order to get a few of this uh

play09:28

category this distribution so the first

play09:30

thing that I'm doing here with the data

play09:32

set in order to pre-pro is to

play09:34

essentially get the data set from pandas

play09:37

and I'm going to use the Hugging Face

play09:40

data sets Library I'm going to just

play09:42

create this dictionary with the train

play09:43

validation and test subsets and then I'm

play09:47

going to essentially W the tokenizer for

play09:50

the model that we're going to use in our

play09:52

case this is going to be the TinyLlama

play09:55

model and I'm going to get the latest

play09:58

model that is not a chat model and this

play10:00

was trained on 3 trillion parameter

play10:03

tokens and I'm going to set a padding

play10:06

token or pad token for the tokenizer and

play10:10

uh here you see that I'm getting the

play10:12

tokenizer for the model then I'm adding

play10:14

this special token for the pad token and

play10:17

then I'm setting a padding side to right

play10:19

and after wading the model itself I'm

play10:22

going to resize the token embeddings in

play10:24

order to get the new uh token embeddings

play10:27

count since I'm loading or adding this

play10:30

tokenizer and I'm expanding this to a p

play10:33

of multiple of eight and you'll see that

play10:35

we've added this token the padding token

play10:38

that is and you see that now the

play10:40

tokenizer has all the available tokens

play10:43

and this is the new token that we've

play10:44

added to the tokenizer so this is very

play10:47

important because if you don't have some

play10:50

padding or correct padding within the

play10:52

training sets your model is tending to

play10:55

essentially repeat the last couple of

play10:58

words or tokens that is going to

play11:00

generate so this really helps with the

play11:02

repetition of the model and then another

play11:05

thing right here is that if you're using

play11:08

a GPU that is capable of using flash

play11:11

attention to I would strongly suggest

play11:13

you that you turn on this one but since

play11:17

I'm using the T4 GPU which is available

play11:20

on the free tier of Google Colab I'm

play11:24

essentially commenting out this one so

play11:27

essentially this is how you're going to

play11:29

to what the model and the tokenizer

play11:32

itself next we are going to make sure

play11:35

that the number of tokens are going to

play11:37

be fitted right within the context

play11:39

window of our TinyLlama model which has 2048

play11:43

tokens of context width and in our example

play11:46

I'm going to create this format or

play11:49

template which is something that I've

play11:50

chose to use this is not something

play11:53

standard so I chose to set the title the

play11:56

text and then the prediction in this

play11:58

format for the article of or the news

play12:01

article and then you see that within the

play12:03

prediction I have this subject and then

play12:05

sentiment and in order to have a look at

play12:08

how many tokens we are going to need I'm

play12:10

essentially counting the number of

play12:12

tokens in each example after formatting

play12:15

it into uh using this template and you

play12:18

see that the number of tokens is much

play12:21

much more L compared to the maximum

play12:24

limit of 2048 so we are going to

play12:28

essentially need at most 200 tokens for

play12:31

the input so the problem with the

play12:34

context window should not be uh anything

play12:36

errow our examples are very tiny

play12:39

compared to what the TinyLlama model can

play12:41

handle while you can fine-tune a TinyLlama

play12:45

in its fullest still 1.1 billion

play12:48

parameter models are not small by any

play12:51

means even though the name is TinyLlama

play12:54

so if you have a single GPU for example

play12:56

a T4 that we're going to use within the

play12:58

the Google Colab notebook you might have a

play13:01

hard time fitting this model into the

play13:03

GPU and fine-tuning it in on its own so

play13:06

in our case I'm going to have a look at

play13:09

how you can use LoRA in order to fine

play13:12

tune the TinyLlama and this will allow us

play13:15

to even increase the batch size that we

play13:17

are going to use in order to train this

play13:19

model so one important thing to note is

play13:23

that with LoRA when you're

play13:26

training such models you are going to

play13:28

essentially train just a small model

play13:29

called adapter on top of the original

play13:32

model so you have to essentially W the

play13:34

original model within the memory and

play13:36

then create a smaller model or a set of

play13:39

or a matrix of parameters in order to

play13:41

find youe just those and even though

play13:44

when you're training models such as Llama

play13:47

7B you might just train roughly or even

play13:52

lower than 1% of the parameters if you

play13:55

do that with tiny L you're going to get

play13:58

like something like maybe 1 or 10

play14:00

million parameters in order to train

play14:03

your model so in the general case this

play14:06

wouldn't be enough of course this

play14:08

depends on the task at hand so as a

play14:11

general start I would recommend

play14:13

something like 100 million

play14:16

parameters which is a great start and

play14:18

you can tweak that in order to get

play14:20

something like this for the tiny wama

play14:23

we're going to increase the rank of the

play14:26

wama or sorry the water conf to about

play14:31

128 so this will give us roughly

play14:34

8.5% of the parameters for training of

play14:38

the original model and then I'm going to

play14:40

increase also the LoRA alpha in order to

play14:42

scale the learning rate and not change

play14:44

its value and again I'm going to set

play14:47

this number to

play14:48

128 to start with the training I'm going

play14:51

to set the pad token ID on the model and

play14:54

then on the model config pad token ID to

play14:56

the tokenizer Token IDs then I'm going

play14:59

to have a look at model config in order

play15:01

to double check that the padding token or pad

play15:04

token ID has been properly set which is

play15:07

and then we are going to have a look at

play15:09

the model architecture which is going to

play15:11

tell us where do we need to apply the

play15:14

LoRA scaling or the LoRA target modules so

play15:18

in this case you're going to see within

play15:20

my config right here that I'm targeting

play15:22

the self attention one and then the MLP

play15:25

ones so these are the linear layers and

play15:27

these are the self attention layers as

play15:29

you can see right here and I'm

play15:31

essentially targeting all of those and

play15:34

for the rank of the Matrix and shout out

play15:36

to Tris research YouTube channel from

play15:39

which I've seen that he's actually

play15:41

targeting tiny LLMs with much higher

play15:44

number of parameters so thank uh thanks

play15:48

to you I've seen that you can actually

play15:51

you need to actually increase the number

play15:53

of parameters or the rank of the LoRA

play15:56

matrix in order to fine-tune much better

play15:59

with tiny LLMs and here I'm going to set

play16:03

the rank of the matrix and the LoRA alpha

play16:05

in order to scale the learning rate

play16:07

within 128 bolt and I'm going to apply a

play16:11

small dropout to the LoRA uh so this is

play16:14

the new adapter model and then I'm going

play16:17

to say that this is a causal language

play16:19

modeling task from the task type right

play16:22

here and then I'm going to get the P

play16:24

model on top of the original TinyLlama

play16:26

model with the water config application

play16:29

right here and you see that we are

play16:31

actually targeting roughly um 100

play16:35

million or 101 million parameters for

play16:38

training

play16:40

8.4% on the training front with the

play16:43

LoRA so next I'm going to show you how

play16:47

you can train just on the completions

play16:50

and this is uh something that my

play16:51

colleague called wo have shown me thank

play16:54

you w for that so instead of training

play16:58

the the whole text or using the whole

play17:01

text for the training you essentially

play17:03

what you want to get is to use for

play17:06

example from this example uh you want to

play17:10

calculate the loss only on this so

play17:15

essentially I'm going to ignore this

play17:19

which is the changing part within the

play17:21

data set and to calculate the loss I'm going

play17:23

to essentially take only those tokens in

play17:25

order to have a look at how well the

play17:27

model is performing and this will

play17:29

drastically reduce the was that you have

play17:32

but keep that in mind that if you're

play17:34

training for a task such as are on right

play17:37

here for some completion just so some

play17:40

for some completions then this type of

play17:43

collator is doing a great job but if

play17:45

you're training for something like um

play17:48

assistant and chats Etc this might might

play17:51

not be a good use case of the data

play17:54

collator so keep that in

play17:56

mind and uh in our case I'm going to use

play18:00

the prediction as a template I'm going

play18:03

to encode this and then I'm going to

play18:05

pass in the template IDs or the response

play18:08

template IDs to the collator and then

play18:11

I'm going to pass in a tokenizer to that

play18:13

so essentially what we are going to do

play18:15

here is to um tokenize the template

play18:19

since without that uh discolor appears

play18:22

to be failing at least for me and I'm

play18:24

going to essentially get a single

play18:26

example and tokenize it in order to show

play18:29

you what the labels uh this collator is

play18:32

going to add so you'll see here that

play18:35

when I create this data collator and I get

play18:38

the next batch from it you see that now

play18:40

we have input IDs attention mask and

play18:44

then a new field called

play18:45

labels if you look through the batch

play18:48

labels you'll see that everything uh

play18:50

before the template essentially has been

play18:54

given an ID of minus 100 so this is

play18:57

essentially ignore these tokens and for

play19:00

the loss itself only these tokens are

play19:03

going to be used for the calculation of

play19:05

the loss since we get a bit of

play19:07

repetition with the subject and

play19:09

sentiment you can essentially prove this

play19:12

to be either better so essentially what

play19:15

you might want to get is to get rid of

play19:18

this and get rid of this and just um

play19:21

print those two lines this would

play19:23

probably be much better compared to what

play19:26

we have right now and your wor is going

play19:28

to be performing even better but yeah

play19:31

this is an exercise that if you want to

play19:33

do this and then for the training

play19:35

arguments I am going to essentially use

play19:38

a batch size of four but I'm going to

play19:41

multiply that by four in order to get an

play19:43

effective batch size of 16 using gradient

play19:46

accumulation so what this will do is

play19:49

going to be passing only four examples

play19:51

through the GPU but then uh the results

play19:54

are going to be

play19:55

accumulated within a four uh iterations

play19:59

of those four batches and then the

play20:01

accumulation or the gradient is going to

play20:03

be calculated on top of that this

play20:05

appears to be he to pink with the

play20:06

training and I've seen that during the

play20:09

training on this single GPU this gave me

play20:12

a much lower uh loss or sorry much lower losses

play20:17

so it appears to be helping uh then I'm

play20:20

going to be using a regular Adam with uh

play20:24

wdk fix from torch Optimizer we are not

play20:27

using any um

play20:30

quantized um any quantized Optimizer

play20:33

since we are going to be using uh fp16

play20:37

or floating Point 16 training for this

play20:40

one we don't need Q for those tiny uh

play20:43

language models this appears to be

play20:45

training very fast and it appears to be

play20:47

very stable with very good results so no

play20:50

quantization on this part right here and

play20:52

I'm going to essentially use a constant

play20:55

schedu type uh yeah this is is a bit

play20:58

redundant since we're not going to be

play21:00

using any warm up right here uh and then

play21:04

another important thing is that I'm

play21:06

going to train just for one Epoch of

play21:08

course you might want to train for

play21:10

multiple epochs that depends on the DAT

play21:14

set size that you have I've trained this

play21:15

for roughly 40 minutes I believe and if

play21:19

you train for longer you might actually

play21:21

get better results with those tiny l so

play21:24

uh it it might be worth to experiment

play21:26

with that and those are essentially the

play21:29

training arguments that we

play21:30

have then I'm going to get this format

play21:34

prompts which is going to be passing

play21:36

essentially a

play21:38

example and within this example I'm

play21:41

going to our examples and within that

play21:44

I'm going to essentially use the format

play21:46

of the template that we've seen thus far

play21:49

and this is going to essentially create

play21:51

our batch for us so this is the trainer

play21:53

that I'm going to use um I'm going to

play21:56

pass in the model the training arguments

play21:58

then the training and the validation uh

play22:01

sets a tokenizer Max sequence length

play22:04

which can be increased but in our case

play22:06

that's not needed then the formatting

play22:08

function which is this one and then the

play22:10

data calator which is going to be

play22:12

training only on the completions so uh

play22:15

this is essentially the output of the

play22:18

training uh and you see that the model

play22:20

is actually performing very well this is

play22:23

the evaluation was from the the tensor

play22:26

board training uh you see right here

play22:29

that we start with a relatively high

play22:31

value of

play22:33

0.15 then uh after 600 steps this is

play22:38

0.11 uh below

play22:40

0.10 and yeah you can see that uh in

play22:44

relatively let's see that again in about

play22:48

26 minutes of training we get this far

play22:51

below so this is really

play22:54

good uh and uh you can check the

play22:56

tutorial for the full outline of this

play22:59

but this is my training course without

play23:01

any smoothing and you see that is again

play23:04

generally decreasing uh you might U

play23:07

argue that we are going to hit a plateau

play23:09

right here but I would say that the

play23:12

training went really well and these are

play23:15

again the results from this one uh you

play23:18

see here in this table that we have the

play23:21

training loss and the validation loss

play23:23

and you can see that that they're fairly

play23:26

similar uh and the validation loss is

play23:28

actually a bit better in the later

play23:31

iterations which is surprising but um it

play23:35

is within the realm of what you might

play23:37

get since the training set is much

play23:39

larger compared to validation set so it

play23:42

might be just uh Randomness right here

play23:44

and then in order to get this model to

play23:47

be saved I'm going to use the trainer

play23:51

model save pre-trained and within the

play23:54

same folder I'm going to essentially get

play23:56

the tokenizer to save itself as well

play23:59

with the proper

play24:01

configuration so in order to try out our

play24:05

model I'm going to um essentially I've

play24:08

at this point I've restarted the Google

play24:11

Colab notebook and what I did here was to

play24:15

get the base model wed into for 16 and

play24:19

then apply the PEFT model on top of that

play24:21

this is again the same folder and train

play24:23

folder and then I'm going to essentially

play24:26

merge the PEFT model on top of the original

play24:28

model and this is going to get again the

play24:33

tokenizer which was correctly formatted

play24:36

you can see right here that we have a

play24:37

padding token and we have a correct

play24:39

padding site and a correct P token ID

play24:42

and after that I just again setting the

play24:45

P token ID and p uh config P token ID

play24:49

just in case and now we can use our

play24:52

function model as a regular Hing face

play24:54

Transformers model I'm I'm going to

play24:56

create a pipeline I it for text

play24:59

generation I'm going to pass in the

play25:00

model the tokenizer the maximum number

play25:03

of new tokens this is going to be only

play25:04

16 since uh we already know that our

play25:07

model is going to be producing a very

play25:09

small number of tokens for the

play25:13

completion and I'm going to essentially

play25:15

format the example for completion or for

play25:19

prediction I'm going to just take from

play25:21

the example the title the text and then

play25:23

I'm going to pass in the prediction

play25:24

without the prediction itself I'm going

play25:27

to to um reduce the verbosity of the

play25:30

Ling and then I'm going to have a look

play25:33

at 10 examples note here that this is

play25:36

the text so this is the complete text

play25:38

from the example and then I'm calling

play25:41

format for prediction right here with

play25:43

the example itself and I'm going to

play25:45

essentially output the prediction so um

play25:48

the original subject or sentiment is not

play25:51

passed into the

play25:53

model so this is the first example

play25:57

binance research report reviews Etc and

play26:00

the subject here is from the original

play26:03

data point is nft the sentiment is

play26:05

positive and this is now the prediction

play26:07

you see that we have a um duplicate of

play26:11

the sentiment line by the model this is

play26:14

relatively common and we're going to

play26:16

address that in a bit but the subject

play26:19

appears to be correct right here and the

play26:21

sentiment appears to be positive as well

play26:23

let's look at the another one subject

play26:25

altcoin sentiment positive outco

play26:28

positive again uh this is essentially

play26:31

what we have right here it is correct uh

play26:34

then subject etherium sentiment positive

play26:38

again those appear to be uh exactly

play26:41

correct altcoin positive but the

play26:44

prediction was negative let's have a

play26:46

look at the title coinbase coinbase coo

play26:49

calls for regulation of centralized

play26:51

crypto entities the demise of FTX has

play26:54

set back crypto by years and This

play26:56

Disaster is likely to steer Regulators

play26:59

Regulators into action so the sentiment

play27:02

is positive but I wouldn't exactly agree

play27:05

with this label right here uh you can

play27:06

decide on your own and I think that our

play27:09

model is actually predicting a better

play27:12

sentiment than the one in the

play27:14

labels something that is very

play27:16

interesting let's have a look at another

play27:19

one altcoin positive again uh

play27:24

correct now this subject here here is

play27:28

altcoin but the model is saying Bitcoin

play27:30

let's have a look at and there again

play27:33

positive sentiments for both so

play27:35

bitcoin's PR prediction as BTC breaks

play27:38

through Etc Bitcoin the world swes

play27:40

currency and the label is altcoin yeah

play27:44

our model is uh performing very well

play27:46

indeed so uh this looks to be the case

play27:51

that the labels are not exactly perfect

play27:53

but our model seems to be doing a good

play27:56

job even though the the data set is not

play27:58

of that high quality uh and yeah you can

play28:02

go through a lot of examples and see for

play28:04

yourself so next I'm going to do

play28:07

something a bit different I'm going to

play28:08

extract the prediction for the complete

play28:11

test set again this is 1, uh200 uh yeah

play28:18

1,24 242 examples this took about 10

play28:22

minutes and this these are the

play28:25

predictions uh the title the text true

play28:28

subject true sentiment predicted subject

play28:30

predicted sentiment this is essentially

play28:32

the data frame that we're going to get

play28:34

and I'm going to essentially calculate a

play28:36

very rough accuracy for the subject

play28:39

which is according to this calculation

play28:43

78.6% accuracy of course you might want

play28:46

to go through some examples and see for

play28:49

yourself if the model is actually better

play28:51

compared to the labels uh this is

play28:54

essentially a heat map or a confusion

play28:56

Matrix

play28:58

of

play28:59

different predictions for the subject

play29:01

and the real values uh you can see that

play29:03

we have some overlap between blockchain

play29:05

and altcoin right here but uh nothing

play29:09

really

play29:10

major and again for the TR subject and

play29:13

the predicted subject uh let's see let's

play29:16

get an example from right

play29:19

here AI optimizing crypto exchange

play29:22

functions artificial intelligence tools

play29:24

are providing so the TR subject is

play29:25

Bitcoin but the predicted subject is

play29:28

blockchain yeah at least from the first

play29:31

couple of words it appears that again

play29:33

our model is performing better than the

play29:35

labels but I might be wrong I I mean go

play29:39

over the title and the text for some

play29:42

examples on your own next for the

play29:44

sentiment uh we have exactly the same

play29:47

calculation and you see that this time

play29:50

we have a just a tiny bit over 90%

play29:54

accuracy on the test set which is really

play29:56

impressive if with such a small data set

play30:00

again this is the confusion

play30:02

Matrix uh yeah and again we are going to

play30:05

have a look at some

play30:07

examples bad news is good news Bitcoin

play30:09

plays with USD Bitcoin reaches its

play30:12

highest Target in nearly seven Etc and

play30:15

here the sentiment is positive while our

play30:19

prediction is neutral I would agree that

play30:21

the labels here is better compared to

play30:23

what we have in the model

play30:25

itself uh arst stroke promised me 100 in

play30:29

Bitcoin is it possible that coinbase CEO

play30:32

Etc neutral and our prediction is

play30:35

negative I'm not sure I have to see the

play30:38

title and the text for this one but yeah

play30:41

even if this is correct 90% is very good

play30:44

for such a small training so this is it

play30:47

for this video you now know how to fine-

play30:50

tune a tiny LLM on your own data set and

play30:53

you know how to set up correctly the LoRA

play30:56

configuration for for that and also you

play30:58

know how to save the model after

play31:02

training and then get the final model on

play31:05

top of the original model and do some

play31:08

inference with it in the next video I'm

play31:10

going to show you how you can use the

play31:12

adapted model and fuse it or merge it

play31:15

within the original model push that to a

play31:17

Hugging Face Hub repository and then from

play31:20

there we're going to deploy the model in

play31:22

production behind an API and we're going

play31:25

to start to get some inference on top of

play31:28

a real world example thanks for watching

play31:31

guys please like share and subscribe

play31:34

also join the Discord channel that I'm

play31:36

going to link down into the description

play31:37

below and I'll see you in the next one

play31:40

bye