Hands-On Hugging Face Tutorial | Transformers, AI Pipeline, Fine Tuning LLM, GPT, Sentiment Analysis
Summary
TL;DR: This video script offers a comprehensive guide to using the Hugging Face 'transformers' library for various NLP tasks. It demonstrates sentiment analysis with different models, highlighting nuances and limitations. The script also covers text generation, question answering, and the importance of tokenization. It introduces fine-tuning models on the IMDB dataset and showcases Hugging Face Spaces for deploying AI apps. The project concludes with using the arXiv API for paper summarization, suggesting the potential for building summarization apps.
Takeaways
- đŠ Installing the 'transformers' library is the first step to start using Hugging Face's NLP tools.
- đ§ After installation, you can import the 'pipeline' for performing various NLP tasks, such as sentiment analysis.
- đ The sentiment analysis pipeline can be used without specifying a model, defaulting to 'DistilBERT' for classification.
- đ The sentiment analysis results include a label (e.g., 'negative') and a confidence score, indicating the model's certainty.
- đ€ Different models can yield different results, highlighting the importance of model selection for nuanced understanding.
- đ Batch processing of sentences can provide a more comprehensive sentiment analysis, as demonstrated with varied results.
- đ§ Emotion detection can be incorporated into sentiment analysis, offering more depth by identifying specific emotions like 'admiration' or 'anger'.
- đ Text generation is another task facilitated by the 'pipeline', where models can create new text based on a given prompt.
- đ€ Question answering is facilitated by the pipeline, where a model can extract answers from provided context with a certain confidence score.
- đ Tokenization is a crucial preprocessing step that converts text into manageable pieces for models, often represented as IDs.
- đ Fine-tuning models on specific datasets, like the IMDB dataset, allows for customization to particular tasks or domains.
- đ ïž Hugging Face 'Spaces' is a platform for deploying and exploring AI applications, offering a community-driven approach to AI development.
Q & A
What is the first step in using the 'transformers' library for NLP tasks?
-The first step is to install the 'transformers' library using pip and then import the pipeline functionality for different NLP tasks.
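The steps above can be sketched as follows (a minimal sketch; the default checkpoint the pipeline falls back to is a DistilBERT fine-tuned on SST-2, which may change between library versions):

```python
# Install first: pip install transformers torch
from transformers import pipeline

# With no model argument, the pipeline falls back to a default
# sentiment-analysis checkpoint and prints a warning saying so.
classifier = pipeline("sentiment-analysis")

result = classifier("I wasn't happy with the last Mission Impossible movie.")
print(result)  # e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```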
What does the default model used for sentiment analysis in the pipeline return when no model is explicitly provided?
-When no model is provided, the pipeline uses the default model, DistilBERT, which returns the sentiment analysis result. For example, it might return 'negative' with a confidence score of 99% for a specific sentence.
Why might the sentiment analysis results not fully capture the nuances of a sentence?
-The default sentiment analysis model may not be nuanced enough to understand complex sentiments or mixed emotions, leading to results that might not accurately represent the sentiment of the text.
How can you enhance the sentiment analysis model to capture more emotions?
-You can enhance sentiment analysis by choosing a model that includes emotions, such as a model from Hugging Face that can detect sentiments like admiration, confusion, amusement, and anger.
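As a sketch of this idea, the snippet below swaps in a GoEmotions-tuned checkpoint from the Hub, which has labels like admiration, amusement, confusion, and anger. The video does not name its exact model, so this particular checkpoint is an assumption:

```python
from transformers import pipeline

# An emotion-labelled checkpoint from the Hub (an example choice,
# not necessarily the model used in the video).
emotion = pipeline("text-classification",
                   model="SamLowe/roberta-base-go_emotions")

res = emotion("I hate long meetings")
print(res)  # the top label for this sentence is typically 'anger'
```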
How does the pipeline handle text generation tasks?
-For text generation, you can use the pipeline by selecting a suitable model from Hugging Face, then providing a prompt. The pipeline will generate a sequence of text based on that prompt.
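A minimal text-generation sketch, using GPT-2 as one common starting checkpoint (the video's exact model choice is not named):

```python
from transformers import pipeline

# GPT-2 is a small, widely used text-generation checkpoint.
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "Today is a rainy day in London",
    max_length=30,           # cap the length of each generated sequence
    truncation=True,         # truncate the prompt if it is too long
    num_return_sequences=2,  # generate two candidate continuations
)
for out in outputs:
    print(out["generated_text"])
```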
How can you perform question answering using the 'transformers' pipeline?
-You can use the question answering pipeline by providing a question and a context. The model will return an answer with a confidence score based on the provided context.
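The question-answering flow above can be sketched as (again using the pipeline's default extractive checkpoint):

```python
from transformers import pipeline

# With no model given, the QA pipeline falls back to a default
# extractive question-answering checkpoint.
qa = pipeline("question-answering")

context = "I'm developing AI models with Python."
result = qa(question="What is my job?", context=context)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```

Because the model is extractive, the answer is always a span of the provided context, located by the `start` and `end` character offsets.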
What is the purpose of tokenization in NLP models?
-Tokenization is used to break down text into smaller components, such as words or characters, and convert them into IDs that the model can understand. It helps to process the text efficiently and uniformly.
Why is padding necessary when tokenizing text?
-Padding is necessary to ensure that all sentences have the same length, which is important when feeding the input to a model. Padding helps the model handle sentences of varying lengths effectively.
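Tokenization with padding can be sketched like this, using the `bert-base-uncased` tokenizer mentioned later in the video:

```python
from transformers import AutoTokenizer

# AutoTokenizer picks the right tokenizer class for the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = ["I was not so happy with the Barbie movie.", "Great film!"]
encoded = tokenizer(sentences, padding=True, truncation=True)

# All sequences are padded to the same length; the attention mask
# marks real tokens with 1 and padding with 0.
print(encoded["input_ids"])
print(encoded["attention_mask"])
```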
What dataset is used for fine-tuning a sentiment analysis model in the example?
-The IMDB dataset, which contains movie reviews, is used for fine-tuning the sentiment analysis model.
How can you deploy models or AI apps on Hugging Face Spaces?
-You can deploy models or AI apps on Hugging Face Spaces, which is a platform similar to GitHub but designed for AI projects. It allows the community to share and explore AI apps.
Outlines
đ Introduction to NLP Pipelines and Sentiment Analysis
This paragraph introduces the use of the 'transformers' library for natural language processing tasks, starting with sentiment analysis. It explains the process of installing the library, importing the pipeline, and using it without specifying a model, which defaults to DistilBERT. The paragraph discusses the results of sentiment analysis on different sentences, highlighting the model's limitations in understanding nuances. It also touches on the idea of using different models like Facebook BART for more neutral results and batch processing for better insights. The speaker then explores the use of emotion detection models available on Hugging Face for a more nuanced sentiment analysis.
đ€ Exploring Tokenization and Fine-Tuning Models
The second paragraph delves into the importance of tokenization in processing text for machine learning models. It explains how tokens are converted into IDs and the role of token type IDs and attention masks in handling multiple sentences. The speaker then discusses fine-tuning models using the IMDB dataset as an example, outlining the steps for preprocessing data, setting up training arguments, and initializing the model for training. The paragraph also introduces Hugging Face Spaces as a platform for deploying and exploring AI apps, suggesting the potential for community-driven innovation in AI.
đ Advanced NLP Tasks and Projects with Hugging Face
The final paragraph covers advanced NLP tasks such as text generation and question answering, demonstrating the process of selecting appropriate models and pipelines for these tasks. It also revisits the concept of tokenization, emphasizing the importance of choosing the right tokenizer for the model. The paragraph concludes with a project idea that involves using the arXiv API to access and summarize academic papers, suggesting the broad applicability of NLP techniques in various domains. Additionally, it briefly mentions a separate project on time series data, hinting at the versatility of machine learning approaches.
Keywords
đĄTransformers
đĄPipeline
đĄSentiment Analysis
đĄModel
đĄFine-tuning
đĄTokenizer
đĄTokenization
đĄHugging Face
đĄDataset
đĄFine-tune Model
đĄAttention Mask
đĄEmotion
Highlights
Installing transformers and using pipeline for NLP tasks like sentiment analysis.
Using the default DistilBERT model for sentiment analysis without specifying a model.
Example of sentiment analysis on movie review text.
Issues with nuance understanding in sentiment analysis results.
Using the Facebook BART model for more neutral sentiment results.
Batch processing of sentences for more consistent sentiment analysis.
Incorporating emotions into sentiment analysis using a different model.
Different models have varying levels of nuance for sentiment analysis.
Text generation using transformers and pipelines.
Question answering pipeline and example usage.
Importance of tokenization in processing text for models.
Explanation of tokenizer components like input IDs, token type IDs, and attention mask.
Fine-tuning models on the IMDB dataset using the Hugging Face Datasets library.
Preprocessing data for fine-tuning, including tokenization and padding.
Setting up training arguments for fine-tuning models.
Initializing and training the model using the Trainer API.
Evaluating and saving the fine-tuned model and tokenizer.
Exploring Hugging Face Spaces for AI apps and community projects.
Project idea using arXiv API to fetch and summarize research papers.
Using the arXiv library to search and retrieve research paper data.
Building an app for summarizing research paper abstracts.
Project on time series analysis using LSTM autoencoders and convolutional neural networks.
Transcripts
We pip install transformers and run it, and you're done.
After that you can go ahead and import pipeline. With pipeline you can do different NLP tasks,
the first task being sentiment analysis.
So once you've got transformers and you've imported the pipeline, you can start to build a pipeline for sentiment analysis. You write the text, apply the classifier in the pipeline, and then pass it any kind of text for sentiment analysis. If you say "I wasn't happy with the last Mission Impossible movie", we want to see what kind of results we get.
The first thing that you may notice is that I didn't give any model, so it will say "no model was supplied". Let me see.
This is the default model: the DistilBERT base uncased fine-tuned model.
And it gives me the result that the label is negative, with a score of ninety-nine percent.
So the sentiment for this sentence was negative. But you can also write this a little bit shorter in the code: you can say pipeline, give it the task, which is sentiment analysis, and just open the bracket and write something else.
And if I run "actually, uh, I was confused with the Barbie movie", you can see that it says negative.
It's not really understanding the nuance. So I wrote: "Every day lots of papers are published about LLM evaluation. Lots of them look very promising. I'm not sure that we can actually evaluate LLMs."
And if I run this, it just comes back with a positive score, which is not representative of what I'm saying.
So I was thinking, how about using another model? So I just used the Facebook BART model. As you can see, it comes back with neutral and the score is about seventy-seven percent. But this does not give me the result that I want.
So after that I was thinking, how about doing it in a batch way? Just separate each of the sentences and give them as a list to the classifier.
And when I was running it, it starts to give positive, negative, negative, negative, negative, so I got a little bit more of the vibe of this. Then I was thinking, I need a little bit of emotions.
And if I go ahead and click on models on Hugging Face and say I want emotion, then I get a few choices. Imagine that I like this one: I can copy this model, get back to my notebook, and use that one for sentiment analysis, because this one has emotions.
When I'm talking about things like "I really like autoencoders, the best models for anomaly detection", it says, hey, it's admiration.
When I say "I'm not sure if we can evaluate LLMs", it's confusion. And then "passive aggressive is the name of a linear regression model that so many people do not know; it's a pretty funny name for a regression model", it just says it's amusement. And then I say "I hate long meetings", who doesn't, and then we get anger.
So that's actually something that you can incorporate into your sentiment analysis.
Not all of the models come with the same amount of nuance, and it's important to go back to the model card. You can see here we have all of these labels on the right side: disappointment, sadness, annoyance, etcetera.
So you can see from the model's information what kind of model is actually suitable for you.
Imagine that you want to do text generation. Almost everything is exactly the same. What you do is pick up pipeline; you need a model, and to find out which one, you go to Hugging Face, click on models, and then you can filter by task. So we say text generation, and then we get all of the models which are available.
And if we pick one of them, like the standard one, and start with a sentence, we set truncation and a number of return sequences of two, and then we look at them. So the generated text would say "today is a rainy day in London", and it continues across the city with a great view, and so on and so on.
Let's take a look at another one, which is question answering. So you create a question answering pipeline, then you give it a question, "what is my job?", and then I give the context: "I'm developing AI models with Python". Then I pass in the question and the context, and it says, with a score of seventy-eight percent, start at five and end at twenty-five, the answer is "developing AI models". Pretty original.
So let's go to tokenization. If you go to transformers, you can pick up some of the tokenizers: you can use, for example, one of the auto tokenizers, and then you have, for example, AutoModelForSequenceClassification, and then we've got, for example, the DistilBERT tokenizer. So you've got a number of tokenizers, and then we also have DistilBERT for sequence classification. So you can have that as the model and then use the tokenizer from a pretrained model.
In this way you have a little bit more control over your tokenizer.
But you may be wondering, what actually is a tokenizer, and why should I care about it?
Let me show you. If we import AutoTokenizer from transformers, meaning it will find the best tokenizer for our model, and we take any kind of pretrained model and some text: you need tokenization to process a very big text in very small pieces.
A token could be a word or could be a character. And if we take the tokens from a text, we can convert them to IDs.
So each of the tokens will be changed to an ID. Let me show you for this sentence: "I was not so happy with the Barbie movie".
Sorry for all of the examples about movies, too typical.
So then we've got the tokens, and because our model is BERT base uncased, it will lowercase all of the tokens. Then we get all of the input IDs, which are the specific IDs for each of the words. And if we encode with the tokenizer, that is, we just apply the tokenizer to our text, we get these IDs.
We get a begin ID and an end ID because it's a sentence.
And then we can see that we have token type IDs and an attention mask. Token type IDs we will need if we have more sentences, so we know which segment each token belongs to. The attention mask is there to distinguish between the actual tokens and the padding.
Padding means that if you have sentences of different lengths, you need to make them the same length, because we are going to feed them to a model, and it will not understand inputs of different lengths. That's why you need padding. So this was tokenization.
The next thing that we are going to look at is fine-tuning, and we're going to do it on the IMDB dataset.
The first thing I want you to do is pip install datasets; the datasets library comes from Hugging Face. If you go to Hugging Face and you go to datasets, you can find a lot of datasets, and based on what kind of task you have, you can just go here and choose, for example, text-to-text generation. You can see, for example, here is the Salesforce WikiSQL, so you have a lot of data to play around with. Step two would be to use load_dataset from datasets, and we're going to go for IMDB movies.
If you look at the dataset, you can see there is a train, a test, and an unsupervised part, and it has text and label. After that we can go ahead and preprocess it.
What do we do with preprocessing? As I said before, we can use that tokenizer. The only thing that we need to do in the tokenize function is to take the example, add padding to it, set it to the maximum length, say we need truncation, and then just map this. It's pretty straightforward. The tokenized dataset looks like this: it has text, label, input IDs, token type IDs, and attention mask. If you remember, I told you these three things come out of tokenization.
Then we're going to set up our training arguments.
It is very similar to a classical machine learning model; if you compare it with your models on tabular data, it is basically the same. So you would get TrainingArguments and then the output directory, eval strategy, learning rate, the train batch size and evaluation batch size, the number of training epochs, and the weight decay.
For now, just keep it to these examples.
And if you print it, you can see that there are a lot of parameters you can actually tune.
After setting some parameters, we're going to do the initialization of the model.
We use AutoModelForSequenceClassification with BERT base uncased, the number of labels is two because we're going to do classification, and then we create the Trainer: the model, the arguments as we have set, the training dataset from the train split and evaluation from the test split. Training the model is the same as scikit-learn or TensorFlow or PyTorch.
It is a pretty clean API and simple to use.
After we train it, we can evaluate and print our result. And in the step of saving the fine-tuned model, just save the model and save the tokenizer to any path that you want.
Another thing that you definitely want to explore is Hugging Face Spaces.
If you click on Spaces, where you can deploy models, it is a kind of GitHub of Hugging Face, but with way more possibilities, and there are already a lot of AI apps developed by the community where you can get inspiration to build your own AI app.
For example, this one is an AI Comic Factory, and it will make a comic book from your story.
A lot of them are very interesting to explore. Let's have our own project: we are going to use the arXiv API to get access to all of the 2.4 million articles that we get on arXiv. And this is the library: you can fetch results, you can search anything you want, you can even download the papers.
So we pip install arxiv and then import it, and then we can just say: give us anything about AI or artificial intelligence or machine learning, and search.
And then we can give a number of results; it can be ten or more.
Then we get the papers, where we have the date published, the title, the abstract, and the categories.
And if we put a data frame around the papers, you get that data frame, which contains the published papers, for example this one about exploration with a single demonstration, and then the abstract.
Now we can work on this data; we can use it to do summarization.
As I explained before, we've got the abstracts. We can use one of them as an example and then use the pipeline: the task is summarization, give it a model, and then ask it to summarize. If you ask for the summarization of that specific abstract, it will start with "we propose a new method for learning" and so on. We can already build an app from our summarization of the papers. Let's get it into VS Code and build that.
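The summarization step described here can be sketched as follows. The checkpoint is one example summarization model from the Hub, and the abstract text is a made-up stand-in for one fetched from arXiv:

```python
from transformers import pipeline

# A distilled BART summarization checkpoint (an example choice; any
# summarization model from the Hub would work here).
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

# A hypothetical abstract standing in for one fetched from arXiv.
abstract = (
    "Large language models have shown impressive performance on a wide "
    "range of natural language tasks, but evaluating them reliably remains "
    "difficult. We propose a new method for learning evaluation criteria "
    "directly from human preference data and show that it correlates more "
    "strongly with expert judgements than existing automatic metrics."
)

summary = summarizer(abstract, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```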
So this project was on text. If your data is not text but time series, you can take a look at this project where I use LSTM autoencoders and convolutional neural networks. See you there.