Text Classification Using BERT & Tensorflow | Deep Learning Tutorial 47 (Tensorflow, Keras & Python)
Summary
TL;DR: This video builds on the previous explanation of BERT by demonstrating how to use it for email classification, determining whether emails are spam or non-spam. The presenter walks through key steps, including generating embedding vectors from email text using BERT and feeding them into a simple neural network. The model is trained and evaluated, achieving high accuracy. The tutorial also touches on handling data imbalances, building functional models in TensorFlow, and using cosine similarity to compare word embeddings. Viewers are encouraged to practice by running similar code on their own.
Takeaways
- 😀 The video explains how BERT, a language model, can be used for email classification to determine if an email is spam or not.
- 🔍 BERT converts an entire email into an embedding vector, which is a numerical representation that captures the context of the text.
- 📊 The video demonstrates creating a neural network with a single dense layer and a dropout layer to prevent overfitting, using the embeddings as input.
- 📈 The script discusses the importance of data preprocessing, including creating a new column for the target variable and performing a train-test split while maintaining class balance.
- 🌐 The tutorial guides viewers on how to access and use BERT models from TensorFlow Hub for pre-processing and encoding text.
- 💻 The presenter shows how to generate embedding vectors for sentences and words using BERT, and even compares the vectors using cosine similarity.
- 📝 The video introduces functional models in TensorFlow, contrasting them with sequential models, and demonstrates building a functional model for the classification task.
- 🎯 The training process involves compiling the model with an optimizer and loss function, then fitting it to the training data.
- 📊 The script includes an evaluation of the model's performance on a test set, achieving high accuracy for both training and testing.
- 🔧 The video concludes with an exercise for viewers to practice what they've learned by following a TensorFlow tutorial on text classification with BERT.
Q & A
What is the main purpose of using BERT in the provided email classification example?
-The main purpose of using BERT in this example is to convert the entire email into an embedding vector, which can then be fed into a neural network for training to classify emails as spam or non-spam.
Why is the embedding vector from BERT set to a length of 768?
-The embedding vector from BERT is set to a length of 768 because this is the standard dimensionality of the hidden layers in the BERT model, which was covered in a previous video as mentioned by the speaker.
What are the two main components of the BERT model?
-The two main components of the BERT model are preprocessing and encoding. Preprocessing prepares the text for the model, while encoding generates the sentence embeddings.
How does the speaker handle the imbalance in the dataset between ham and spam emails?
-The speaker first checks for class imbalance by grouping the data and observing the distribution of spam and ham emails. To ensure balance during model training, the speaker uses stratification during the train-test split, ensuring proportional representation of both classes.
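What stratification guarantees can be sketched in plain Python (the video itself relies on sklearn's `train_test_split(..., stratify=y)`; the toy label counts below are invented for illustration):

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split indices so each class keeps its proportion in both halves."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Toy labels: 80 ham (0), 20 spam (1) -- the same imbalance idea as the dataset
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_frac=0.2)
spam_frac_test = sum(labels[i] for i in test_idx) / len(test_idx)
```

Both halves keep the original 20% spam proportion, which is exactly what passing `stratify=y` to sklearn's `train_test_split` ensures automatically.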
What is the purpose of the ‘apply’ function in creating the 'spam' column in the data frame?
-The 'apply' function is used to create a new column 'spam' by applying a lambda function that assigns 1 if the email is spam and 0 if it is ham, using a ternary operator in Python.
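A minimal pandas sketch of this step (the column names `Category` and `Message` are assumptions about the CSV headers, and the rows are invented):

```python
import pandas as pd

# A tiny stand-in for the Kaggle spam CSV (the real file has thousands of rows)
df = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["See you at lunch", "WIN a free prize now!!!", "Meeting moved to 3pm"],
})

# Ternary operator inside a lambda: 1 for spam, 0 for ham
df["spam"] = df["Category"].apply(lambda x: 1 if x == "spam" else 0)
```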
Why does the speaker use a dropout layer in the neural network model?
-A dropout layer is used to tackle overfitting by randomly dropping a fraction of the neurons during training. In this case, the speaker drops 10% of the neurons to improve generalization.
What does the speaker mean by 'pooled output' in BERT encoding?
-The 'pooled output' in BERT refers to the embedding vector for the entire sentence, which is generated after encoding and represents the meaning of the full sentence.
How does the speaker evaluate the model's performance?
-The speaker evaluates the model's performance by splitting the dataset into training and test sets, then training the model for 5 epochs. After training, the speaker achieves a 95% accuracy on the test set.
What is cosine similarity, and how is it used in the video?
-Cosine similarity is a metric used to measure the similarity between two vectors. In the video, the speaker uses cosine similarity to compare the embedding vectors of different words (e.g., comparing fruits like 'banana' and 'grapes' and people like 'Jeff Bezos').
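A minimal NumPy sketch of the idea (the video uses sklearn's `cosine_similarity` on real 768-dimensional BERT embeddings; the 3-dimensional vectors here are invented stand-ins):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

banana = [1.0, 0.9, 0.1]   # made-up stand-ins for 768-dim BERT embeddings
grapes = [0.9, 1.0, 0.2]
person = [0.1, 0.2, 1.0]

sim_fruits = cosine_similarity(banana, grapes)  # two "fruit" vectors: near 1
sim_mixed = cosine_similarity(banana, person)   # fruit vs person: lower
```

As in the video, two vectors from the same category score close to 1, while unrelated vectors score noticeably lower.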
What exercise does the speaker recommend for the viewers?
-The speaker recommends that viewers follow a TensorFlow tutorial on classifying text with BERT. The exercise involves copying and running code from the tutorial to practice and solidify the concepts learned in the video.
Outlines
📧 Introduction to Email Classification with BERT
This paragraph introduces the concept of using BERT (Bidirectional Encoder Representations from Transformers) for email classification, distinguishing spam from non-spam emails. The author explains how BERT converts an entire email into an embedding vector, which is then fed into a simple neural network with a single dense layer for training. The network also uses a dropout layer to prevent overfitting. The paragraph discusses the steps involved in preprocessing the dataset, including handling data imbalance and creating a new column to indicate spam status.
🔗 Setting Up BERT for Text Encoding
This paragraph outlines the steps required to set up BERT for text encoding using TensorFlow Hub. It describes how to access and utilize the BERT pre-processing and encoding components from the TensorFlow Hub website. The author provides instructions on copying the required URLs for BERT pre-processing and encoding, and integrating them into a Keras layer. The paragraph also emphasizes the time it may take to download the pre-trained BERT model and the use of the model to generate embedding vectors for different sentences.
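A sketch of that setup, assuming the standard `bert_en_uncased` preprocessing and encoder models from TensorFlow Hub (the exact URLs and version numbers may differ from the ones used in the video, and the first run downloads roughly 300 MB):

```python
import tensorflow_hub as hub
import tensorflow_text  # registers the ops the preprocessing model needs

# URLs copied from tfhub.dev -- check the site for current versions
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

bert_preprocess = hub.KerasLayer(preprocess_url)  # downloads on first use
bert_encoder = hub.KerasLayer(encoder_url)        # ~300 MB, cached locally

def get_sentence_embedding(sentences):
    """Return one 768-dim pooled embedding per input sentence."""
    preprocessed = bert_preprocess(sentences)
    return bert_encoder(preprocessed)["pooled_output"]
```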
🤖 Building a Functional Neural Network Model with BERT
This section details the process of building a functional neural network model using BERT for text classification. The author explains the difference between sequential and functional models in TensorFlow and introduces a method to create a functional model that can handle multiple inputs and outputs. The steps include defining an input layer, processing the input through BERT encoding, adding a dropout layer to prevent overfitting, and finally adding a dense layer with a sigmoid activation function for binary classification. The author also provides insights into the trainable and non-trainable parameters of the model and discusses the use of binary cross-entropy as a loss function.
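A sketch of the functional model described above, under the assumption that the TensorFlow Hub handles are the standard `bert_en_uncased` ones (not verified against the video's exact code; running it requires downloading the BERT model):

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # needed by the preprocessing model

bert_preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

# Functional API: each layer is called like a function on the previous output
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
preprocessed = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed)
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs["pooled_output"])
l = tf.keras.layers.Dense(1, activation="sigmoid", name="output")(l)

model = tf.keras.Model(inputs=[text_input], outputs=[l])
# Dense layer: 768 weights + 1 bias = 769 trainable parameters;
# the BERT encoder's parameters stay frozen (non-trainable)
```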
🎯 Training and Evaluating the Model
This paragraph focuses on the training and evaluation process of the neural network model built using BERT. The author explains how to compile the model using standard parameters like optimizer and loss functions, and then trains it using the training dataset. The training is conducted over a set number of epochs, and the author discusses the time it might take depending on the system's computing power. The model achieves an accuracy of 93% on the training data and 95% on the test data. The author also demonstrates how to perform inference on new emails using the trained model and provides examples of the model's predictions for spam detection.
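The decision rule at the end, where a sigmoid output above 0.5 is treated as spam, can be sketched on hypothetical probabilities (the values below are invented, not the model's actual outputs):

```python
import numpy as np

# Hypothetical sigmoid outputs from model.predict() on five emails
probs = np.array([0.93, 0.88, 0.71, 0.12, 0.04])

# Above 0.5 -> spam (1), otherwise ham (0)
labels = (probs > 0.5).astype(int)
```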
Keywords
💡BERT
💡Embedding Vector
💡Neural Network
💡Dense Layer
💡Dropout Layer
💡Pre-processing
💡Encoding
💡Spam Classification
💡Imbalanced Dataset
💡Cosine Similarity
💡Functional Model
Highlights
BERT is used for email classification to determine if an email is spam or not.
BERT converts an entire email into an embedding vector.
The embedding vector generated by BERT is 768 in length.
A simple neural network with one dense layer is used for training after BERT encoding.
A dropout layer is included to prevent overfitting.
BERT consists of two components: pre-processing and encoding.
The BERT model is downloaded from TensorFlow Hub.
The dataset used is from Kaggle, with two columns: category and email content.
Data imbalance is noted, with more 'ham' emails than 'spam'.
A new column 'spam' is created to label emails as 1 for spam and 0 for ham.
A train-test split is performed with 80% for training and 20% for testing.
Stratification is used in the train-test split to maintain balance.
The BERT model is used to generate embedding vectors for sentences.
Cosine similarity is used to compare embedding vectors.
Functional models in TensorFlow are introduced as an alternative to sequential models.
The model architecture includes an input layer, BERT encoder, dropout layer, and a dense output layer.
The model is compiled with binary cross-entropy loss due to binary classification.
The model achieves 93% accuracy on training and 95% on testing.
Inference is performed on new emails to classify them as spam or not spam.
BERT can be applied to various text classification problems beyond email classification.
An exercise is provided for viewers to practice using BERT with a larger dataset.
Transcripts
I hope you have seen my previous video on what BERT is. In that video I explained how BERT works, the fundamentals of it. In today's video we are going to do email classification, whether it's spam or non-spam, using BERT. Now BERT will convert an email, you know, the whole email, into an embedding vector. So we saw in the previous video that the purpose of BERT is to generate an embedding vector for the entire sentence, and that is something that we can feed into our neural network and do the training.
So here we will generate a vector of 768 length. Why 768? We have covered that in the previous video. And then we will supply that to a very simple neural network with only one dense layer, one neuron in the dense layer as an output. We will also put a dropout layer in between just to tackle the overfitting. Now if you open this BERT box, by the way, it has two components, pre-processing and encoding, and we talked about that in the previous video as well, so watching that previous video is quite a prerequisite. So let's jump into coding.
Now here I have downloaded this file from Kaggle. Simple: two columns, the category (ham or spam), and here is the content of your email. I have imported a few basic libraries here in my Jupyter notebook, and I'm going to simply read this CSV file into my pandas data frame, which looks like that. And then I will do some basic analysis, you know, I will do df group by, let's say, category. So here I have 4825 ham emails and 747 spam emails. You can clearly see there is some imbalance in our data set, so we need to take care of that. But before we do that, we will create a new column in my data frame; you know, we'll call it spam. So let's create a new column, and if the category is spam, the value of this spam column will be 1; for ham it will be 0. And you all know, if you want to create a new column in a data frame from an existing column, you can use the apply function, and that will take a lambda. And what you are doing is: if x is equal to spam then the value is one (you see, this is how the ternary operator in Python works), else the value is zero. And now if you do df.head(), see, we simply created a new zero/one column: spam is 1, ham is 0. All right, so far so good. Now let's do the train test split.
So I'm going to use our standard sklearn train_test_split function, and in that my X is actually the message and my y is the spam column, okay, and I'm going to store the output into these variables. This is pretty much a standard practice in the machine learning world. And okay, I will set our test size to be 0.2, so eighty percent training samples, twenty percent test samples. Let me check how it split the spam and non-spam, so value_counts. Okay, so I'm checking this to make sure there is a balance. Okay, so let's see: 149 divided by 967, okay, around 15 percent spam in the test set, and 3859... okay, so it is a good balance. But still, to be on the safe side, I will say stratify. So when you do stratify, it will make sure there is a balance, you know. It's not like in your training data set all the samples are zero and there are fewer samples which have the spam value; then the model will not be good in terms of detecting the spam. Okay, so that's why I supply this stratify. I mean, before stratify also it already did a good job, but this is just to be on the safe side. Now comes the most interesting part, which is creating the embedding using BERT.
Okay, so how do you do that? For that you have to go to this TensorFlow Hub website, click on text, and go to the BERT models. Now in BERT we are going to use this first model. So we saw in the previous video that there is an encoder and there is a pre-processing step. So first you do pre-processing: you click here, you copy the URL. Okay, so this is my pre-process URL, all right. And you go back, you go to text, BERT, and you go here and copy this URL. This is your main encoder, okay? So this is your pre-processing URL and this is your main encoder URL. So I'm going to use hub.KerasLayer, basically, okay, and call that bert_preprocess, and then I will use the same hub.KerasLayer here and call it bert_encoder. See, we saw in the presentation there are two steps in BERT, pre-processing and encoding, so that's what we did exactly.
Okay, when you run it, it's going to take some time because it is downloading the BERT model; you know, it's somewhere around 300 megabytes, so based on your internet speed it might take some time. But essentially you are downloading a trained model, which is trained on all of Wikipedia and the book corpus. So now, in our task, we'll be just directly using that trained model to generate the embedding vectors.
After the model is downloaded, I am going to define a simple function that takes a couple of sentences as an input and returns me an embedding vector. So basically the way I'll use this function is, okay, supply an array of sentences, any sentences, okay, and that should return me the embedding vector for each entire sentence. And if you've seen, again, my previous video, this pre-process handle that you get, you can use it as a normal function: you supply your sentences here and it should return you the pre-processed text. So I will just call it preprocessed text and then use the BERT encoder, okay. And when you use the BERT encoder, it returns a dictionary, out of which you need to use pooled_output. Pooled output is basically the encoding for the entire sentence. Again, if you want to know what other elements are there in the dictionary, you need to watch my previous video; it's sort of like a prerequisite. All right, now see, when I run this it is generating: for this sentence, this is my embedding vector, and the size is 768; for this second sentence, this is my embedding vector. So we have achieved the major goal here, which is generating the vector, you know, the vector using BERT. And I just gave you a simple function, but in reality we will be using TensorFlow layers.
we will be using tensorflow layers
okay but before we go there let me
generate some embedding vectors so
for some more you know words let's just
generate it for words let's say banana I
want to
see what kind of embedding vectors it
generates
for a couple of fruits and then
you know what I will compare the fruits
with Jeff Bezos and Bill Gates
so these three are people these three
are fruits so
let's see what kind of embedding vector
it generates
and now I have all these embedding
vectors right so if you do e
6 by 7 68 okay i am going to use
cosine similarity so if you have seen my
cosine similarity video
if you do cosine similarity you will
find this video where I have explained
you know what is exactly cosine
similarity
so if you don't know watch it. It is used
to compare
two vectors so here i will compare
let's say banana banana's embedding
vector
with uh grapes embedding vector now
this takes a two dimensional array so
I'm just going to wrap it up in a
two dimensional array okay you see 0.99
so if it is near to 1 it means
these two vectors are very similar so
banana and grapes are similar because
they are fruits banana and
mango is also similar because they are
fruit but let me compare banana with
Jeff Bzos it's kind of weird right
comparing banana with Jeff Bezos
see 0.84 but still they're not 0.99
they're not as similar as banana and
grapes
okay and by the way you have to use this
with a
with a little caution I mean cosine
similarity
is not exact vector similarity okay so
sometimes you might see some unexpected
result but that's okay
now let me compare Jeff Bezos with Elon
Musk
say again 0.98 so you get the point
behind
BERT now okay now let's build a model
So far in our deep learning series we have built TensorFlow models using the sequential model, okay. We are going to now use functional models. So there are two types of models, sequential and functional. Okay, so what is the difference between the two? I'm going to link a good article here. So in the sequential model you add layers one by one as a sequence, you see. But in a functional model you create your input, then you create a hidden layer, let's say, and supply the input as a function argument; then you create hidden layer one, then you supply that into hidden layer two's argument, and so on, and then you create the model using inputs and outputs. Now this allows you to create a model which might have multiple inputs and multiple outputs, like something like ResNet, you know. You can also share network layers with other models. So there are some differences; read the article and you will get an idea. So here I'm going to build a functional model, okay.
So the first step is you create your input layer. The shape is going to be this, because the sentence length is varying, and my data type is string. And the name of this layer, I will call it text, or input, whatever; you know, you can give it the name that you like the most. And this will be my input layer. Then we are going to do these two things. So here I supply the input, okay, same thing, and then the BERT encoder. And the BERT encoder... let me just do output here, so outputs, okay, and from the outputs I get pooled_output. So pooled output will be the sentence encoding; so pooled output will be this, okay. I will create one dropout layer, which I have not shown in the picture. So let me feed that pooled output into a dropout layer, and then the last layer will be a one-neuron dense layer, okay. So let's create the dropout layer here. The dropout layer is used to tackle the overfitting; sometimes even if you don't do it, it's okay, but it helps. And in that dropout layer you pass this as an input, okay. So now I'm going to drop 10% of the neurons, okay, and I will call this dropout, and let's call it l; l is the layer. And the second one is the dense layer with one neuron, and since it's a binary classification, you know, like a one/zero kind of thing, I will do sigmoid, and the name of this layer is output. And again, we are using the functional API, so you need to treat this like a function and pass in the previous layer here, and then I will, say, overwrite the same variable, you know. Okay, and then in the end my model is nothing but this Model, which has two parameters, inputs and outputs. Now the inputs will be this; it's an array, so you can supply multiple inputs as well. So here this is the input, and the output will be l, okay. And you can do model.summary(). Okay, "output" is not defined... so, outputs. Great.
Now here my trainable parameters are 769, because I have 768 neurons here plus this one, so 769 in total. My non-trainable parameters are so many; these are the parameters from my BERT model. BERT is already trained, so I don't need to train them again. And when you are doing, you know, model building, you know that you do model.compile, where optimizer and loss are pretty much standard things that we use in all our tutorials. The loss is binary cross-entropy because we are doing binary classification here.
And then I'm going to now run the training. So model.fit, X_train, y_train, epochs; let's do ten epochs. Now this is going to take time, because the whole encoding process is a little slow and we have so many samples, so based on your computer it might take time. I have a powerful computer and GPU, but it still takes a few minutes, so you have to be a little patient. You can reduce the epochs if you want. Okay, so I reduced the epochs to five and I got ninety-three percent accuracy. Then I do model.evaluate on my X_test and y_test, and I got ninety-five percent accuracy, which is so good, actually. So now I do inference. So I have a couple of emails (actually, it's not reviews, it's emails), and on those emails when I do predict, see, the first three emails are spam, and the rest of them are not spam, they're legit emails. And with sigmoid, whenever the value is more than 0.5 it means it's spam, and when it is less than 0.5 it means it's not spam. So you see, these things worked out really well for us.
All right, so this tutorial provided you a very simple explanation of how you can do text classification using BERT. You can use BERT for a variety of other problems as well, such as movie review classification or named entity recognition. And by the way, I have an exercise for you, and the exercise is actually very simple: you have to just do copy-paste. So go to Google, search for the TensorFlow tutorials, and in there go to the text tutorials and look at "Classify text with BERT". So what you need to do is just run this code on your computer. So just copy-paste these lines, you know, step by step, in your notebook, run it, and try to understand it. This tutorial is similar to what we did, but the data set is much bigger; they are using the TensorFlow Datasets API, so in terms of API also it is a little different (we used pandas), and they are also using some caching. The model is also a little different. So if you practice this, you will consolidate your learning from this particular video. So I hope you are going to practice; I trust you all, you are all sincere students. So please open a notebook, copy-paste these lines one by one, try to understand them, and see how it works. If you are confident, you can just load the data set and finish the rest of the tutorial without referring to this page, okay? So thank you very much for watching. I will see you in the next video. If you liked this particular video, give it a thumbs up and share it with your friends.