Sequencing - Turning sentences into data (NLP Zero to Hero - Part 2)
Summary
TL;DR: In this episode of 'Zero to Hero with Natural Language Processing,' Laurence Moroney teaches viewers how to convert tokenized words into sequences of numbers using TensorFlow. He demonstrates handling sentences of varying lengths and introduces the concept of Out Of Vocabulary (OOV) tokens to manage unseen words. The video also covers padding sequences to ensure uniformity for neural network training, with options for padding position and sequence length customization. Moroney invites viewers to practice these techniques through a provided Codelab and teases the next episode, where they'll train a neural network to detect sarcasm in text.
Takeaways
- 😀 This is episode 2 of a series on Natural Language Processing (NLP) with TensorFlow.
- 🔢 The focus of this episode is on converting tokenized words into sequences of numbers.
- 📚 The tokenizer's 'texts_to_sequences' method is introduced, which simplifies the process of creating token sequences.
- 📈 It's important to manage sentences of different lengths, which is demonstrated by adding a new sentence to the dataset.
- 🆕 The script shows how new words like 'amazing', 'think', 'is', and 'do' are introduced into the token index.
- 🤖 The issue of neural networks encountering unknown words during classification is discussed.
- 🔄 A solution to the unknown words problem is presented using the Out Of Vocabulary (OOV) token.
- 📊 The concept of padding sequences is introduced to handle varying sentence lengths in neural network training.
- 💻 The script provides a practical example of padding sequences with zeros and adjusting padding positions.
- 🔗 A URL is given for viewers to try out the code and experiment with the concepts discussed in the video.
- 🔮 The next episode's preview hints at training a neural network to classify text as sarcastic or not sarcastic.
Q & A
What is the main focus of episode 2 of the Zero to Hero with Natural Language Processing series?
-The main focus of episode 2 is to teach how to create sequences of numbers from sentences using TensorFlow's tools, and how to process these sequences for neural network training.
What is tokenization in the context of natural language processing?
-Tokenization is the process of converting sentences into numeric tokens, which are essentially a representation of words in a numerical form that can be used by machine learning models.
Why is it important to manage sentences of different lengths when creating sequences?
-Managing sentences of different lengths is important because it ensures that all input data for a neural network is of the same size, which is a common requirement for many machine learning algorithms.
What does the 'texts_to_sequences' method do in TensorFlow's tokenizer?
-The 'texts_to_sequences' method of TensorFlow's tokenizer converts a list of sentences into sequences of tokens, where each token represents a word in the sentence.
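To make this concrete, here is a minimal pure-Python sketch that mirrors what the tokenizer's fitting step and 'texts_to_sequences' do: rank words by frequency, number them starting at 1, then map each sentence to its tokens, silently dropping unknown words. The helper name 'fit_word_index' is invented for illustration; the real API is TensorFlow's Tokenizer, and the sketch reproduces its numbering only for the video's tiny corpus.

```python
from collections import Counter

def fit_word_index(sentences):
    # Rank words by frequency (ties broken by first appearance) and
    # number them starting at 1, as the video's word index does.
    words = [w.strip("?!.,").lower() for s in sentences for w in s.split()]
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: (-counts[w], words.index(w)))
    return {w: i + 1 for i, w in enumerate(ranked)}

def texts_to_sequences(sentences, word_index):
    # Known words become their tokens; unknown words are silently dropped.
    return [[word_index[w]
             for w in (t.strip("?!.,").lower() for t in s.split())
             if w in word_index]
            for s in sentences]

sentences = ['I love my dog', 'I love my cat',
             'You love my dog!', 'Do you think my dog is amazing?']
word_index = fit_word_index(sentences)
print(texts_to_sequences(['I love my dog'], word_index))  # [[4, 2, 1, 3]]
```

With this corpus, 'my' is the most frequent word and gets token 1, so 'I love my dog' becomes 4, 2, 1, 3 — the same sequence shown in the video.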
How does the script demonstrate the handling of new words not present in the initial word index?
-The script demonstrates handling new words by using the Out Of Vocabulary (OOV) token property, which replaces unrecognized words with a special token, thus maintaining the sequence length.
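The OOV mechanism can be sketched in plain Python as follows: reserve token 1 for an out-of-vocabulary marker and substitute it for any unrecognized word, so each sequence keeps the length of its sentence. This mirrors passing oov_token="&lt;OOV&gt;" to TensorFlow's Tokenizer; the helper functions here are illustrative stand-ins, not the library implementation.

```python
from collections import Counter

OOV = "<OOV>"

def fit_word_index(sentences):
    words = [w.strip("?!.,").lower() for s in sentences for w in s.split()]
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: (-counts[w], words.index(w)))
    # Index 1 is reserved for the OOV marker; real words start at 2.
    index = {OOV: 1}
    index.update((w, i + 2) for i, w in enumerate(ranked))
    return index

def to_sequences(sentences, word_index):
    # Unknown words map to the OOV token instead of being dropped,
    # so every sequence keeps the length of its sentence.
    return [[word_index.get(w.strip("?!.,").lower(), word_index[OOV])
             for w in s.split()]
            for s in sentences]

train = ['I love my dog', 'I love my cat',
         'You love my dog!', 'Do you think my dog is amazing?']
word_index = fit_word_index(train)
print(to_sequences(['I really love my dog', 'My dog loves my manatee'],
                   word_index))
# [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

'really', 'loves', and 'manatee' are absent from the training corpus, so each is encoded as 1 rather than vanishing — some meaning is lost, but the sequences stay the correct length.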
What is the purpose of padding in the context of sequence processing for neural networks?
-Padding is used to ensure that all sequences are of the same length by adding zeros to the shorter sequences. This is necessary for batch processing in neural networks.
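A minimal sketch of the default padding step, assuming 0 is reserved as the padding value (the OOV token is 1, never 0): left-pad every sequence to the length of the longest one. TensorFlow's 'pad_sequences' does this for you; the 'pad' helper below is only an illustration of the idea.

```python
def pad(sequences, value=0):
    # Left-pad each sequence with `value` up to the longest length,
    # so all rows end up the same size for batch training.
    maxlen = max(len(s) for s in sequences)
    return [[value] * (maxlen - len(s)) + list(s) for s in sequences]

# The four video sentences, encoded with an OOV-aware word index;
# the longest ('Do you think my dog is amazing?') has seven tokens.
sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(pad(sequences)[0])  # [0, 0, 0, 5, 3, 2, 4]
```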
What is the difference between pre-padding and post-padding as mentioned in the script?
-Pre-padding (the default) adds zeros at the beginning of the sequence, while post-padding adds zeros at the end. The choice depends on the specific requirements of the neural network training process.
What is the 'maxlen' parameter used for in the context of padding sequences?
-The 'maxlen' parameter specifies the maximum length of the sequences after padding. Sequences longer than 'maxlen' can be truncated according to the specified truncation method.
How can sequences be truncated if they exceed the specified 'maxlen'?
-Sequences can be truncated either by removing words from the end (post-truncation) or from the beginning (pre-truncation) of the sequence if they exceed the 'maxlen'.
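The 'padding', 'maxlen', and 'truncating' options can be sketched together in plain Python. This stand-in mirrors the keyword arguments of TensorFlow's 'pad_sequences' as described above, but it is an illustrative reimplementation under those assumptions, not the library code.

```python
def pad_sequences(sequences, maxlen=None, padding="pre",
                  truncating="pre", value=0):
    # Default behavior: pad to the longest sequence, zeros at the front.
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    out = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            # 'pre' drops tokens from the front, 'post' from the back.
            s = s[-maxlen:] if truncating == "pre" else s[:maxlen]
        pad = [value] * (maxlen - len(s))
        out.append(pad + s if padding == "pre" else s + pad)
    return out

seqs = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(pad_sequences(seqs, maxlen=5, padding="post"))
# [[5, 3, 2, 4, 0], [9, 2, 4, 10, 11]]
```

With maxlen=5, the four-token sequence is padded at the end, while the seven-token sequence loses its first two tokens, since truncating defaults to 'pre'.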
What is a RaggedTensor and why is it mentioned as an advanced solution in the script?
-A RaggedTensor is a tensor that can handle sequences of varying lengths without padding. It is mentioned as an advanced solution because it is more complex and beyond the scope of the series.
What will be the focus of the next video in the series after discussing tokenization and sequence creation?
-The next video will focus on training a neural network with text data, specifically looking at a dataset with sentences classified as sarcastic and not sarcastic to determine if sentences contain sarcasm.
Outlines
📚 Tokenization and Sequence Creation
In this segment, Laurence Moroney introduces the next step in the series on Natural Language Processing (NLP) with TensorFlow, which is converting tokenized words into sequences of numbers. He discusses the importance of managing sentences of different lengths and demonstrates the 'texts to sequences' method. The segment also addresses the challenge of out-of-vocabulary (OOV) words and how to handle them using an OOV token. The presenter shows how to use padding to standardize sequence lengths for neural network training, ensuring that all input data has the same dimensions.
🔍 Advanced Text Processing Techniques
This paragraph delves into the nuances of text processing for neural network training. It clarifies the difference between padding (represented by '0') and the OOV token (represented by '1'). The presenter explains how to adjust padding to appear at the end of sequences and how to set a maximum length for sequences using the 'maxlen' parameter. It also covers how to handle sequences longer than the specified 'maxlen' by truncating them either from the beginning or the end. The segment invites viewers to explore the code through a provided URL and teases the next video's content, which will involve training a neural network to classify sentences as sarcastic or not sarcastic.
Keywords
💡Tokenization
💡Sequences
💡Tokenizer
💡Out Of Vocabulary (OOV)
💡Word Index
💡Padding
💡RaggedTensor
💡Neural Networks
💡Natural Language Processing (NLP)
💡Sarcasm Detection
Highlights
Introduction to episode 2 of the Zero to Hero series on Natural Language Processing.
Continuation from the previous episode on word tokenization using TensorFlow's tools.
Exploration of creating sequences of numbers from sentences for neural network training.
Demonstration of the 'texts_to_sequences' method on sentences of different lengths.
Explanation of how to handle sentences with words not in the initial word index.
Introduction of the concept of Out Of Vocabulary (OOV) tokens to handle unknown words.
Technique to replace unrecognized words with an OOV token to maintain sequence integrity.
Discussion on the importance of a large word index for classifying texts outside the training set.
Introduction to the concept of padding to handle sequences of different lengths.
Explanation of using 'pad_sequences' from the preprocessing module to equalize sequence lengths.
Illustration of how padding works with examples of sequence transformation.
Option to adjust padding position with the 'padding' parameter.
Customization of padded sequence length with the 'maxlen' parameter.
Handling of sequences longer than 'maxlen' with truncation options.
Invitation to try out the code from the video through a provided URL.
Preview of the next episode focusing on training a neural network with text data.
Teaser of a dataset involving classification of sarcastic and non-sarcastic sentences.
Transcripts
[MUSIC PLAYING]
LAURENCE MORONEY: Welcome to episode 2
of this series of Zero to Hero with Natural Language
Processing.
In the last video, you learned about how to tokenize
words using TensorFlow's tools.
In this one, you'll take that to the next step,
creating sequences of numbers from your sentences
and using tools to process them to make them ready for teaching
neural networks.
Last time, we saw how to take a set of sentences
and use the tokenizer to turn the words into numeric tokens.
Let's build on that now by also seeing
how the sentences containing those words
can be turned into sequences of numbers.
We'll add another sentence to our set of texts,
and I'm doing this because the existing sentences all
have four words, and it's important to see
how to manage sentences, or sequences,
of different lengths.
The tokenizer supports a method called texts
to sequences which performs most of the work for you.
It creates sequences of tokens representing each sentence.
Let's take a look at the results.
At the top, you can see the list of word-value pairs
for the tokens.
At the bottom, you can see the sequences that texts
to sequences has returned.
We have a few new words such as amazing, think, is, and do,
and that's why this index looks a little different than before.
And now, we have the sequences.
So for example, the first sequence
is 4, 2, 1, 3, and these are the tokens for I,
love, my, and dog in that order.
So now, we have the basic tokenization done,
but there's a catch.
This is all very well for getting
data ready for training a neural network,
but what happens when that neural network needs
to classify texts, but there are words
in the text that it has never seen before?
This can confuse the tokenizer, so we'll
look at how to handle that next.
Let's now look back at the code.
I have a set of sentences that I'll use
for training a neural network.
The tokenizer gets the word index from these
and creates sequences for me.
So now, if I want to sequence these sentences, containing
words like manatee that aren't present in the word index,
because they weren't in my initial set of data,
what's going to happen?
Well, let's use the tokenizer to sequence them
and print out the results.
We see this, I really love my dog.
A five-word sentence ends up as 4, 2, 1, 3,
a four-word sequence.
Why?
Because the word really wasn't in the word index.
The corpus used to build it didn't contain that word.
And my dog loves my manatee ends up
as 1, 3, 1, which is my, dog, my,
because loves and manatee aren't in the word index.
So as you can imagine, you'll need a really big word index
to handle sentences that are not in the training set.
But in order not to lose the length of the sequence,
there is also a little trick that you can use.
Let's take a look at that.
By using the OOV token property, and setting it as something
that you would not expect to see in the corpus, like angle
bracket, OOV, angle bracket, the tokenizer
will create a token for that, and then
replace words that it does not recognize
with the Out Of Vocabulary token instead.
It's simple, but effective, as you can see here.
Now, the earlier sentences are encoded like this.
We've still lost some meaning, but a lot less.
And the sentences are at least the correct length.
That's a handy little trick, right?
And while it helps maintain the sequence length
to be the same length as the sentence,
you might wonder, when it comes to needing
to train a neural network, how it can handle
sentences of different lengths?
With images, they're all usually the same size.
So how would we solve that problem?
The advanced answer is to use something
called a RaggedTensor.
That's a little bit beyond the scope of this series,
so we'll look at a different and simpler solution, padding.
OK.
So here's the code that we've been using,
but I've added a couple of things.
First is to import pad sequences from pre-processing.
As its name suggests, you can use it to pad our sequences.
Now, if I want to pad my sequences, all I have to do
is pass them to pad sequences, and the rest is done for me.
You can see the results of our sentences here.
First is the word index, and then is
the initial set of sequences.
The padded sequence is next.
So for example, our first sentence is 5, 3, 2, 4.
And in the padded sequence, we can
see that there are three 0s preceding it.
Well, why is that?
Well, it's because our longest sentence had seven words in it.
So when we pass this corpus to pad sequences,
it measured that and ensured that all of the sentences
would have equally-sized sequences by padding them
with 0s at the front.
Note that OOV isn't 0.
It's 1.
0 means padding.
Now, you might think that you don't want the 0s in front,
and you might want them after the sentence.
Well, that's easy.
You just set the padding parameter to post like this,
and that's what you'll get.
Or if you don't want the length of the padded sentences
to be the same as the longest sentence,
you can then specify the desired length
with the maxlen parameter like this.
But wait, you might ask what happens if sentences are longer
than the specified maxlen?
Well, then, you can specify how to truncate:
either chopping off the words at the end, with post-truncation,
or from the beginning, with pre-truncation.
And here's what post-truncation looks like.
But don't take my word for it.
Check out the Codelab at this URL,
and you can try out all of the code
in this video for yourself.
Now that you've seen how to tokenize your text
and organize it into sequences, in the next video,
we'll take that data and train a neural network with text data.
We'll look at a data set with sentences that are classified
as sarcastic and not sarcastic, and we'll
use that to determine if sentences contain sarcasm.
Really?
No, no.
I mean, really.
[MUSIC PLAYING]