Sequencing - Turning sentences into data (NLP Zero to Hero - Part 2)

TensorFlow
25 Feb 2020 · 06:25

Summary

TL;DR: In this episode of 'Zero to Hero with Natural Language Processing,' Laurence Moroney teaches viewers how to convert tokenized words into sequences of numbers using TensorFlow. He demonstrates handling sentences of varying lengths and introduces the concept of Out Of Vocabulary (OOV) tokens to manage unseen words. The video also covers padding sequences to ensure uniformity for neural network training, with options for padding position and sequence length customization. Moroney invites viewers to practice these techniques through a provided Codelab and teases the next episode, where they'll train a neural network to detect sarcasm in text.

Takeaways

  • 😀 This is episode 2 of a series on Natural Language Processing (NLP) with TensorFlow.
  • 🔢 The focus of this episode is on converting tokenized words into sequences of numbers.
  • 📚 The 'texts to sequences' method is introduced, which simplifies the process of creating token sequences.
  • 📈 It's important to manage sentences of different lengths, which is demonstrated by adding a new sentence to the dataset.
  • 🆕 The script shows how new words like 'amazing', 'think', 'is', and 'do' are introduced into the token index.
  • 🤖 The issue of neural networks encountering unknown words during classification is discussed.
  • 🔄 A solution to the unknown words problem is presented using the Out Of Vocabulary (OOV) token.
  • 📊 The concept of padding sequences is introduced to handle varying sentence lengths in neural network training.
  • 💻 The script provides a practical example of padding sequences with zeros and adjusting padding positions.
  • 🔗 A URL is given for viewers to try out the code and experiment with the concepts discussed in the video.
  • 🔮 The next episode's preview hints at training a neural network to classify text as sarcastic or not sarcastic.

Q & A

  • What is the main focus of episode 2 of the Zero to Hero with Natural Language Processing series?

    -The main focus of episode 2 is to teach how to create sequences of numbers from sentences using TensorFlow's tools, and how to process these sequences for neural network training.

  • What is tokenization in the context of natural language processing?

    -Tokenization is the process of converting sentences into numeric tokens, which are essentially a representation of words in a numerical form that can be used by machine learning models.

  • Why is it important to manage sentences of different lengths when creating sequences?

    -Managing sentences of different lengths is important because it ensures that all input data for a neural network is of the same size, which is a common requirement for many machine learning algorithms.

  • What does the 'texts to sequences' method do in TensorFlow's tokenizer?

    -The 'texts to sequences' method in TensorFlow's tokenizer converts a list of sentences into sequences of tokens, where each token represents a word in the sentence.

  • How does the script demonstrate the handling of new words not present in the initial word index?

    -The script demonstrates handling new words by using the Out Of Vocabulary (OOV) token property, which replaces unrecognized words with a special token, thus maintaining the sequence length.

  • What is the purpose of padding in the context of sequence processing for neural networks?

    -Padding is used to ensure that all sequences are of the same length by adding zeros to the shorter sequences. This is necessary for batch processing in neural networks.

  • What is the difference between pre-padding and post-padding as mentioned in the script?

    -Pre-padding adds zeros at the beginning of the sequence, while post-padding adds them at the end. The choice depends on the specific requirements of the neural network training process; a short code sketch after this Q&A section shows both options.

  • What is the 'maxlen' parameter used for in the context of padding sequences?

    -The 'maxlen' parameter specifies the maximum length of the sequences after padding. Sequences longer than 'maxlen' can be truncated according to the specified truncation method.

  • How can sequences be truncated if they exceed the specified 'maxlen'?

    -Sequences can be truncated either by removing words from the end (post-truncation) or from the beginning (pre-truncation) of the sequence if they exceed the 'maxlen'.

  • What is a RaggedTensor and why is it mentioned as an advanced solution in the script?

    -A RaggedTensor is a tensor that can handle sequences of varying lengths without padding. It is mentioned as an advanced solution because it is more complex and beyond the scope of the series.

  • What will be the focus of the next video in the series after discussing tokenization and sequence creation?

    -The next video will focus on training a neural network with text data, specifically looking at a dataset with sentences classified as sarcastic and not sarcastic to determine if sentences contain sarcasm.
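
The padding and truncation options discussed above take only a few lines of Keras code to try. Below is a minimal sketch, assuming illustrative token sequences and parameter values (not the exact ones from the video):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Token sequences of different lengths (illustrative values).
sequences = [[4, 2, 1, 3], [5, 3, 2, 4], [7, 6, 3, 2, 4, 8, 9]]

# Default behaviour: zeros are added at the front (pre-padding),
# up to the length of the longest sequence.
print(pad_sequences(sequences))

# padding='post' puts the zeros after the sentence instead.
print(pad_sequences(sequences, padding='post'))

# maxlen caps the length; longer sequences are truncated.
# truncating='post' drops tokens from the end, 'pre' (the default) from the start.
print(pad_sequences(sequences, padding='post', maxlen=5, truncating='post'))
```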

Outlines

00:00

📚 Tokenization and Sequence Creation

In this segment, Laurence Moroney introduces the next step in the series on Natural Language Processing (NLP) with TensorFlow, which is converting tokenized words into sequences of numbers. He discusses the importance of managing sentences of different lengths and demonstrates the 'texts to sequences' method. The segment also addresses the challenge of out-of-vocabulary (OOV) words and how to handle them using an OOV token. The presenter shows how to use padding to standardize sequence lengths for neural network training, ensuring that all input data has the same dimensions.
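
As a minimal sketch of this first step, the Keras Tokenizer builds the word index and produces the sequences in a couple of calls. The sentences below are illustrative stand-ins for the corpus used in the video:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# A small corpus; the last sentence is deliberately longer than the others,
# so the resulting sequences have different lengths.
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)                    # builds the word index
sequences = tokenizer.texts_to_sequences(sentences)  # sentences -> token IDs

print(tokenizer.word_index)  # e.g. {'my': 1, 'love': 2, 'dog': 3, 'i': 4, ...}
print(sequences)             # one list of token IDs per sentence, varying lengths
```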

05:00

🔍 Advanced Text Processing Techniques

This segment delves into the nuances of text processing for neural network training. It clarifies the difference between padding (represented by '0') and the OOV token (represented by '1'). The presenter explains how to adjust padding to appear at the end of sequences and how to set a maximum length for sequences using the 'maxlen' parameter. It also covers how to handle sequences longer than the specified 'maxlen' by truncating them either from the beginning or the end. The segment invites viewers to explore the code through a provided URL and teases the next video's content, which will involve training a neural network to classify sentences as sarcastic or not sarcastic.
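
A short sketch of these options, again with illustrative sentences; note that index 0 is reserved for padding and, once an OOV token is set, index 1 is reserved for unknown words:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I love my dog', 'Do you think my dog is amazing?']

tokenizer = Tokenizer(oov_token='<OOV>')  # reserves index 1 for unknown words
tokenizer.fit_on_texts(sentences)

test_seq = tokenizer.texts_to_sequences(['I really love my manatee'])
padded = pad_sequences(test_seq, padding='post', maxlen=8)

print(test_seq)  # 'really' and 'manatee' both map to 1 (the <OOV> token)
print(padded)    # the trailing 0s are padding, not words
```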

Keywords

💡Tokenization

Tokenization is the process of converting a sequence of characters into a sequence of tokens or meaningful elements, such as words or phrases. In the context of the video, tokenization is essential for preparing text data for neural network training by turning words into numeric tokens. The script mentions using TensorFlow's tools for this purpose, which is a fundamental step in natural language processing for machine learning models.
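
For example, two calls are enough to see the numeric tokens assigned to a (hypothetical) pair of sentences:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['I love my dog', 'I love my cat'])
print(tokenizer.word_index)  # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
```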

💡Sequences

Sequences in the script refer to the ordered lists of tokens that represent sentences. The video discusses how to convert sentences into sequences of numbers, which are then used for training neural networks. The concept is central to understanding how neural networks can process varying sentence lengths and the importance of maintaining the order of words in a sentence.

💡Tokenizer

A tokenizer is a tool used in natural language processing to split text into tokens, which are typically words or phrases. The script explains how the tokenizer supports a method called 'texts to sequences' to automate the creation of token sequences from sentences. This tool is crucial for preparing data for neural network training, as it translates human language into a format that a machine can understand and process.

💡Out Of Vocabulary (OOV)

The term 'Out Of Vocabulary' or OOV refers to words that are not included in the initial vocabulary or word index used for training a model. The script discusses the challenge of encountering OOV words during neural network classification tasks and introduces the use of an OOV token to handle unrecognized words, ensuring that the model can still process sentences containing unfamiliar terms.
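
A short comparison, assuming the Keras Tokenizer and an illustrative corpus: without an OOV token, unknown words are silently dropped; with one, they keep their place in the sequence:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train = ['I love my dog', 'You love my dog!']
test = ['I really love my dog']  # 'really' was never seen during fitting

plain = Tokenizer()
plain.fit_on_texts(train)
print(plain.texts_to_sequences(test))     # 'really' vanishes: 4 tokens, not 5

with_oov = Tokenizer(oov_token='<OOV>')
with_oov.fit_on_texts(train)
print(with_oov.texts_to_sequences(test))  # 'really' becomes 1; length preserved
```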

💡Word Index

A word index is a mapping of words to unique numerical identifiers used in natural language processing. The script describes how the tokenizer uses a word index to convert words into tokens for sequence creation. The importance of having a comprehensive word index is highlighted to minimize the occurrence of OOV words and to maintain the integrity of the data during model training.

💡Padding

Padding is a technique used to ensure that all sequences in a dataset have the same length by adding a specified element, often zeros, to the beginning or end of sequences. In the script, padding is introduced as a solution to handle sentences of different lengths when preparing data for neural network training, allowing for uniform input size across all examples.

💡RaggedTensor

Although not fully explained in the script, a RaggedTensor is a data structure that can represent sequences of variable lengths, which is useful in neural network training for handling data with non-uniform sequence lengths. The script mentions RaggedTensor as an advanced concept beyond the scope of the series but implies its relevance to managing different sentence lengths.
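
For reference only, a ragged tensor can hold variable-length sequences directly, without padding. A minimal sketch (not covered in the video):

```python
import tensorflow as tf

sequences = [[4, 2, 1, 3], [5, 3, 2, 4], [7, 6, 3, 2, 4, 8, 9]]

ragged = tf.ragged.constant(sequences)
print(ragged.shape)          # (3, None) -- the second dimension varies per row
print(ragged.row_lengths())  # [4 4 7]
```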

💡Neural Networks

Neural networks are a set of algorithms designed to recognize patterns and learn from data input. In the video, the main theme revolves around preparing text data for training neural networks to perform tasks such as text classification. The script provides insights into how neural networks can be trained using tokenized and sequenced text data.
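
The padded sequences become ordinary fixed-size integer input for a model. As a rough sketch of where this is heading (the actual model is built in the next episode; the layer choices and sizes here are assumptions, not taken from the video):

```python
import tensorflow as tf

vocab_size = 100  # assumed to match num_words used when fitting the tokenizer

model = tf.keras.Sequential([
    # Turns each token ID into a small dense vector.
    tf.keras.layers.Embedding(vocab_size, 16),
    # Averages the word vectors into one vector per sentence.
    tf.keras.layers.GlobalAveragePooling1D(),
    # One output: e.g. sarcastic vs. not sarcastic.
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```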

💡Natural Language Processing (NLP)

Natural Language Processing, often abbreviated as NLP, is a field of computer science and artificial intelligence that focuses on the interaction between computers and human language. The script is part of a series on NLP, specifically discussing techniques for preparing text data for neural network training, which is a core application of NLP.

💡Sarcasm Detection

Sarcasm detection is a task within NLP that involves identifying whether a sentence is sarcastic or not. The script teases the next video in the series, which will focus on training a neural network to classify sentences as sarcastic or not, using a dataset labeled for sarcasm. This showcases an application of NLP and neural networks in understanding and processing human language nuances.

Highlights

Introduction to episode 2 of the Zero to Hero series on Natural Language Processing.

Continuation from the previous episode on word tokenization using TensorFlow's tools.

Exploration of creating sequences of numbers from sentences for neural network training.

Demonstration of the 'texts to sequences' method on sentences of different lengths.

Explanation of how to handle sentences with words not in the initial word index.

Introduction of the concept of Out Of Vocabulary (OOV) tokens to handle unknown words.

Technique to replace unrecognized words with an OOV token to maintain sequence integrity.

Discussion on the importance of a large word index for classifying texts outside the training set.

Introduction to the concept of padding to handle sequences of different lengths.

Explanation of using 'pad sequences' from preprocessing to equalize sequence lengths.

Illustration of how padding works with examples of sequence transformation.

Option to adjust padding position with the 'padding' parameter.

Customization of padded sequence length with the 'maxlen' parameter.

Handling of sequences longer than 'maxlen' with truncation options.

Invitation to try out the code from the video through a provided URL.

Preview of the next episode focusing on training a neural network with text data.

Teaser of a dataset involving classification of sarcastic and non-sarcastic sentences.

Transcripts

00:00

[MUSIC PLAYING]

00:03

LAURENCE MORONEY: Welcome to episode 2 of this series of Zero to Hero with Natural Language Processing. In the last video, you learned about how to tokenize words using TensorFlow's tools. In this one, you'll take that to the next step, creating sequences of numbers from your sentences and using tools to process them to make them ready for teaching neural networks.

00:25

Last time, we saw how to take a set of sentences and use the tokenizer to turn the words into numeric tokens. Let's build on that now by also seeing how the sentences containing those words can be turned into sequences of numbers. We'll add another sentence to our set of texts, and I'm doing this because the existing sentences all have four words, and it's important to see how to manage sentences, or sequences, of different lengths.

00:52

The tokenizer supports a method called texts to sequences which performs most of the work for you. It creates sequences of tokens representing each sentence. Let's take a look at the results. At the top, you can see the list of word-value pairs for the tokens. At the bottom, you can see the sequences that texts to sequences has returned. We have a few new words such as amazing, think, is, and do, and that's why this index looks a little different than before. And now, we have the sequences. So for example, the first sequence is 4, 2, 1, 3, and these are the tokens for I, love, my, and dog, in that order.

01:36

So now, we have the basic tokenization done, but there's a catch. This is all very well for getting data ready for training a neural network, but what happens when that neural network needs to classify texts, but there are words in the text that it has never seen before? This can confuse the tokenizer, so we'll look at how to handle that next.

01:57

Let's now look back at the code. I have a set of sentences that I'll use for training a neural network. The tokenizer gets the word index from these and creates sequences for me. So now, if I want to sequence these sentences, containing words like manatee that aren't present in the word index, because they weren't in my initial set of data, what's going to happen? Well, let's use the tokenizer to sequence them and print out the results. We see this: I really love my dog, a five-word sentence, ends up as 4, 2, 1, 3, a four-word sequence. Why? Because the word really wasn't in the word index. The corpus used to build it didn't contain that word. And my dog loves my manatee ends up as 1, 3, 1, which is my, dog, my, because loves and manatee aren't in the word index.

02:52

So as you can imagine, you'll need a really big word index to handle sentences that are not in the training set. But in order not to lose the length of the sequence, there is also a little trick that you can use. Let's take a look at that. By using the OOV token property, and setting it as something that you would not expect to see in the corpus, like angle bracket, OOV, angle bracket, the tokenizer will create a token for that, and then replace words that it does not recognize with the Out Of Vocabulary token instead. It's simple, but effective, as you can see here. Now, the earlier sentences are encoded like this. We've still lost some meaning, but a lot less. And the sentences are at least the correct length.

03:37

That's a handy little trick, right? And while it helps maintain the sequence length to be the same length as the sentence, you might wonder, when it comes to needing to train a neural network, how it can handle sentences of different lengths? With images, they're all usually the same size. So how would we solve that problem? The advanced answer is to use something called a RaggedTensor. That's a little bit beyond the scope of this series, so we'll look at a different and simpler solution, padding.

04:07

OK. So here's the code that we've been using, but I've added a couple of things. First is to import pad sequences from pre-processing. As its name suggests, you can use it to pad our sequences. Now, if I want to pad my sequences, all I have to do is pass them to pad sequences, and the rest is done for me. You can see the results of our sentences here. First is the word index, and then is the initial set of sequences. The padded sequence is next. So for example, our first sentence is 5, 3, 2, 4. And in the padded sequence, we can see that there are three 0s preceding it. Well, why is that? Well, it's because our longest sentence had seven words in it. So when we pass this corpus to pad sequences, it measured that and ensured that all of the sentences would have equally-sized sequences by padding them with 0s at the front. Note that OOV isn't 0. It's 1. 0 means padding.

05:05

Now, you might think that you don't want the 0s in front, and you might want them after the sentence. Well, that's easy. You just set the padding parameter to post like this, and that's what you'll get. Or if you don't want the length of the padded sentences to be the same as the longest sentence, you can then specify the desired length with the maxlen parameter like this. But wait, you might ask what happens if sentences are longer than the specified maxlen? Well, then, you can specify how to truncate: either chopping off the words at the end, with a post-truncation, or from the beginning with a pre-truncation. And here's what a post looks like.

05:45

But don't take my word for it. Check out the Codelab at this URL, and you can try out all of the code in this video for yourself. Now that you've seen how to tokenize your text and organize it into sequences, in the next video, we'll take that data and train a neural network with text data. We'll look at a data set with sentences that are classified as sarcastic and not sarcastic, and we'll use that to determine if sentences contain sarcasm. Really? No, no. I mean, really.

06:15

[MUSIC PLAYING]
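
For convenience, here is a consolidated sketch of the walkthrough above, using the TF 2.x preprocessing imports the video relies on. The sentences are illustrative stand-ins; the Codelab linked from the video has the canonical version:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',  # a longer sentence, on purpose
]

# Reserve token 1 for out-of-vocabulary words.
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)

# Training sentences become sequences of token IDs of varying lengths.
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

# Sentences containing unseen words keep their full length thanks to <OOV>.
test_data = ['I really love my dog', 'My dog loves my manatee']
print(tokenizer.texts_to_sequences(test_data))

# Pad (and, where needed, truncate) everything to length 5; 0 is the padding value.
padded = pad_sequences(sequences, padding='post', maxlen=5, truncating='post')
print(padded)
```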
