What is Bag of Words?

IBM Technology
3 Jun 2024 · 21:08

Summary

TLDR This video introduces the Bag of Words technique in natural language processing: a feature extraction technique that converts text into numbers, used for purposes such as email spam filtering. The video explains its definition, advantages, drawbacks, and ways to improve it. It also covers applications such as text classification, document similarity comparison, representing word meaning with Word2Vec, and sentiment analysis, in the hope of deepening interest in the field of artificial intelligence.

Takeaways

  • 📚 Bag of Words (BoW) is a feature extraction technique that converts text into numbers.
  • 🛍️ BoW is used in a variety of applications, such as email spam filtering.
  • 🐱 BoW applies not only to words but also to visual elements; an image of a cat, for example, can be broken down into key features.
  • 📈 BoW works in the back end of machine learning models for tasks such as text classification and document similarity comparison.
  • 📝 To convert text into numbers, a dictionary is built from the unique words and a document-term matrix is created.
  • 👍 The advantages of BoW are that it is simple and easy to explain.
  • 👎 The drawbacks of BoW include losing the meaning of compound words and ignoring relationships between words.
  • 🔍 Using n-grams makes it possible to take combinations of words into account.
  • 🌐 Text normalization reduces the dictionary size by reducing words to their base form, which helps mitigate the sparsity problem.
  • 📊 TF-IDF extends BoW with a weighting or scoring technique that evaluates word importance.
  • 📚 BoW is applied across natural language processing, including sentiment analysis, document classification, and word embeddings.

Q & A

  • What kind of technique is "Bag of Words"?

    -Bag of Words is a feature extraction technique that converts text into numbers. It treats text as a collection of different words and converts them into numbers that machine learning models can understand.

  • Why is Bag of Words useful for spam filters?

    -Bag of Words analyzes the frequency of the different words present, which helps distinguish spam emails from trustworthy ones.

  • What kinds of tasks is Bag of Words used for?

    -Bag of Words is used in a variety of NLP tasks, such as text classification, comparing document similarity, and finding the most relevant documents in a search engine.

  • What is meant by "visual words"?

    -"Visual words" apply the Bag of Words concept to images: an image is broken down into multiple distinct key features, which are used in computer vision techniques.

  • How is text represented in Bag of Words?

    -Text is represented by creating a "document-term matrix", which records as numbers how often each vocabulary term occurs in each document, as sketched below.
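
    As a minimal sketch of this idea (assuming a recent scikit-learn is installed; the video itself names no library), the document-term matrix for the video's two example sentences could be built like this:

        # Sketch: build a document-term matrix with scikit-learn's CountVectorizer.
        from sklearn.feature_extraction.text import CountVectorizer

        docs = ["I think. Therefore, I am.", "I love learning Python."]

        # A custom token pattern keeps one-letter words like "I" in the vocabulary.
        vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)
        matrix = vectorizer.fit_transform(docs)

        print(vectorizer.get_feature_names_out())  # the vocabulary of unique words
        print(matrix.toarray())                    # one frequency vector per document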

  • What are the advantages of Bag of Words?

    -Bag of Words is simple: features are created just by counting how often words occur, and it is more intuitive and explainable than some other algorithms.

  • What are the drawbacks of Bag of Words?

    -Drawbacks include losing the meaning of compound words and ignoring correlations between words, which can cause a loss of meaning.

  • What is an n-gram model, and how is it used to improve Bag of Words?

    -An n-gram model is an improvement that analyzes contiguous groups of words in the text, making it possible to capture correlations between words.

  • What is text normalization, and how does it help Bag of Words?

    -Text normalization preprocesses text, for example by reducing words to their stems, which shrinks the vocabulary and mitigates the sparsity problem.

  • What is TF-IDF, and how does it relate to Bag of Words?

    -TF-IDF is a weight or score indicating a word's importance, reflecting how often the word occurs within a particular document and in how many documents it appears. It is a technique that builds on Bag of Words.

  • What applications use Bag of Words?

    -Bag of Words is used in applications such as document classification, customer support ticket analysis, sentiment analysis, and detecting offensive text.

Outlines

00:00

🛍️ Introduction to Bag of Words

The first paragraph introduces Bag of Words, a feature extraction technique that converts text into numbers. Bag of Words is used in a variety of applications, such as email spam filtering: the technique treats text as a collection of words and judges whether an email is spam by analyzing the frequency of the words it contains. Bag of Words also applies beyond words to elements of images. For example, an image of a cat can be broken down into visual elements such as ears, whiskers, body, legs, and tail, for use in computer vision techniques such as object detection.

05:01

📚 A concrete Bag of Words example and its drawbacks

The second paragraph explains a concrete Bag of Words example and its drawbacks. It introduces common machine learning tasks where Bag of Words applies, such as document similarity and text classification. Two sentences are then used to show how Bag of Words represents text numerically: word frequencies are counted to build a document-term matrix. Problems are pointed out, however, such as lost correlations between word meanings, lost word order, and split compound words.

10:01

🔍 Ways to improve Bag of Words

The third paragraph lays out several problems with Bag of Words and proposes improvements. The problems include losing word meaning, losing word order, and sparsity. They can be addressed by using n-grams to capture combinations of words, and by text normalization such as stemming to reduce the vocabulary size.

15:05

📈 The concept and applications of TF-IDF

The fourth paragraph introduces TF-IDF, a metric that evaluates word importance. TF-IDF is a score combining term frequency with inverse document frequency, indicating how important a role a word plays. The metric is applied across natural language processing, including document classification and topic modeling.

20:05

🌟 Applications of Bag of Words and looking ahead

The fifth paragraph presents applications of Bag of Words and encourages further interest in natural language processing and artificial intelligence. Bag of Words is used in a variety of tasks, such as vector representations of words and sentiment analysis. It also touches on the possibility of training models to detect negative sentiment or abusive language. The video closes by asking viewers to like and subscribe if they enjoyed it.

Keywords

💡Bag of Words

Bag of Words is a feature extraction technique that converts text into numbers. The video explains that it is used in a variety of applications, such as email spam filtering. The technique converts the frequencies of the different words in a text into numerical data, a format machine learning models can understand. Using the sentences "I think. Therefore, I am." and "I love learning Python." as examples, the video walks through counting the words and building a document-term matrix.

💡Feature Extraction

Feature extraction is the process of converting data into numerical form that machine learning models can understand. In the video, Bag of Words is introduced as a feature extraction technique that converts text into numbers, an important step in natural language processing. Feature extraction is used in many machine learning tasks, such as text classification and document similarity comparison.

💡Text Classification

Text classification is the task of assigning text data to particular categories. The video gives email spam classification as an example: using Bag of Words, the frequencies of the words in an email are converted into numerical data, which can then be used to judge whether the email is spam, as in the sketch below.
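
A hedged sketch of this spam-filtering idea (the training emails and the Naive Bayes classifier are illustrative assumptions, not from the video):

    # Hypothetical spam classifier: bag-of-words counts fed to Naive Bayes.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = [
        "you won a lottery claim your million now",   # spam
        "lottery winner claim prize money",           # spam
        "meeting notes from our project review",      # not spam
        "lunch with friends this weekend",            # not spam
    ]
    labels = ["spam", "spam", "ham", "ham"]

    vectorizer = CountVectorizer()
    model = MultinomialNB().fit(vectorizer.fit_transform(emails), labels)

    print(model.predict(vectorizer.transform(["claim your lottery prize"])))  # ['spam']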

💡Document Similarity

Document similarity is the task of comparing how similar two documents are. The video describes the example of finding the documents most relevant to a query in a search engine. Using Bag of Words, documents are converted into numerical data, and their similarity can then be computed, as in the sketch below.
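
As a rough illustration (the video does not name a similarity measure; cosine similarity is a common assumption), two bag-of-words vectors can be compared like this:

    # Cosine similarity between two bag-of-words count vectors.
    import math

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Vectors over the vocabulary [I, think, therefore, am, love, learning, Python]
    doc1 = [2, 1, 1, 1, 0, 0, 0]  # "I think. Therefore, I am."
    doc2 = [1, 0, 0, 0, 1, 1, 1]  # "I love learning Python."
    print(cosine_similarity(doc1, doc2))  # ~0.38, since only "I" is shared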

💡Order of Words

The video points out losing word order as a drawback of Bag of Words. Word order plays an important role in the meaning of a sentence, but because Bag of Words treats words independently, that information is lost. For example, in "flight San Francisco Mumbai from to", the meaning is ambiguous without word order.

💡Compound Words

Compound words are terms in which two or more words combine to carry a new meaning. The video gives compounds such as "artificial intelligence" (AI) and "New York" as examples. In Bag of Words these words are treated separately, so the semantic link between them is lost.

💡n-gram

An n-gram is a combination of n consecutive words in a text. The video explains how n-grams address the compound-word problem: using n-grams, you can count how often words occur right next to each other.

💡Text Normalization

Text normalization is a preprocessing step that puts text into a consistent form. The video introduces stemming, a technique that reduces words to their base form and thereby shrinks the vocabulary. This helps mitigate the sparsity problem in Bag of Words.

💡TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a weight or score that evaluates the importance of a word. The video explains that TF measures how often a word occurs within a document, while IDF reflects how many documents the word is used in. TF-IDF assigns low scores to common words and high scores to rare ones, highlighting a document's main topics.
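
One common formulation (the video explains the intuition but not an exact formula, and libraries differ in details):

    tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is how often term t occurs in document d, N is the total number of documents, and df(t) is the number of documents containing t. A word that appears in every document gets log(N/N) = 0.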

💡Sparsity

Sparsity is one of the problems with Bag of Words: most elements are zero, with only a few meaningful entries. The video explains that over a large number of documents the vocabulary becomes very large, yet each individual document contains only a small fraction of it, so sparsity arises. This is an awkward property for machine learning models to handle.

Highlights

Bag of Words is a feature extraction technique that converts text into numbers.

A classic application of Bag of Words is email spam filtering.

Bag of Words applies not only to words but also to visual elements, such as features in an image.

Bag of Words can be used for text classification tasks, such as judging whether an email is spam.

Bag of Words can also be used to compare document similarity and to find the most relevant documents in a search engine.

By creating a vocabulary or dictionary, text is converted into feature vectors for machine learning models.

The advantages of Bag of Words include simplicity, ease of use, and explainability.

One drawback of Bag of Words is the compound-word problem: "artificial" and "intelligence" are treated separately, losing the semantics.

Bag of Words cannot associate correlations between words; for example, cake and baking are more likely to occur together than cake and racing.

Bag of Words loses the order between words, which can lead to ambiguity.

The Bag of Words approach suffers from sparsity: most elements are zero, and only a few are present.

n-grams are an improvement to Bag of Words that look at groups of words occurring together.

Text normalization, such as stemming, reduces the vocabulary size and helps with the sparsity problem.

TF-IDF is an extension of Bag of Words that assigns weights or scores to words.

TF-IDF has practical applications in document classification, customer support tickets, and sentiment analysis.

Word embeddings are another application of Bag of Words, representing words in an n-dimensional space.

Bag of Words helps in understanding natural language processing and encourages further exploration of artificial intelligence.

Transcripts

play00:00

We are going shopping for a new concept to learn.

play00:03

Keep your hands free because we are going to have a lot of bags to deal with.

play00:08

You guessed it.

play00:09

The topic for today is bag of words.

play00:14

Bag of words is a feature extraction

play00:16

technique to convert text into numbers,

play00:20

and it's exactly what it sounds like.

play00:23

A collection of different words.

play00:27

A great use case for bag of words is spam filters in your emails.

play00:32

For example, you might be receiving different emails

play00:36

about the latest news,

play00:39

maybe some interesting messages from your friends,

play00:43

and perhaps a few spammy content.

play00:46

Saying that you have won a lottery and you're about to become a millionaire.

play00:51

Bag of words looks at the different words present

play00:54

and the frequency in each of these emails to determine

play00:59

which of these would be spam.

play01:03

So today we are going to be looking at

play01:05

what bag of words means, as well as some examples.

play01:09

We will be looking at the pros and cons of bag of words,

play01:14

certain applications,

play01:17

and also modifications that we can use

play01:20

to improve our bag of words algorithm.

play01:26

Like I said,

play01:27

bag of words is a feature extraction technique, which means that

play01:32

all of your different texts or different words

play01:36

are converted into numbers.

play01:41

After all,

play01:41

numbers is what our machine learning models understand.

play01:45

I like to think of Bag of Words as a bag of popcorn.

play01:51

Let's think of the different words as kernels of popcorn.

play01:56

And each word represents a kernel.

play01:58

Or rather, each kernel represents a different word.

play02:02

The cool thing about Bag of Words is that it's not just limited to words, but

play02:06

it can also be applied to visual elements,

play02:09

which is bag of visual words.

play02:13

Let's say, for instance, you have an image of a cat.

play02:20

And yes, this is how I draw a cat,

play02:23

but you can break down this image of a cat

play02:25

into multiple different key features.

play02:29

You could have an ear, you could have

play02:32

whiskers, a body,

play02:35

legs and a tail.

play02:38

And each of these different elements help in multiple

play02:41

computer vision techniques, such as object detection.

play02:45

So you can use bag of words, not just in words,

play02:48

but also on visual words, which is images.

play02:52

Next, let's take a look at what bag of words looks like

play02:55

for different sentences, and see the pros and cons for it.

play02:59

Common NLP tasks where bag of words comes in handy is

play03:04

text classification.

play03:06

Let's say for example, spam or not,

play03:11

you could have your email

play03:14

and depending on what the words in that are, you could identify.

play03:18

So this is an example of text classification.

play03:23

Another example could be

play03:25

that of document similarity

play03:28

where perhaps you want to compare two different documents

play03:32

and check how similar they are to each other.

play03:36

Or maybe you have a particular query,

play03:40

like the type you put in a search engine,

play03:43

and you want to find the most relevant

play03:45

documents.

play03:49

Both of these examples, text classification and document similarity,

play03:53

use bag of words in the back end.

play03:57

Now let's take an example of two sentences

play03:59

and see how we can convert the text, or the words,

play04:04

into features, or numbers, for our machine learning model

play04:07

to understand.

play04:08

Consider two sentences.

play04:11

Sentence number one

play04:13

I think.

play04:17

Therefore, I am.

play04:23

And sentence number two.

play04:26

I love learning.

play04:31

Python.

play04:37

Now that we have our two examples sentences,

play04:40

what we are going to begin with is creating our vocabulary

play04:45

or a dictionary, which is the set of unique words

play04:47

present in all of the given documents.

play04:51

In our case, there are only two sentences that we are looking at.

play04:54

But let's take a look at all the unique words present in here.

play04:58

So we have

play05:00

I as a unique word.

play05:03

Think.

play05:06

Therefore.

play05:09

I has already been covered over here, so we move on to the next one.

play05:14

Am. Going to the next sentence.

play05:17

I is also included here.

play05:20

Love learning

play05:25

Python.

play05:27

That's 1, 2, 3, 4, 5, 6, 7.

play05:31

Seven words. Seven

play05:33

unique words is what makes up our dictionary or

play05:36

our vocabulary based on these two sentences.

play05:39

Let's look at what the text representation of the bag

play05:42

of words representation for each of those sentences would be,

play05:46

and what we are constructing over here is called a document term matrix.

play05:51

So here are our documents.

play05:52

We consider our first document.

play05:55

And these are the different terms or the vocabulary present in here.

play05:59

So going over the first sentence, I occurs a total of two times.

play06:05

So you look at the count of each of the particular words.

play06:09

And you try to see how many times it occurs in that particular sentence.

play06:13

So I is used a total of two times.

play06:16

Think once.

play06:18

Therefore once.

play06:20

Am, once.

play06:22

And in our first sentence, love, learning and Python do not appear,

play06:26

which is why they get a score of zero.

play06:30

Doing the same technique for our second sentence,

play06:33

I appears a total of one time.

play06:36

Think, therefore and am are absent in that sentence, which is why

play06:41

they get zero, and love, learning and Python each occur once,

play06:46

which is why they get one.

play06:48

So what you're seeing over here is a vector of numbers

play06:53

that represent the first sentence.

play06:56

So we have now taken words and converted it into

play07:00

a feature representation.

play07:02

That is we have numbers over here, which is what our machine

play07:05

learning models use to understand.

play07:08

And similarly

play07:10

this is the feature representation for our second sentence.
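
(A from-scratch sketch of the steps just walked through; the tokenization is deliberately naive and assumes punctuation and case have already been handled:)

    # Build the vocabulary and the document-term matrix by hand.
    docs = [
        ["I", "think", "therefore", "I", "am"],  # sentence 1
        ["I", "love", "learning", "Python"],     # sentence 2
    ]

    # Vocabulary: unique words in order of first appearance (7 words).
    vocabulary = []
    for doc in docs:
        for word in doc:
            if word not in vocabulary:
                vocabulary.append(word)

    # One frequency vector per document.
    matrix = [[doc.count(word) for word in vocabulary] for doc in docs]

    print(vocabulary)  # ['I', 'think', 'therefore', 'am', 'love', 'learning', 'Python']
    print(matrix[0])   # [2, 1, 1, 1, 0, 0, 0]
    print(matrix[1])   # [1, 0, 0, 0, 1, 1, 1]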

play07:16

Now that we've seen

play07:17

what bag of words looks like or how to calculate it,

play07:21

the pros are kind of obvious.

play07:25

It's simple, which is how you saw it.

play07:28

You count the number of times a particular word occurs, and you denote that count

play07:33

to that particular position for that sentence.

play07:37

It's easy, which is what we did over here.

play07:40

And it's explainable

play07:43

as opposed to certain other algorithms

play07:46

that maybe are not as intuitive.

play07:49

Unfortunately, as with all things in life.

play07:52

There are going to be pros and there are going to be cons.

play07:55

Next, we'll take a look at the cons of the simplistic algorithm

play07:59

and see if we can modify it to make it work better for us.

play08:03

Let's look at some of the drawbacks associated with bag of words.

play08:07

The first one being a compound word.

play08:10

Think about words like AI,

play08:13

artificial intelligence, or New York.

play08:18

In a simplistic bag of words approach.

play08:20

You break down artificial and intelligence, and now they are treated

play08:25

as two separate words with no correlation or no meaning between the two.

play08:30

That would apply to New York as well, where new is

play08:33

one word and York is another word.

play08:36

In this case, we are losing this semantic or the meaning

play08:40

that exists between the two words, which is a drawback.

play08:46

Let's look at another example.

play08:49

Perhaps cake.

play08:53

And baking.

play08:57

Maybe racing as well.

play09:01

Given these three words

play09:03

cake, baking and racing, cake and baking are more likely to co-occur, to occur

play09:08

in the same context, in the same documents

play09:12

as opposed to cake and racing.

play09:14

Well, of course, if tomorrow somebody invents

play09:16

a new sport called cake racing, that's going to change.

play09:20

But let's hope it doesn't.

play09:22

In this case, our Bag of words model is not able

play09:26

to associate the correlations that exist among the words,

play09:29

which might pose a problem to our machine learning models.

play09:36

Let's look at another

play09:37

drawback of bag of words.

play09:41

Consider the word Python.

play09:45

Looking at just this word, it's hard to tell

play09:48

if I'm talking about Python the programming language, or Python

play09:51

the animal.

play09:54

Maybe there's another word

play09:56

that's content, or content.

play10:00

It could mean either of the two, but just looking at the spelling, it's

play10:04

hard to see which is which.

play10:08

Another drawback that exists

play10:10

is that we lose the order associated between the words.

play10:14

Like I mentioned, Bag of words is nothing but a bag of popcorn,

play10:19

with each of the kernels being a specific word.

play10:22

And when you shake that bag, you lose all of the relationships that exist.

play10:27

As far as the order of the words is concerned.

play10:30

Let's say, for example,

play10:33

I have a sentence that says flight

play10:37

San Francisco,

play10:40

Mumbai from

play10:44

to.

play10:48

What does this mean?

play10:49

Am I trying to fly from

play10:52

San Francisco to Mumbai?

play10:56

Am I trying to fly the other way around

play10:59

from Mumbai to San Francisco?

play11:02

It's hard to tell when we have only

play11:04

the bag of words available.

play11:08

Last but not the least

play11:10

is the problem of sparsity

play11:13

in our bag of words approach.

play11:15

We look at each of the unique words which makes up our vocabulary,

play11:19

and denote the presence of that particular word in a sentence

play11:24

given a large number of documents.

play11:26

You could have a very, very high number of vocabulary or words.

play11:31

Yet in each of the sentences,

play11:34

there could be maybe only three words or a very, very small proportion of words

play11:40

that actually are present with most of the other spaces being zeros.

play11:50

This leads to the problem of sparsity.

play11:52

Since our matrix, or our vectors over here, is sparse

play11:57

in the sense most of these elements are unoccupied

play12:00

because they're denoted by zeros, and very few of them are actually present.

play12:05

This could also pose a challenge with our models.

play12:09

Fear not though.

play12:11

Despite these drawbacks, we do have a certain modification in mind.

play12:16

Let's take a look at some of the modifications that can help

play12:19

improve our bag of words

play12:20

approach.

play12:24

Our first modification is n grams.

play12:28

Instead of looking at each individual word,

play12:32

you can now look at a combination of words that occur together.

play12:36

For example, with artificial intelligence

play12:40

being the phrase, we don't break it into artificial and intelligence,

play12:44

but now we look at the presence of artificial and intelligence together

play12:50

and denote how many times it occurred in a particular document.

play12:54

Similarly, for New York, we look at the presence

play12:56

of New and York right after each other

play13:00

and denote the number of counts, or the times it occurs in that document.

play13:06

In this case, since our words

play13:08

are made up, or our phrases are made up, of two words

play13:12

n is equal to two. You could extend this with n

play13:16

equal to three, n equal to five, and so on and so forth.

play13:19

In which case you would look at, for example, if n is equal to three,

play13:24

you would look at three words that occur right next to each other.

play13:29

So maybe it is Python

play13:32

artificial intelligence.

play13:34

And any time these three words occur in your document,

play13:39

you would count the number of times that happens.

play13:42

And denote the occurrence in here.
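
(A minimal sketch of the n-gram idea, with an illustrative sentence that is not from the video:)

    # Count pairs of words (bigrams, n = 2) that occur right next to each other.
    from collections import Counter

    def ngrams(tokens, n=2):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "new york is not a new city".split()
    print(Counter(ngrams(tokens, 2)))
    # ('new', 'york') is now counted as one unit, so the compound keeps its meaning.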

play13:46

Another modification

play13:48

that we can do is text normalization.

play13:52

Text normalization refers to certain preprocessing activities

play13:55

that you can do before you pass on the text to your bag of words

play13:59

model.

play14:00

A good example for this is the process of stemming,

play14:05

in which case

play14:05

you're trying to remove the ends of the words

play14:09

in the hope of getting back to its base word or its base stem.

play14:14

Consider the words coding

play14:17

coded

play14:19

codes and code.

play14:23

When you start removing the ends of the words.

play14:27

You can try to get to its base word,

play14:31

which is code in this case.

play14:34

This is a way to reduce the number of vocabulary

play14:37

or reduce your dictionary words,

play14:39

and hopefully that will help with the sparsity issue.
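
(A short stemming sketch, assuming the NLTK library is installed; the video names the technique but no particular library:)

    # Reduce words to their base stem with the Porter stemmer.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["coding", "coded", "codes", "code"]:
        print(word, "->", stemmer.stem(word))
    # All four collapse to "code", shrinking the vocabulary.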

play14:43

An important concept that builds upon bag of words

play14:46

is TF-IDF, or term frequency

play14:50

inverse document frequency.

play14:54

You can think of TF-IDF

play14:56

as a weight or a score associated with words,

play14:59

or perhaps even a feature scaling technique.

play15:04

TF is the term frequency,

play15:06

or the number of times a particular word occurs in your document.

play15:11

Let's say the words vote,

play15:14

president,

play15:15

government occur a lot of times in your document.

play15:20

Probably has something to do with

play15:22

maybe elections or some other government matter.

play15:26

So higher the term frequency

play15:29

higher is the score or the weight associated with that word.

play15:33

That makes sense. With inverse document frequency,

play15:36

however, you look at the number of documents

play15:42

that that particular word occurs in.

play15:47

And if that word occurs in multiple documents

play15:51

or a huge proportion of documents, you actually give it a lower score.

play15:58

So the more number of documents the word occurs in, the lower

play16:01

the IDF score and the lower the whole TF-IDF score becomes.

play16:07

This may seem a little counterintuitive, right?

play16:10

It's opposite of the term frequency.

play16:14

But I give you the example of words like the,

play16:18

an and some.

play16:23

But words as such

play16:25

Don't really have any meaning on their own,

play16:28

but they're used to create grammatically correct sentences.

play16:32

As you can imagine, in an English language or a lot of documents

play16:36

with English language in it, these words would occur a lot of times.

play16:41

Perhaps, maybe even the most frequently occurring word.

play16:45

In that case, we do not want

play16:47

these words to have a high TF-IDF score,

play16:51

which is where the IDF component lowers their score.

play16:56

As these scores are not representative of the topics

play16:59

or the essence of the documents.
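
(A small sketch of one common TF-IDF formulation; exact weighting variants differ between libraries, and the example documents are made up:)

    # tfidf(word, doc) = tf * log(N / df): frequent here, rare everywhere scores high.
    import math

    docs = [
        ["the", "vote", "president", "government", "the"],
        ["the", "cake", "baking", "recipe"],
        ["the", "flight", "mumbai", "san", "francisco"],
    ]
    N = len(docs)

    def tfidf(word, doc):
        tf = doc.count(word)                     # term frequency in this document
        df = sum(1 for d in docs if word in d)   # documents containing the word
        return tf * math.log(N / df)

    print(tfidf("the", docs[0]))        # 0.0 -- "the" appears in every document
    print(tfidf("president", docs[0]))  # ~1.1 -- rare across documents, present here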

play17:03

Let's take a look

play17:04

at some applications of TF-IDF.

play17:12

Let's consider

play17:13

document classification as an example.

play17:18

Perhaps you have a company and a product that you're selling to your customers,

play17:23

and you have a support channel for them to come and raise

play17:26

certain concerns, complaints or questions about your product.

play17:33

Maybe you have a chat associated

play17:35

with your customers or some support tickets,

play17:40

and you could use the bag of Words approach

play17:43

to understand which of the teams

play17:46

is associated with the problem that is there in the ticket.

play17:50

Maybe you have a billing team

play17:53

or an onboarding team.

play17:57

Or a trial team.

play18:00

Or maybe it's a documentation issue.

play18:05

Looking at the vocabulary

play18:07

that is present, that is, looking at the bag of words representation

play18:11

of what is entailed in the customer chat or the support ticket.

play18:14

You will then be able to identify which of these teams is the right

play18:19

and appropriate team to deal with and resolve the customer's issue.

play18:25

Another example of bag of words

play18:26

is Word2Vec.

play18:31

You might have heard of Word2Vec.

play18:33

These are word embeddings

play18:35

that exist in an n dimensional space.

play18:40

Your words are represented

play18:41

as vectors in this n dimensional space.

play18:44

For example, king and queen are two words,

play18:50

and the closer the words are in this n dimensional space,

play18:53

that means they are more related to each other.

play18:56

In this case, king and queen would be fairly close to each other,

play19:00

as you would find documents or sentences where king and queen

play19:04

appear together.

play19:07

Maybe you have another word

play19:08

swim, that comes in those documents as well,

play19:12

but you wouldn't really associate swim with king or queen

play19:15

as much as you would with king and queen with each other.

play19:19

So swim would be further away from the vectors of king and queen.

play19:25

This is called Word2Vec, or word embeddings,

play19:28

and it does use bag of words as a back end to create this n dimensional space.
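
(A hedged sketch using the gensim library, version 4 or later, assuming it is installed; the toy corpus and parameters are illustrative only:)

    # Train tiny word embeddings; sg=0 selects CBOW, the continuous bag-of-words mode.
    from gensim.models import Word2Vec

    sentences = [
        ["king", "queen", "palace"],
        ["king", "queen", "crown"],
        ["swim", "pool", "water"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

    # Words that co-occur end up closer in the vector space; with a corpus this
    # small the numbers are noisy, but the idea is king/queen > king/swim.
    print(model.wv.similarity("king", "queen"))
    print(model.wv.similarity("king", "swim"))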

play19:37

Another example where

play19:38

bag of words comes in handy is for sentiment analysis.

play19:44

You could look at the collection of words in a given text,

play19:47

and understand if a lot of those words are positive.

play19:51

Maybe words like happy, joy,

play19:54

excited, or words that are negative,

play19:59

frustrated, angry, hate, terrible.

play20:04

And depending on the bag of words representation, you would be able

play20:08

to identify if the sentiment is positive or negative.
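
(A naive lexicon-based sketch of this idea; the word lists come from the video's examples, and the scoring rule is an assumption:)

    # Score text by counting positive and negative bag-of-words hits.
    POSITIVE = {"happy", "joy", "excited"}
    NEGATIVE = {"frustrated", "angry", "hate", "terrible"}

    def sentiment(text):
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(sentiment("I am so happy and excited today"))  # positive
    print(sentiment("I hate this terrible weather"))     # negative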

play20:14

You could even take this further and try to create a model

play20:18

that helps detect hate speech.

play20:21

So you would look

play20:22

at the negative sentiments or the negative words present in there,

play20:26

and maybe extend it with other words, for example, racism

play20:31

or other discrimination forms, and try to create a model that helps

play20:36

you distinguish these annoying or unwanted texts on the internet.

play20:43

Now that you have this concept in the bag,

play20:47

I hope this helps you understand a little more about natural language

play20:50

processing and encourages you

play20:52

to continue your journey into the field of artificial intelligence.

play20:56

If you like this video and want to see more like it, please like and subscribe.

play21:02

If you have any questions or want to share your thoughts about this topic,

play21:06

please leave a comment below.

Related Tags
Natural Language Processing · Text Analysis · Machine Learning · Feature Extraction · Spam Filter · Text Classification · Document Similarity · n-gram · Text Normalization · TF-IDF