What Is Stemming and How Does It Help Maximize Search Engine Performance?
Summary
TLDR This video script explains stemming in language processing. Stemming is a text pre-processing technique in natural language processing (NLP) that reduces the different forms of a word to a base form, improving the efficiency of search engines and information retrieval. Stemming is a rule-based approach that guesses the base form by trimming word endings, so it can lack accuracy. Lemmatization, by contrast, takes a word's context into account to obtain a more accurate base form. Stemming is simple and easy to implement, while lemmatization is computationally expensive but more accurate. The video also covers the benefits and limitations of stemming and how specific algorithms such as the Porter Stemmer and Snowball Stemmer work.
Takeaways
- 🌿 Stemming is a text pre-processing technique in natural language processing (NLP).
- 🔍 Stemming is the process of reducing a word to its base form (stem); for example, the various forms derived from the stem "connect" can all be reduced back to "connect".
- 💡 Stemming is useful in search engines and information retrieval, covering the various related forms of a word to make search results more accurate.
- 📚 Stemming is a heuristic, rule-based algorithm that finds a word's stem by trimming its ending.
- 📈 Stemming also helps with dimensionality reduction, cutting the number of unique words in a document and thereby the number of features in a machine learning model.
- 📖 Unlike lemmatization, which looks up a word's normalized dictionary form, stemming simply truncates word forms; it is easier to implement but sacrifices accuracy.
- 🤖 The Porter Stemmer, one of the most widely used stemming algorithms, trims words based on their combinations of vowels and consonants.
- 🌐 The Snowball Stemmer is a multilingual improvement on the Porter Stemmer that supports languages other than English.
- 🚫 Stemming has limitations, namely over-stemming and under-stemming, and can sometimes strip words of their meaning.
- 📚 Stemming can cause problems with recognizing proper nouns and handling homonyms, and may not work well for some languages.
Q & A
What do plants and words have in common?
-Both plants and words have stems. For a plant, the stem is the central part connecting the leaves, flowers, and fruit. For a word, the stem ties together related words across their different inflected forms.
What kind of process is stemming?
-Stemming is a text pre-processing technique in natural language processing. It is the process of reducing different words to their base form, or stem. For example, words such as "connected", "connection", and "connects" all derive from the stem "connect".
How does stemming affect search engine results?
-Stemming lets a search engine cover the various forms of related words, matching a search query to documents that contain those related forms. This improves the relevance and accuracy of the search results.
What is the difference between stemming and lemmatization?
-Stemming simply cuts off word endings in an attempt to reach the stem, whereas lemmatization tries to retrieve the normalized dictionary form of a word, that is, a word form that actually exists. Lemmatization needs more context and is more accurate, while stemming is simple and easy to implement.
What are the main reasons stemming is used?
-Stemming is mainly useful for search engines and information retrieval, for dimensionality reduction, and for improving the performance of machine learning models. By unifying different word forms under one stem, it improves the relevance and accuracy of search results and reduces the number of features, boosting model performance.
What is the Porter Stemmer?
-The Porter Stemmer is one of the most widely used stemming algorithms. It identifies the consonants and vowels in a word and, based on them, performs substitutions and eliminations to reduce the word to its stem.
How does the Snowball stemmer differ from the Porter Stemmer?
-The Snowball stemmer is a multilingual version of the Porter stemmer that can be used with languages other than English. In addition, NLTK's Snowball stemmer has a feature for handling "stop words", leaving them out of the stemming process.
What are the main challenges of stemming?
-The main challenges of stemming include over-stemming and under-stemming. These can cause words to lose their meaning or be reduced to the wrong stem.
How can stemming hurt named entity recognition?
-Stemming can reduce proper nouns to incorrect stems; for example, it might wrongly reduce "Boeing" to "Boe". This can hurt the recognition of proper nouns.
For which languages is stemming hard to apply?
-Stemming is hard to apply to languages with complex morphology, such as Arabic, because stemming algorithms struggle to correctly identify suffixes and prefixes.
How do you decide between stemming and lemmatization?
-The choice depends on your use case. If you need high accuracy, choose lemmatization, which is computationally more expensive. If you want something simple to implement and can sacrifice a little accuracy, choose stemming.
Outlines
🌿 The Basics of Stemming
This paragraph starts from the analogy between a plant's stem and a word's stem, then introduces stemming as a text pre-processing technique. Stemming is part of natural language processing (NLP), the field that enables computers to understand text and speech. Documents are tokenized and each word is reduced to its base form; for example, reducing the various forms of "connect" back to "connect" is what lets a search engine deliver highly relevant results. The paragraph also touches on the difference between stemming and lemmatization: stemming trims word endings to guess at the base form, whereas lemmatization retrieves the correct dictionary form.
🔍 Stemming vs. Lemmatization
The second paragraph details the technical differences between stemming and lemmatization and when to use each. Stemming is a simple technique that trims word endings, while lemmatization retrieves a more accurate form but consumes more computing resources. Using the WordNet dictionary database, lemmatization draws on a word's context and part-of-speech information to determine the correct base form. The trade-off: stemming is easy to implement, whereas lemmatization is more accurate but computationally expensive.
🛠 Applications of Stemming
The third paragraph explains where stemming is useful. In search engines and information retrieval, stemming reduces the various forms of a word to a base form, helping deliver highly relevant results. It is also used for dimensionality reduction, cutting the number of unique words in a document to improve machine learning model performance. The paragraph also walks through how a concrete stemming algorithm, the Porter Stemmer, works: analyzing a word's sounds and reducing the word according to specific patterns.
🚫 The Limitations of Stemming
The final paragraph covers the limitations and challenges of stemming algorithms. Using the word "therefore" as an example, it shows that the Porter Stemmer can produce an incorrect base form. It also introduces the Snowball Stemmer, a multilingual version of the Porter Stemmer that can be implemented with NLTK, a Python NLP library. Cases where stemming falls short, such as proper nouns and homonyms, are discussed as well. Finally, Arabic is mentioned as an example of a language where stemming struggles, though the details are not covered in this paragraph.
Keywords
💡stem
💡stemming
💡natural language processing (NLP)
💡tokenization
💡lemmatization
💡Porter Stemmer
💡Snowball stemmer
💡over-stemming
💡under-stemming
💡homonyms
Highlights
The riddle of what plants and words have in common: both have stems.
A word's stem is the base form linking its different forms; for example, "connect" is the stem of "connected", "connection", and "connects".
Stemming is the process of reducing a word to its base form, commonly used in search engines to deliver relevant search results.
Stemming is a text pre-processing technique in natural language processing (NLP), a sub-field of artificial intelligence.
NLP includes breaking documents down into smaller components such as paragraphs, sentences, and words (tokens).
Stemming operates at the level of words (tokens), simplifying them to their base forms.
The difference between stemming and lemmatization: stemming finds the stem by cutting off word endings, while lemmatization finds the standard dictionary form.
Stemming is a heuristic, rule-based algorithm, whereas lemmatization needs more contextual information.
Lemmatization uses resources such as WordNet to determine a word's correct form.
Stemming is easy to implement but less accurate; lemmatization is computationally expensive but more accurate.
Stemming suits search engines and information retrieval, improving the relevance and accuracy of search results.
Stemming is also used for dimensionality reduction, cutting the number of features in machine learning models and improving performance.
The Porter stemming algorithm identifies the consonants and vowels in a word, then performs substitutions and eliminations based on the number of consonant-vowel pairs.
An example of the Porter algorithm: reducing "caresses" to "caress".
A limitation of the Porter algorithm: "therefore" is wrongly reduced to "therefor".
The Snowball stemming algorithm is a multilingual version of the Porter algorithm, available in NLTK.
Common stemming problems include over-stemming and under-stemming.
Stemming can run into trouble with named entity recognition and with homonyms.
Despite its limitations, stemming remains a simple yet powerful technique, especially in the right applications.
Transcripts
What do plants and words have in common?
I'll give you a hint.
It's on the whiteboard.
Both of them have stems.
For a plant, the stem is the central part that connects it to the leaves,
the flowers, the fruits.
And each word has a stem too.
Today we'll be talking about stemming.
Consider the word "connect".
This is a stem for words like connected, connection, connects,
and of course, the word connect itself.
Reducing each of these different words that I've listed over here to connect
is the process of stemming,
in which case the connect is the base form of the word.
Let's say, for instance, you want to become a millionaire.
Honestly, who doesn't?
So what's the first step that you take to know how to be a millionaire?
Lots of questions, right?
Perhaps you start off with a search query.
You pull up your favorite search engine,
and you type in "how to invest so that I can become a millionaire".
And what you'll notice is the search results that pop up
don't just have the word invest,
but also have words related to invest
like invested, investing, investment, and so on.
The process that is making all of this happen,
so that you can receive relevant search results
which cover all the different variations and all the different forms of the words,
like invest in this case, is stemming.
That's what the magic is.
So today we'll be seeing about stemming, what it entails, how it's used,
comparing it with another alternative,
seeing an algorithm of stemming, which is called a stemmer, in action
and ending with some caveats.
Stemming is a text pre-processing technique
that's used in natural language processing.
Natural language processing, or NLP, is a sub-branch of artificial intelligence.
It's the way our computers, machines, can understand how you and I communicate
using text or speech.
Natural language processing includes different tasks
to take all of our documents, or all of our data set,
and break it down into smaller components.
Let's say you have a set of documents.
You continue breaking it down into smaller components
to make it more easily digestible for your machine.
So each document can be broken down into paragraphs.
Each paragraph can be broken down into sentences.
And finally, each sentence can be broken down into different words.
And these words over here are what are called tokens.
And this entire process that we have done of taking the data set from the documents
to the paragraph, to the sentences to the words,
is called tokenization.
Stemming as a technique operates at the level of tokens.
And now we'll take a deeper look into how that looks.
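The document-to-token breakdown described here can be sketched in a few lines of Python. This is a deliberately naive illustration (splitting on blank lines, on ". ", and on whitespace); real tokenizers, such as those in NLTK, handle punctuation and abbreviations far more carefully.

```python
# A minimal sketch of tokenization: a document is broken into
# paragraphs, then sentences, then word-level tokens. Illustrative only.

def tokenize(document: str) -> list[str]:
    tokens = []
    for paragraph in document.split("\n\n"):       # document -> paragraphs
        for sentence in paragraph.split(". "):     # paragraphs -> sentences (naive)
            for word in sentence.split():          # sentences -> word tokens
                tokens.append(word.strip(".,!?").lower())
    return tokens

doc = "Stemming operates on tokens. Tokens come from tokenization."
print(tokenize(doc))
# → ['stemming', 'operates', 'on', 'tokens', 'tokens', 'come', 'from', 'tokenization']
```

Stemming would then be applied to each entry in the resulting token list.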
So you caught a glimpse of stemming,
but there's also another text pre-processing technique called lemmatization.
Let's take a look at the differences between the two.
Stemming tries to cut the ends of the word
in the hope of getting to its base word, or its stem.
In this case, for "happy", it would cut the "y" and make it "happi".
But in lemmatization, it tries to get to the normalized form of a word.
That is, the word form that already exists in the dictionary, in which case "happy"
will just stay "happy".
As you can imagine, stemming is more of a heuristic algorithm
and is very rules based.
It looks at the ends of the word
and tries to guess or tries to estimate what the base word could be.
For example, it would look at words ending in "ing"
and remove the "ing" to get the base form, which in this case works.
However, consider the word "nothing".
It tries to apply the same logic to it and you end up with "noth".
Which is not really correct.
Lemmatization, on the other hand, would actually give you "nothing".
But the caveat here is that it requires more context.
It requires information like part of the speech,
the context of the word, how it's being used.
And it uses all of that with relation to something called WordNet.
WordNet is a huge graph which gives you relationships amongst different words,
the synonyms, the type of definitions that they have, so on and so forth.
That goes to say that stemming is fairly easy,
and simple to implement.
Whereas lemmatization is computationally more expensive.
But also more accurate.
Think of an example where you have the word "better".
Stemming it would simply give you "better", which is again not what we want.
But lemmatization,
powered with that additional context and additional knowledge,
would actually be able to tell us that better is a form of "good".
Choosing one or the other really depends on your use case.
If you want high accuracy
but you are okay with compromising on it being computationally expensive,
go with lemmatization.
If you want that something that's simpler and easier to implement,
while compromising on accuracy a little bit, go with stemming.
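The contrast can be made concrete with a toy sketch: a blind suffix-cutter on one side, and a "lemmatizer" faked with a tiny hand-made dictionary on the other, standing in for a real resource like WordNet. The words and outputs mirror the examples above; the function names and the dictionary are illustrative, not a real library API.

```python
def crude_stem(word: str) -> str:
    # Heuristic and rules-based: look only at the ending, never at meaning.
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("y"):
        return word[:-1] + "i"
    return word

# A stand-in dictionary; a real lemmatizer consults WordNet plus
# part-of-speech context to find the normalized form.
LEMMAS = {"happy": "happy", "nothing": "nothing", "better": "good"}

def crude_lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)

for w in ("happy", "nothing", "better"):
    print(f"{w}: stem={crude_stem(w)}, lemma={crude_lemmatize(w)}")
# → happy: stem=happi, lemma=happy
# → nothing: stem=noth, lemma=nothing
# → better: stem=better, lemma=good
```

The stemmer needs no external knowledge at all, which is exactly why it is cheap, and exactly why it misfires on "nothing" and "better".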
Now let's see what the use cases of stemming are.
Why should we use stemming?
First one is search engine, or information retrieval.
Think back to our example of wanting to become a millionaire
and putting in a search query of "how to invest to become a millionaire".
Even though your query has the word "invest",
the search results and documents that come up
have the word "investment", or maybe "investing" or "invested".
This is where stemming comes into play.
Trying to get all of those different forms.
Different morphological variants of that particular word "invest".
This in turn gives you more relevant results,
thereby increasing the efficiency of the search.
While also increasing the accuracy of the results that you get.
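Stemmed matching can be sketched as follows: both the query term and every document word are reduced to stems before comparison, so a query for "invest" also surfaces documents containing "invested" or "investment". The suffix list and the document set here are hypothetical toys; a real engine would use a full stemmer over an inverted index.

```python
def retrieval_stem(word: str) -> str:
    # Toy suffix rules, longest first; illustrative only.
    for suffix in ("ment", "ing", "ed"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

docs = {
    "doc1": "she invested early and retired rich",
    "doc2": "smart investment strategies",
    "doc3": "gardening tips for beginners",
}
query = "invest"
hits = [name for name, text in docs.items()
        if any(retrieval_stem(w) == retrieval_stem(query) for w in text.split())]
print(hits)  # → ['doc1', 'doc2']
```

Without stemming, an exact-match search for "invest" would return none of these documents.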
The next use is dimensionality reduction.
All of the different unique words that exist in your documents
are what comprise the vocabulary, the entire vocabulary.
Let's say, for instance, you have "change", "changing", "unchanged" in the vocabulary.
Which are three unique words making your vocabulary size three.
If instead you were to use only the stem for it,
which in this case is "change",
your vocabulary would now reduce to just one word.
This leads to a reduction in dimensions
or the number of features that your machine learning model will now have.
It again increases the accuracy as well as the precision.
This helps you get better performances with statistical NLP models,
especially the ones concerned with word embeddings and topic models.
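The vocabulary shrinkage can be sketched directly: four unique tokens all collapse onto one stem, so a bag-of-words model needs one feature instead of four. The suffix rules below are an illustration chosen to make the collapse visible, not Porter's actual rules.

```python
def vocab_stem(word: str) -> str:
    # Toy rules, longest match first, so "es" wins over "e".
    for suffix in ("ing", "es", "ed", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

tokens = ["change", "changing", "changed", "changes"]
print(len(set(tokens)))                        # → 4 features without stemming
print(len({vocab_stem(t) for t in tokens}))    # → 1 feature with stemming
```

Here every token reduces to the stem "chang", shrinking the feature space by a factor of four.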
Well, enough talk about all of this.
Let's go and see how exactly a stemming algorithm, or "stemmer", works.
So hopefully you're on board with stemming,
but now let's see the algorithm in action.
A stemmer is what a stemming algorithm is called.
And let's see how that works.
One of the most widely used stemmers is called "Porter Stemmer",
and the way it works is that it looks at each word
and identifies the consonants and vowels present in there,
and then it does a bunch of substitutions and eliminations,
based on the number of consonants and vowel pairs present in that.
Let's say, for example, we have the word "caresses".
And these are the three rules that might apply to "caresses".
You look at "sses" present and replace it with "ss".
Or you look at "ss" present and keep it the same.
Don't replace it.
Or you look at "s" and just eliminate it.
That is, replace it with nothing.
As you can see, for the word "caresses",
more than one of these rules would apply.
But the trick over here is to look at the longest matching substring.
Which in our case is "sses".
And apply the first rule over here.
So this will then become "caress", which is correct.
Let's say instead the word was "caress".
In that case, "ss" would stay the same,
which is again "caress", which is also correct.
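These plural-ending rules are Step 1a of the Porter algorithm, and applying them longest-match-first is a few lines of code. One note: Porter's original definition has a fourth rule in this step, "ies" → "i", which is included here for completeness.

```python
# Porter's Step 1a, applied longest-match-first. This reproduces just
# this one step of the algorithm, not the whole thing.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1a(word: str) -> str:
    for suffix, replacement in RULES:          # ordered longest-first
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(step1a("caresses"))  # → caress  ("sses" replaced by "ss")
print(step1a("caress"))    # → caress  ("ss" kept as-is)
print(step1a("cats"))      # → cat     ("s" eliminated)
```

Because the rules are tried in order of suffix length, "caresses" matches "sses" before the shorter "ss" or "s" rules ever get a chance.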
However, nothing in life is perfect,
and the same goes for the Porter Stemmer.
Let's look at one example which showcases the limitation that the stemmer has.
Consider the word "therefore".
I've identified each of the consonants and vowels present in this particular word.
Let's look at these.
So two consonants coming together,
we can collapse them and make them one.
So that's "c".
Let's also look at the pairs of VCs, that is, a vowel and a consonant coming together.
As you can see, there are three of those occurrences.
One, two and three,
in which case we would denote it as VC raised to the power of
the number of times it appears, in this case three.
And finally ending with the V over here.
This exponent, three in our case, is what's called the measure.
And based on the algorithm there are certain rules that apply to the exponent.
One of such rules, like the ones that we saw over here,
says that if the exponent is non-zero, which means it is greater than zero,
which it is for our case,
and if the word in consideration ends with an "e",
you must eliminate "e".
Which then gives us our stem:
"therefore" minus the "e", that is "therefor", which is incorrect.
Even though this is a limitation of the Porter stemming algorithm,
the Porter stemmer still remains one of the most widely used
because of the amount of rules that it has
and the amount of correct results it does give,
even if stemming as a whole is heuristic and simplistic in nature.
Another stemmer that you might have heard of is called the Snowball stemmer.
It's a modification of the Porter stemmer.
The Porter stemmer was created only to be used with English words.
However, the Snowball stemmer is multilinguistic,
which means it does work on languages other than English.
The Python implementation of the Snowball stemmer
is available through NLTK, which stands for Natural Language Toolkit,
an NLP package.
NLTK's Snowball stemmer also gives you the option to ignore "stop words",
so stop words would then be left out of the process of stemming.
So those are two ways that the Snowball stemmer is different from the Porter Stemmer.
Let's continue looking at some more issues and limitations that exist with stemming.
Two of the most common issues that arise with stemming
are overstemming and understemming.
Let's take a look at some examples.
Considers "universal", "universe" and "university".
Overstemming, as it sounds, over does the stemming part,
or removes too much, more than necessary,
to the point where the words can lose meaning themselves.
The stem for each of those would be
"universe" without the "e" ("univers"),
And based on our knowledge of the English language,
we know that this is incorrect because universal, universe and university
are three words that mean totally different things.
The opposite end of this is "understemming",
where not enough removal has happened.
Consider the example of "alumnus", "alumna", and "alumni".
"Alumna" and "alumni" would remain the same,
whereas "alumnus" would become "alumna",
giving us three different stems, or three different base words.
Which in this case is also incorrect,
because each of these three words mean the same thing,
and so should have the same stem.
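A toy rule set makes the over-stemming collapse concrete. The suffixes below are chosen for illustration, but the real Porter stemmer likewise reduces all three of these words to the same stem "univers".

```python
def collapse_stem(word: str) -> str:
    # Over-eager toy rules: strip "ity", "al", or a final "e".
    for suffix in ("ity", "al", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

words = ("universal", "universe", "university")
print({w: collapse_stem(w) for w in words})
# → {'universal': 'univers', 'universe': 'univers', 'university': 'univers'}
```

Three words with entirely different meanings end up indistinguishable, which is precisely the over-stemming failure mode described above.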
Some of the other challenges that exist with stemming are:
it fares badly with named entity recognition.
For example, you might have some proper nouns.
Consider "Boeing".
Based on the examples that we have seen previously,
stemming would reduce the "ing" from it,
giving you the stem of "Boe",
which we know is incorrect because it's a proper noun.
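The pitfall is easy to reproduce with a blind "-ing" rule: it has no notion of proper nouns, so "Boeing" gets mangled exactly like an ordinary verb form.

```python
def strip_ing(word: str) -> str:
    # Blindly cut a trailing "ing" regardless of what the word is.
    return word[:-3] if word.lower().endswith("ing") else word

print(strip_ing("connecting"))  # → connect  (intended behavior)
print(strip_ing("Boeing"))      # → Boe      (wrong: it's a proper noun)
```

A real pipeline would typically run named entity recognition first and exempt recognized entities from stemming.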
Another example might be that of homonyms.
Consider "rose" the flower,
and "rose", sun rising - the past tense of "rise".
In this case too, the stemming algorithm
would inaccurately get the stem for each of those as "rise".
Which does make sense for the sun rose,
but does not make sense for the flower rose.
You might also run into issues trying to attempt stemming
on languages like Arabic,
which have complex forms present, as it can be difficult to understand
what suffixes and what prefixes are present.
Stemming is a simple yet powerful technique when used in the right way.
Hopefully you learned something useful along the way
as we talked about stemming,
and it continues to grow strong roots
as you go on in your journey of artificial intelligence.
If you like this video and want to see more like it,
please like and subscribe.
If you have any questions or want to share your thoughts about this topic,
please leave a comment below.