Naive Bayes, Clearly Explained!!!

StatQuest with Josh Starmer
3 Jun 2020 · 15:12

Summary

TL;DR: In this video, Josh Starmer introduces the concept of Naive Bayes classification, focusing on the Multinomial Naive Bayes classifier. He explains its application in spam detection by calculating probabilities based on word frequencies in normal and spam messages. He also covers key concepts such as the prior probability and likelihoods, and shows how to overcome the zero-probability problem with smoothing. Despite its simplicity, Naive Bayes proves highly effective at classifying messages. The video concludes with a discussion of why Naive Bayes is considered 'naive' and points to additional resources for further study.

Takeaways

  • 🤖 Naive Bayes is a classification technique used to predict outcomes based on word probabilities.
  • 📊 A common version of Naive Bayes is the multinomial classifier, which is useful for text classification like spam filtering.
  • 📈 The classifier works by creating histograms of words from normal and spam messages to calculate word probabilities.
  • 📝 Probabilities of words appearing in normal or spam messages are used to predict whether a new message is spam (a minimal scoring sketch follows this list).
  • 💡 Likelihoods and probabilities are interchangeable terms in Naive Bayes when dealing with discrete words.
  • ⚖️ A prior probability is an initial guess about how likely a message is to be normal or spam, based on training data.
  • 🔄 The Naive Bayes classifier treats word order as irrelevant, simplifying the problem but ignoring language structure.
  • 🛠 A common problem, zero probability, is resolved by adding a count to each word's occurrences, known as 'smoothing'.
  • 🚨 Naive Bayes is considered 'naive' because it assumes all features (words) are independent, ignoring relationships between them.
  • 📚 Despite its simplicity, Naive Bayes often performs well in real-world applications, especially for tasks like spam filtering.
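
The scoring rule summarized in these takeaways can be written out in a few lines. Below is a minimal, illustrative Python sketch: the normal-message word probabilities and both priors mirror the numbers worked out in the video, while the spam-side values for 'friend' and 'money' are assumptions chosen only to be consistent with the video's results.

```python
from math import prod

# Per-word probabilities estimated from the training histograms.
# Normal-message values are the ones stated in the video; the spam values for
# "friend" and "money" are assumptions consistent with its final scores.
p_word_given_normal = {"dear": 0.47, "friend": 0.29, "lunch": 0.18, "money": 0.06}
p_word_given_spam   = {"dear": 0.29, "friend": 0.14, "lunch": 0.00, "money": 0.57}

prior_normal, prior_spam = 0.67, 0.33  # 8 of 12 and 4 of 12 training messages

def score(words, prior, p_word_given_class):
    # Naive Bayes score: prior times the product of per-word probabilities.
    return prior * prod(p_word_given_class[w] for w in words)

message = ["dear", "friend"]
print(round(score(message, prior_normal, p_word_given_normal), 2))  # ~0.09
print(round(score(message, prior_spam, p_word_given_spam), 2))      # ~0.01
```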

Q & A

  • What is the main focus of the video?

    -The video explains the Naive Bayes classifier, with a focus on the multinomial Naive Bayes version. It discusses how to use it for classifying messages as spam or normal.

  • What is the difference between multinomial Naive Bayes and Gaussian Naive Bayes?

    -Multinomial Naive Bayes is commonly used for text classification, while Gaussian Naive Bayes is used for continuous data like weight or height. The video focuses on the former and mentions that Gaussian Naive Bayes will be covered in a follow-up video.

  • How does Naive Bayes classify messages as normal or spam?

    -Naive Bayes calculates the likelihood of each word in a message being present in either normal or spam messages, multiplies these probabilities by initial guesses (priors) for each class, and compares the results to decide whether the message is normal or spam.

  • What is a 'prior probability' in the context of Naive Bayes?

    -A prior probability is the initial guess about the likelihood of a message being normal or spam, regardless of the words it contains. It is typically estimated from the training data.

  • Why is Naive Bayes considered 'naive'?

    -Naive Bayes is considered 'naive' because it assumes that all features (words in this case) are independent of each other. In reality, words often have relationships and specific orders that affect meaning, but Naive Bayes ignores this.

  • What problem arises when calculating probabilities for words not in the training data, and how is it solved?

    -If a word like 'lunch' does not appear in the training data, its probability becomes zero, which leads to incorrect classifications. This problem is solved by adding a small count (known as 'smoothing' or 'Laplace correction') to every word in the histograms.

  • How does Naive Bayes handle repeated words in a message?

    -Naive Bayes considers the frequency of words. For example, if a word like 'money' appears multiple times in a message, it increases the likelihood of the message being spam if 'money' is more common in spam messages.

  • What are 'likelihoods' in the Naive Bayes algorithm?

    -Likelihoods are the calculated probabilities of observing specific words in either normal or spam messages. In this context, likelihoods and probabilities can be used interchangeably.

  • How does Naive Bayes perform compared to other machine learning algorithms?

    -Even though Naive Bayes makes simplistic assumptions, it often performs surprisingly well for tasks like spam detection. It tends to have high bias due to its naivety but low variance, meaning it can be robust in practice.

  • What is the purpose of the alpha parameter in Naive Bayes?

    -The alpha parameter is used for smoothing to prevent zero probabilities when a word doesn't appear in the training data. In this video, alpha is set to 1, meaning one count is added to each word in the histograms. (A short scikit-learn usage sketch follows this Q&A.)
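
For readers who want to try this on real data, here is a minimal usage sketch with scikit-learn's MultinomialNB (assuming scikit-learn is installed; the tiny training messages below are made up for illustration). It follows the same recipe as the video: bag-of-words counts, priors estimated from the training labels, and alpha = 1 smoothing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages labelled normal/spam, just to show the API.
train_texts = [
    "dear friend lunch money",
    "dear dear friend lunch",
    "money money money dear",
    "send money money now",
]
train_labels = ["normal", "normal", "spam", "spam"]

vectorizer = CountVectorizer()          # bag-of-words: word counts, order ignored
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB(alpha=1.0)          # alpha=1 adds one count to every word
clf.fit(X_train, train_labels)

X_new = vectorizer.transform(["lunch money money money money"])
print(clf.predict(X_new))               # the class with the higher score wins
```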

Outlines

00:00

👨‍🏫 Introduction to Naive Bayes and StatQuest

The narrator, Josh Starmer, introduces StatQuest, a series that covers topics like Naive Bayes classification. He explains that the focus of this video is the Multinomial Naive Bayes classifier, while the Gaussian version is covered in a follow-up video. The concept is introduced through an example of filtering spam from the normal messages received from friends and family, using word probabilities calculated from the training data.

05:00

📊 Calculating Probabilities for Normal and Spam Messages

The process of calculating word probabilities from histograms of normal and spam messages is discussed. Words like 'dear' and 'friend' are used to demonstrate how probabilities are assigned based on occurrences in normal messages. Similar calculations are made for spam messages.
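
As a concrete check of this step, here is a small Python sketch that turns the normal-message histogram into word probabilities. The counts (dear = 8, friend = 5, lunch = 3, money = 1; 17 words in total) are inferred from the probabilities and totals stated in the video, so treat them as an assumption rather than a transcription.

```python
from collections import Counter

# Word counts for the normal messages (consistent with the video's probabilities).
normal_counts = Counter({"dear": 8, "friend": 5, "lunch": 3, "money": 1})
total_words = sum(normal_counts.values())                  # 17

p_word_given_normal = {w: c / total_words for w, c in normal_counts.items()}
print(round(p_word_given_normal["dear"], 2))               # 0.47
print(round(p_word_given_normal["friend"], 2))             # 0.29
```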

Keywords

💡Naive Bayes

Naive Bayes is a type of probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features. In the video, it is used to explain how this model helps classify whether messages are spam or normal by calculating probabilities based on word occurrences. The 'naive' aspect refers to the assumption that all features (words) are independent, which simplifies calculations but ignores word order.

💡Multinomial Naive Bayes

Multinomial Naive Bayes is a variant of Naive Bayes typically used for document classification, where features represent the frequencies of words. In this context, it is mentioned as the focus of the video, being applied to classify messages as spam or normal based on word probabilities. The video uses examples such as 'dear' and 'friend' to demonstrate how the model calculates probabilities based on word counts.

💡Gaussian Naive Bayes

Gaussian Naive Bayes is another variant of Naive Bayes, used when features follow a continuous distribution, typically a Gaussian (normal) distribution. It is briefly mentioned in the video as a follow-up topic and is contrasted with Multinomial Naive Bayes, which deals with discrete features like word counts.

💡Histogram

A histogram is a graphical representation that organizes a group of data points into specified ranges. In the video, histograms are used to show the frequency of words in normal and spam messages, helping calculate the probabilities of seeing each word in either category. These probabilities form the foundation of the Naive Bayes classification.

💡Prior Probability

Prior probability refers to the initial probability of a hypothesis before any evidence is considered. In the video, it represents the initial guess about whether a message is normal or spam before looking at specific words. For example, the prior probability that a message is normal is calculated based on the proportion of normal messages in the training data.
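
A quick sketch of the prior calculation from the video's training set (8 normal and 4 spam messages out of 12):

```python
n_normal, n_spam = 8, 4                               # messages in the training data
prior_normal = n_normal / (n_normal + n_spam)         # 8 / 12
prior_spam = n_spam / (n_normal + n_spam)             # 4 / 12
print(round(prior_normal, 2), round(prior_spam, 2))   # 0.67 0.33
```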

💡Likelihood

Likelihood refers to the probability of observing the data given a certain hypothesis. In this case, it is used to calculate the probability of encountering certain words given that the message is either normal or spam. The video clarifies that in this context, probabilities and likelihoods are interchangeable because the focus is on discrete word occurrences.

💡Spam Classification

Spam classification is the process of identifying whether a message is unsolicited or irrelevant. In the video, this is the main task the Naive Bayes model is being applied to, by calculating word probabilities and comparing them between normal and spam messages. Examples include calculating the probability of words like 'money' or 'dear' to classify new messages as spam or normal.

💡Alpha (Smoothing)

Alpha represents the smoothing parameter in Naive Bayes, used to avoid zero probabilities for words that never appear in a class's training data. The video explains that adding one count to each word (alpha = 1) prevents zero probabilities, so a message containing a word like 'lunch', which never appears in the spam training messages, can still receive a sensible spam score. This technique is crucial for the robustness of the model.
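
The smoothing step can be sketched as follows. The spam totals (7 words, a 4-word vocabulary) and the 'dear' count come from the video; the individual spam counts for 'friend' and 'money' are assumptions for illustration.

```python
alpha = 1                                            # pseudo-count added to every word
spam_counts = {"dear": 2, "friend": 1, "lunch": 0, "money": 4}   # "friend"/"money" assumed
total_spam_words = sum(spam_counts.values())         # 7
vocab_size = len(spam_counts)                        # 4

def smoothed_prob(word):
    # Laplace-smoothed estimate: never zero, even for unseen words like "lunch".
    return (spam_counts[word] + alpha) / (total_spam_words + alpha * vocab_size)

print(round(smoothed_prob("lunch"), 2))              # (0 + 1) / (7 + 4) = 0.09
```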

💡Bag of Words

The 'bag of words' model treats text as a collection of words, ignoring grammar and word order. In the video, this concept is emphasized when explaining how Naive Bayes is 'naive'—it assumes all words in a message are independent and order doesn't matter. Despite this simplification, the model can still perform well in classifying spam and normal messages.
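
A tiny illustration of the bag-of-words assumption: once a message is reduced to word counts, word order disappears, which is exactly why 'dear friend' and 'friend dear' get the same score.

```python
from collections import Counter

# Both orderings collapse to the same bag of words, so they score identically.
print(Counter("dear friend".split()) == Counter("friend dear".split()))   # True
```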

💡Bias and Variance

Bias and variance are key concepts in machine learning that describe the trade-off between the accuracy of a model and its ability to generalize. The video explains that Naive Bayes has high bias because it ignores word order, but it also has low variance, meaning it performs consistently well in practice. This balance allows Naive Bayes to be effective despite its simplicity.

Highlights

Introduction to Naive Bayes and its application in spam filtering.

Multinomial Naive Bayes is the primary focus, with Gaussian Naive Bayes mentioned as a follow-up.

Example given of filtering messages by calculating probabilities of words in normal vs. spam messages.

Explanation of how histograms are used to calculate word probabilities for classification.

Clarification of probability vs. likelihood terminology.

Illustration of calculating the probability of the phrase 'dear friend' being in normal or spam messages.

The introduction of prior probability as an initial guess in Naive Bayes calculations.

Key calculation: 'Dear friend' is more likely to be a normal message based on computed probabilities.
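
For reference, the normal-message side of that key calculation uses only numbers stated in the video (prior 0.67, P(dear | normal) = 0.47, P(friend | normal) = 0.29):

```python
normal_score = 0.67 * 0.47 * 0.29      # prior x P(dear|normal) x P(friend|normal)
print(round(normal_score, 2))          # 0.09, which beats the spam score of 0.01
```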

Introduction of a problem when unseen words (like 'lunch') appear in new messages.

Solution to the zero-probability problem by adding a count (alpha) to each word.

Explanation of why Naive Bayes is called 'naive': it treats all word orders the same, ignoring grammar and syntax.

Even though Naive Bayes ignores word order, it performs well in practice for tasks like spam filtering.

Introduction of the terms 'high bias' and 'low variance' when describing Naive Bayes' effectiveness.

Promotion of StatQuest study guides for those preparing for exams or job interviews.

Final call to support StatQuest via Patreon, merchandise, or donations.

Transcripts

play00:00

I'm at home during lockdown working on

play00:04

my StatQuest yeah I'm at home during

play00:07

lockdown working on my StatQuest yeah

play00:11

StatQuest hello I'm Josh Starmer

play00:16

welcome to StatQuest today we're

play00:19

gonna talk about naive Bayes and it's

play00:21

gonna be clearly explained this StatQuest

play00:25

is sponsored by JADBio just add

play00:28

data and their automatic machine

play00:31

learning algorithms will do the rest of

play00:32

the work for you for more details follow

play00:36

the link in the pinned comment below

play00:39

note when most people want to learn

play00:42

about naive Bayes they want to learn

play00:44

about the multinomial naive bayes

play00:46

classifier and that's what we talk about

play00:49

in this video

play00:50

however just know that there is another

play00:53

commonly used version of naive Bayes

play00:55

called Gaussian naive Bayes

play00:57

classification and I cover that in a

play01:00

follow-up StatQuest so check that one

play01:03

out when you're done with this quest BAM

play01:08

now imagine we received normal messages

play01:11

from friends and family and we also

play01:13

received spam unwanted messages that are

play01:17

usually scams or unsolicited

play01:18

advertisements and we wanted to filter

play01:22

out the spam messages so the first thing

play01:27

we do is make a histogram of all the

play01:29

words that occur in the normal messages

play01:31

from friends and family we can use the

play01:35

histogram to calculate the probabilities

play01:37

of seeing each word given that it was in

play01:40

a normal message for example the

play01:43

probability we see the word dear given

play01:47

that we saw it in a normal message

play01:50

is eight the total number of times dear

play01:53

occurred in normal messages divided by

play01:57

17 the total number of words in all of

play02:00

the normal messages

play02:02

and that gives us 0.47 so let's put that

play02:07

over the word dear so we don't forget it

play02:11

likewise the probability that we see the

play02:14

word friend given that we saw it in a

play02:18

normal message is 5 the total number of

play02:22

times friend occurred in normal messages

play02:25

divided by 17 the total number of words

play02:28

in all of the normal messages and that

play02:32

gives us zero point two nine so let's

play02:36

put that over the word friend so we

play02:38

don't forget it likewise the probability

play02:42

that we see the word lunch given that

play02:44

it is in a normal message is 0.18

play02:48

and the probability that we see the word

play02:50

money given that it is in a normal

play02:53

message is 0.06 now we make a histogram

play02:58

of all the words that occur in the spam

play03:00

and calculate the probability of seeing

play03:04

the word dear given that we saw it in

play03:07

the spam

play03:09

and that is two the number of times we

play03:12

saw dear in the spam divided by seven

play03:16

the total number of words in the spam

play03:18

and that gives us zero point two nine

play03:23

likewise we calculate the probability of

play03:26

seeing the remaining words given that

play03:29

they were in the spam BAM now because

play03:35

these histograms are taking up a lot of

play03:37

space let's get rid of them but keep the

play03:40

probabilities oh no it's the dreaded

play03:44

terminology alert because we have

play03:47

calculated the probabilities of discrete

play03:50

individual words and not the probability

play03:53

of something continuous like weight or

play03:55

height these probabilities are also

play03:58

called likelihoods

play04:01

I mention this because some tutorials

play04:04

say these are probabilities and others

play04:06

say they are likelihoods in this case

play04:09

the terms are interchangeable so don't

play04:12

sweat it we'll talk more about

play04:15

probabilities versus likelihoods when we

play04:17

talk about Gaussian naive Bayes in the

play04:20

follow-up Quest

play04:22

now imagine we got a new message that

play04:25

said dear friend and we want to decide

play04:30

if it is a normal message or spam

play04:34

we start with an initial guess about the

play04:36

probability that any message regardless

play04:39

of what it says is a normal message this

play04:43

guess can be any probability that we

play04:45

want but a common guess is estimated

play04:48

from the training data for example since

play04:51

8 of the 12 messages are normal messages

play04:54

our initial guess will be 0.67 so let's

play05:00

put that under the normal messages so we

play05:02

don't forget it

play05:04

oh no it's another dreaded terminology

play05:07

alert the initial guess that we observe

play05:10

a normal message is called a prior

play05:12

probability

play05:15

now we multiply the initial guess by the

play05:17

probability that the word dear occurs in

play05:20

a normal message and the probability

play05:23

that the word friend occurs in a normal

play05:25

message

play05:27

now we just plug in the values that

play05:29

we've worked out earlier and do the math

play05:32

beep-boop beep-boop it and we get 0.09

play05:39

we can think of 0.09 as the score that

play05:43

dear friend gets if it is a normal

play05:45

message however technically it is

play05:49

proportional to the probability that the

play05:52

message is normal given that it says

play05:54

dear friend

play05:56

so let's put that on top of the normal

play05:58

messages so we don't forget

play06:01

now just like we did before we start

play06:04

with an initial guess about the

play06:06

probability that any message regardless

play06:08

of what it says is spam

play06:11

and just like before the guess can be

play06:14

any probability we want but a common

play06:17

guess is estimated from the training

play06:19

data

play06:20

and since four of the twelve messages

play06:22

are spam our initial guess will be 0.33

play06:28

so let's put that under the spam so we

play06:31

don't forget it

play06:33

now we multiply that initial guess by

play06:36

the probability that the word dear

play06:38

occurs in spam and the probability that

play06:42

the word friend occurs in spam

play06:45

now we just plugged in the values that

play06:47

we worked out earlier and do the math

play06:50

BIP BIP BIP BIP BIP and we get 0.01

play06:57

like before we can think of 0.01 as the

play07:01

score that dear friend gets if it is spam

play07:05

however technically it is proportional

play07:08

to the probability that the message is

play07:11

spam given that it says dear friend

play07:15

and because the score we got for normal

play07:17

message 0.09 is greater than the score

play07:21

we got for spam 0.01 we will decide that

play07:26

dear friend is a normal message double

play07:31

BAM now before we move on to a slightly

play07:35

more complex situation let's review what

play07:38

we've done so far we started with

play07:42

histograms of all the words in the

play07:44

normal messages and all of the words in

play07:47

the spam then we calculated the

play07:50

probabilities of seeing each word given

play07:53

that we saw the word in either a normal

play07:55

message or spam then we made an initial

play07:58

guess about the probability of seeing a

play08:01

normal message

play08:03

this guess can be anything between zero

play08:06

and one but we based ours on the

play08:08

classifications in the training data set

play08:11

then we made the same sort of guess

play08:13

about the probability of seeing spam

play08:17

then we multiplied our initial guess

play08:19

that the message was normal by the

play08:22

probabilities of seeing the words dear

play08:24

and friend given that the message was

play08:27

normal then we multiplied our initial

play08:30

guess that the message was spam by the

play08:33

probabilities of seeing the words dear

play08:36

and friend given that the message was

play08:38

spam

play08:40

then we did the math and decided that

play08:42

dear friend was a normal message because

play08:45

0.09 is greater than 0.01

play08:50

now that we understand the basics of how

play08:52

naive Bayes classification works let's

play08:56

look at a slightly more complicated

play08:58

example

play09:00

this time let's try to classify this

play09:03

message lunch money money money money

play09:07

note this message contains the word

play09:10

money four times and since the

play09:14

probability of seeing the word money is

play09:16

much higher in spam than in normal

play09:19

messages then it seems reasonable to

play09:22

predict that this message will end up

play09:24

being spam so let's do the math

play09:29

calculating the score for a normal

play09:31

message works just like before we start

play09:35

with the initial guess then we multiply

play09:37

it by the probability we see lunch given

play09:40

that it is in a normal message and the

play09:43

probability we see money four times

play09:46

given that it is in a normal message

play09:49

when we do the math we get this tiny

play09:52

number

play09:54

however when we do the same calculation

play09:57

for spam we get zero

play10:02

this is because the probability we see

play10:04

lunch in spam is zero since it was not

play10:07

in the training data and when we plug in

play10:11

zero for the probability we see lunch

play10:13

given that it was in spam then it

play10:17

doesn't matter what value we picked for

play10:19

the initial guess that the message was

play10:21

spam and it doesn't matter what the

play10:24

probability is that we see money given

play10:27

that the message was spam because

play10:30

anything times zero is zero

play10:35

in other words if a message contains the

play10:38

word lunch it will not be classified as

play10:41

spam and that means we will always

play10:44

classify the messages with lunch in them

play10:46

as normal no matter how many times we

play10:49

see the word money

play10:51

and that's a problem

play10:55

to work around this problem people

play10:57

usually add one count represented by a

play11:00

black box to each word in the histograms

play11:04

note the number of counts we add to each

play11:07

word is typically referred to with the

play11:09

Greek letter alpha in this case alpha

play11:13

equals one but we could have set it to

play11:16

anything

play11:18

anyway now when we calculate the

play11:21

probabilities of observing each word

play11:23

we never get 0 for example the

play11:28

probability of seeing lunch given that

play11:30

it is in spam is 1 divided by 7 the total number of

play11:36

words in spam plus 4 the extra counts

play11:40

that we added

play11:42

and that gives us 0.09 note adding

play11:48

counts to each word does not change our

play11:50

initial guess that a message is normal

play11:52

or the initial guess that the message is

play11:55

spam because adding a count to each word

play11:59

did not change the number of messages in

play12:01

the training data set that are normal or

play12:04

the number of messages that are spam

play12:08

now when we calculate the scores for

play12:11

this message we still get a small number

play12:14

for the normal message

play12:17

but now when we calculate the value for

play12:19

spam we get a value greater than zero

play12:22

and since the value for spam is greater

play12:26

than the one for a normal message we

play12:29

classify the message as spam spam

play12:35

now let's talk about why naive Bayes is

play12:38

naive the thing that makes naive Bayes

play12:42

so naive is that it treats all word

play12:45

orders the same for example the normal

play12:50

message score for the phrase dear friend

play12:53

is the exact same for the score for

play12:56

friend dear in other words regardless of

play13:01

how the words are ordered we get 0.08

play13:06

treating all word orders equal is very

play13:09

different from how you and I communicate

play13:13

every language has grammar rules and

play13:15

common phrases but naive Bayes ignores

play13:18

all of that stuff

play13:20

instead naive Bayes treats language like

play13:24

it is just a bag full of words and each

play13:26

message is a random handful of them

play13:29

naive bayes ignores all the rules

play13:32

because keeping track of every single

play13:34

reasonable phrase in a language would be

play13:36

impossible that said even though naive

play13:41

bayes is naive it tends to perform

play13:43

surprisingly well when separating normal

play13:46

messages from spam

play13:49

in machine learning lingo we'd say that

play13:52

by ignoring relationships among words

play13:54

naive Bayes has high bias but because it

play13:59

works well in practice naive Bayes has

play14:02

low variance shameless self-promotion

play14:06

if you are not already familiar with the

play14:09

terms bias and variance check out the

play14:12

StatQuest the link is in the description

play14:14

below

play14:16

triple spam

play14:19

oh no it's one last shameless

play14:22

self-promotion one awesome way to

play14:25

support StatQuest is to purchase the

play14:27

naive Bayes

play14:28

StatQuest study guide it has

play14:31

everything you need to study for an exam

play14:33

or job interview it's eight pages of

play14:36

total awesomeness and while you're there

play14:39

check out the other StatQuest study

play14:42

guides there's something for everyone

play14:45

hooray we've made it to the end of

play14:48

another exciting StatQuest if you liked

play14:51

this StatQuest and want to see more

play14:53

please subscribe and if you want to

play14:55

support StatQuest consider

play14:57

contributing to my patreon campaign

play14:58

becoming a channel member buying one or

play15:01

two of my original songs or a t-shirt or

play15:03

a hoodie or just donate the links are in

play15:06

the description below alright until next

play15:09

time quest on

Related Tags
Naive Bayes · Machine Learning · Spam Filter · Multinomial Naive Bayes · Gaussian Naive Bayes · StatQuest · Classification · Data Science · Probability · Training Data