Naive Bayes classifier: A friendly approach

Serrano.Academy
10 Feb 2019 · 20:29

Summary

TL;DR: In this video, Luis Serrano explains the Naive Bayes classifier, a fundamental concept in probability and machine learning. He uses the example of building a spam detector to illustrate how Bayes' theorem is applied. The video covers calculating the probability that an email is spam based on keywords like 'buy' and 'cheap'. It also discusses the 'naive' assumption of independence between features, which simplifies calculations. Luis provides a detailed walkthrough of how to apply Bayes' theorem and the naive assumption to estimate probabilities when not all data points are available.

Takeaways

  • 📝 Bayes' Theorem is a fundamental concept in probability and machine learning, used to calculate the probability of an event based on prior knowledge of conditions.
  • 📝 Naive Bayes is an extension of Bayes' Theorem that simplifies calculations by making the assumption that features are independent, even when they might not be.
  • 📝 The video uses the example of a spam detector to explain how Naive Bayes can be applied to classify emails into spam or not spam based on the presence of certain words.
  • 📝 The script demonstrates how to calculate the probability of an email being spam if it contains specific words, like 'buy' and 'cheap', using Bayes' Theorem.
  • 📝 It explains the concept of conditional probability and how it is used in the context of Naive Bayes to determine the likelihood of spam based on email content.
  • 📝 The video highlights the importance of making naive assumptions about independence between features to simplify the calculations and make the model more manageable.
  • 📝 The script shows how to handle situations where data is sparse or certain combinations of features do not appear in the training set.
  • 📝 It emphasizes that even with the naive assumption of independence, Naive Bayes classifiers can perform well in practice for many classification tasks.
  • 📝 The video concludes by summarizing the process of filling out a probability table and using it to calculate the likelihood of an email being spam based on multiple features.
  • 📝 It challenges viewers to understand the math behind Naive Bayes and to appreciate the simplicity of calculating probabilities by dividing one quantity by another.

Q & A

  • What is the Naive Bayes classifier?

    -The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

  • What is Bayes' theorem?

    -Bayes' theorem is a fundamental principle in probability theory that describes the probability of an event based on prior knowledge of conditions that might be related to the event.

  • How does the Naive Bayes classifier work for spam detection?

    -For spam detection, the Naive Bayes classifier works by calculating the probability of an email being spam based on the presence of certain keywords or features that are indicative of spam.

  • What is the significance of the word 'buy' in the context of the spam detector example?

    -In the spam detector example, the word 'buy' is chosen as a feature that is likely to appear more frequently in spam emails compared to non-spam emails.

  • How is the probability of an email being spam calculated if it contains the word 'buy'?

    -The probability is calculated by dividing the number of spam emails containing the word 'buy' by the total number of emails containing the word 'buy'.

  • What is the role of the word 'cheap' in the spam detection example?

    -Similar to 'buy', 'cheap' is another feature that might be more common in spam emails, and its presence is used to calculate the likelihood of an email being spam.

  • What happens when you apply Naive Bayes to multiple features, like both 'buy' and 'cheap'?

    -When applying Naive Bayes to multiple features, you calculate the combined probability of an email being spam given the presence of all those features, assuming independence between them (a runnable sketch of this computation appears after this Q&A list).

  • Why is the assumption of independence between features considered 'naive'?

    -The assumption of independence is considered 'naive' because in reality, features are often not independent. However, this simplification allows for easier calculations and can still yield good results.

  • How does the Naive Bayes classifier handle situations where certain combinations of features have not been observed in the training data?

    -The classifier uses the assumption of feature independence to estimate probabilities for unseen combinations, allowing it to make predictions even with limited data.

  • What is the importance of the dataset size when using the Naive Bayes classifier?

    -A larger dataset can provide more accurate probabilities for the features, but the Naive Bayes classifier can still perform well with smaller datasets due to its simplicity and the assumption of feature independence.

  • Can the Naive Bayes classifier be improved by considering feature dependencies?

    -Yes, the classifier can potentially be improved by using more sophisticated models that capture feature dependencies, but this comes at the cost of increased complexity and computational requirements.
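
The mechanics described in these answers fit in a few lines of code. Below is a minimal Python sketch, assuming the toy counts from the video (25 spam and 75 ham emails); the function and variable names are illustrative, not from the video:

def naive_bayes_spam_score(word_counts_spam, word_counts_ham,
                           n_spam, n_ham, words):
    """Return P(spam | all words present) under the naive
    independence assumption."""
    # P(words | spam) * (number of spam emails): the "cooked up" count
    # of spam emails expected to contain every word.
    spam_side = n_spam
    for w in words:
        spam_side *= word_counts_spam[w] / n_spam
    # Same quantity for ham (non-spam) emails.
    ham_side = n_ham
    for w in words:
        ham_side *= word_counts_ham[w] / n_ham
    # Normalize: the spam share of the combined estimate.
    return spam_side / (spam_side + ham_side)

# Counts from the video's example dataset.
spam_counts = {"buy": 20, "cheap": 15, "work": 5}
ham_counts = {"buy": 5, "cheap": 10, "work": 30}

print(naive_bayes_spam_score(spam_counts, ham_counts, 25, 75, ["buy"]))                   # 0.8
print(naive_bayes_spam_score(spam_counts, ham_counts, 25, 75, ["buy", "cheap"]))          # ~0.94737
print(naive_bayes_spam_score(spam_counts, ham_counts, 25, 75, ["buy", "cheap", "work"]))  # 0.9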

Outlines

00:00

📊 Introduction to Naive Bayes Classifier

Luis Serrano introduces the concept of the Naive Bayes classifier, explaining its importance in probability and machine learning. He clarifies that Bayes' theorem is about calculating the probability of an event given some prior knowledge, and Naive Bayes extends this idea by making simplifying assumptions to handle complex scenarios. The example of building a spam detector is used to illustrate the concept, where the presence of certain words in emails is correlated with them being spam or not. The video uses the word 'buy' to demonstrate how to calculate the probability of an email being spam based on its content.
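
As a quick illustration of this step, here is the single-word calculation in Python, assuming the counts stated in the video (20 of 25 spam emails and 5 of 75 non-spam emails contain 'buy'):

# Bayes' theorem reduces here to a simple ratio of counts.
spam_with_buy = 20   # spam emails containing 'buy'
ham_with_buy = 5     # non-spam emails containing 'buy'

p_spam_given_buy = spam_with_buy / (spam_with_buy + ham_with_buy)
print(p_spam_given_buy)  # 0.8, the 80% from the video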

05:00

🔍 Handling Overlapping Features in Naive Bayes

The script delves into the challenge of handling multiple features, such as the words 'buy' and 'cheap', in a Naive Bayes classifier. It discusses the issue of zero instances of non-spam emails containing both words in a small dataset and how this could skew the classifier's accuracy. The solution proposed is to make an assumption about the independence of these features and estimate the probability of their co-occurrence based on their individual probabilities, despite the lack of direct evidence in the data.
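
A small sketch of the estimate described here, assuming the video's 100-email illustration (5 emails contain 'buy', 10 contain 'cheap', none contain both):

n_emails = 100
p_buy = 5 / n_emails     # 5% of emails contain 'buy'
p_cheap = 10 / n_emails  # 10% of emails contain 'cheap'

# Naive assumption: treat the words as independent and estimate the
# overlap from the individual rates instead of using the raw zero.
expected_both = p_buy * p_cheap * n_emails
print(expected_both)  # 0.5 -- "half an email", a usable non-zero estimate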

10:00

📉 Applying Naive Assumptions to Improve Calculations

Luis explains how the Naive Bayes classifier uses the assumption of independence between features to simplify calculations. By assuming that the presence of one word does not affect the presence of another, the video demonstrates how to estimate the probability of an email being spam based on multiple keywords. It shows how to calculate this probability by multiplying the probabilities of individual words appearing in spam emails and compares it to the same calculation for non-spam emails.
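
Written out in Python with the video's counts, the two-word calculation looks like this:

# Estimated number of spam / non-spam emails containing both words,
# under the naive independence assumption.
spam_est = 25 * (20 / 25) * (15 / 25)  # 12 spam emails
ham_est = 75 * (5 / 75) * (10 / 75)    # 2/3 of a non-spam email

p_spam = spam_est / (spam_est + ham_est)
print(round(p_spam * 100, 3))  # 94.737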

15:01

📈 Expanding Naive Bayes to More Features

The script extends the discussion to include more features, such as the word 'work', in the Naive Bayes classifier. It shows how to incorporate additional features into the model by making the same naive independence assumption and calculating the combined probability of multiple words appearing in an email. The video explains how some features can increase the likelihood of an email being spam, while others can decrease it, and how the Naive Bayes classifier combines these to make a prediction.
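
The same computation extended to the third word, again with the video's counts (5 of 25 spam emails and 30 of 75 non-spam emails contain 'work'):

spam_est = 25 * (20 / 25) * (15 / 25) * (5 / 25)  # 12/5 = 2.4
ham_est = 75 * (5 / 75) * (10 / 75) * (30 / 75)   # 4/15, about 0.267

p_spam = spam_est / (spam_est + ham_est)
print(round(p_spam * 100, 1))  # 90.0 -- 'work' pulls the probability down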

20:02

📝 Wrapping Up Naive Bayes Explanation

In the final paragraph, Luis summarizes the process of using Naive Bayes for spam detection, emphasizing the simplicity of calculating probabilities by dividing one number by another. He invites viewers to engage with the content by subscribing, liking, sharing, and commenting with questions or suggestions for future videos. The video concludes with a reminder to follow Luis on Twitter for more mathematical insights.

Keywords

💡Naive Bayes Classifier

The Naive Bayes Classifier is a simple yet powerful probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features. In the video, Luis Serrano uses the example of a spam detector to illustrate how the classifier works. The classifier calculates the probability of an email being spam based on the presence of certain words, like 'buy' and 'cheap', assuming that the presence of these words does not depend on each other.

💡Bayes' Theorem

Bayes' Theorem is a fundamental principle in probability theory and statistics that describes the probability of an event based on prior knowledge of conditions that might be related to the event. In the context of the video, Bayes' Theorem is used to calculate the likelihood of an email being spam given the presence of certain words. The theorem is fundamental to understanding how the Naive Bayes Classifier makes its predictions.

💡Spam Detector

A spam detector is a system designed to identify and filter out unsolicited emails, often referred to as 'spam'. In the video, Luis Serrano uses the spam detector as an example application of the Naive Bayes Classifier. The goal is to sort emails into 'spam' or 'ham' (non-spam) categories based on the presence of certain keywords.

💡Conditional Probability

Conditional probability is the probability of an event occurring, given that another event has occurred. In the video, conditional probability is used to calculate the likelihood that an email is spam based on the presence of specific words like 'buy'. For example, the probability that an email is spam given that it contains the word 'buy' is calculated as 80% based on the dataset provided.

💡Feature

In machine learning, a feature is an individual measurable property or characteristic of a phenomenon being observed. In the video, features refer to properties of emails such as the presence of specific words ('buy', 'cheap', 'work') that are used to predict whether an email is spam or not.

💡Independence Assumption

The independence assumption in the context of the Naive Bayes Classifier is the assumption that the presence of a word is unrelated to the presence of any other word. This simplifies the calculations significantly but may not always reflect reality. In the video, Luis Serrano explains that assuming 'buy' and 'cheap' are independent makes the math easier, even though they might not be in reality.
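
To see why the assumption is called 'naive', compare the observed joint frequency with the product of the individual frequencies, using the video's spam counts:

p_buy = 20 / 25            # fraction of spam emails containing 'buy'
p_cheap = 15 / 25          # fraction of spam emails containing 'cheap'
p_both_observed = 12 / 25  # fraction of spam emails containing both

print(p_buy * p_cheap)   # 0.48 -- the independence estimate
print(p_both_observed)   # 0.48 -- happens to match exactly here

# For the non-spam emails the observed joint count was 0, while the
# product (5/75) * (10/75) is about 0.0089: there the naive estimate
# and the raw data disagree, and the estimate is what keeps the
# classifier usable.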

💡Dataset

A dataset is a collection of data, typically used for analysis or to train machine learning models. In the video, Luis Serrano refers to a dataset of 100 emails, with 25 marked as spam and 75 as non-spam, which is used to train the spam detector and to illustrate how the Naive Bayes Classifier works.

💡Probability

Probability is a measure of the likelihood that an event will occur. In the video, probability is used extensively to determine the likelihood of an email being spam based on the presence of certain words. The calculations involve ratios and percentages derived from the dataset to estimate these probabilities.

💡Email

An email, short for 'electronic mail', is a method of exchanging messages and information from an author to one or more recipients. In the video, emails are the subject of classification, with the Naive Bayes Classifier being used to distinguish between spam and non-spam (ham) emails.

💡Ham

In the context of email filtering, 'ham' refers to legitimate, non-spam emails. The term is used in contrast to 'spam', which refers to unsolicited or junk emails. In the video, Luis Serrano uses the terms 'spam' and 'ham' to categorize emails in the dataset used to train the Naive Bayes Classifier.

Highlights

Naive Bayes classifier is based on Bayes' theorem and is useful in machine learning for tasks like spam detection.

Bayes' theorem is about calculating the probability of an event given some information about another event.

Naive Bayes simplifies calculations by making assumptions about the independence of events.

An example of building a spam detector using email data is provided.

The word 'buy' is studied for its correlation with spam emails.

It's found that 80% of emails containing 'buy' are spam.

The word 'cheap' is also studied; 60% of the emails containing it are classified as spam.

When considering both 'buy' and 'cheap', the raw counts in the small dataset put the probability of an email being spam at 100%, a suspiciously strong result.

The naive assumption is made that words 'buy' and 'cheap' are independent.

The independence assumption allows for easier calculations even with limited data.

The concept of 'ham' is introduced as a term for non-spam emails.

The video explains how to fill out a probability table using Bayes' theorem.

The importance of normalization in calculating final probabilities is discussed.

Naive Bayes can handle many features by assuming independence between them.

The video concludes by emphasizing that Naive Bayes combines multiple features into a model for spam detection.

The presenter invites viewers to engage by subscribing, liking, sharing, and commenting for more content.

Transcripts

00:00

I am Luis Serrano, and this video is about the naive Bayes classifier. Now, Bayes' theorem is one of the most important things in probability, and it's very useful in machine learning. You may have seen it as a complicated formula involving some ratios of probabilities. I like to see it a little differently; I like to think of it as: what is the probability of something happening, given that we know that something else happened? And naive Bayes is an extension of this which basically says: OK, once I have too many events and I don't know how to handle them, are there any naive assumptions I can make to make the math easier? That's what we're going to see today.

00:39

Let's start with an example. Say we want to build a spam detector, because we are tired of seeing a lot of spam email in our inbox and we want to sort it properly. How do we build it? We build it with previous data. Let's say our previous data is a set of one hundred emails, and when we look at them carefully, 25 of them are spam and 75 are not spam. What we're going to do is pick properties of the emails that we think may correlate with them being spam or not spam. Let's pick one: say we study the appearance of the word 'buy', because we think that emails containing the word 'buy' are more likely to be spam. Let's see how many emails that are spam have the word 'buy': it turns out there are 20 of them. And how many emails that are not spam have the word 'buy'? There are five. So let's forget about all the others and just look at those.

01:37

Here's a quiz: if an email contains the word 'buy', what is the probability that this email is spam, given the data that we have? The options are 40%, 60%, 80%, and 100%, so feel free to pause the video and think about it yourself. I'll tell you the answer: if we look at the emails that contain the word 'buy', there are 20 that are spam and five that are not, so that makes an 80/20 split. From this data we can see that, of the emails that contain the word 'buy', 80% are spam. So we conclude, just from this data, that the probability that an email is spam if it contains the word 'buy' is 80 percent. We associate the condition of containing the word 'buy' with the probability 80 percent, and that is exactly what Bayes' theorem is. You may have seen it in a different way, as a formula, but this is really what it is.

02:45

Just for fun, let's do it for a different property, a different word. Say we think the word 'cheap' may also be a good way to tell if an email is spam. We count how many times the word 'cheap' appears in spam emails: it's in 15 of them; and among the non-spam emails, ten have the word 'cheap'. We forget about the rest, and quiz again: if an email contains the word 'cheap', what is the probability it's spam: 40, 60, 80, or 100? Again, feel free to pause the video. The answer is 60%, because if you look at the split, there are 15 spam and 10 non-spam among the emails that contain the word 'cheap'. That's a 60/40 split, so the solution is 60%.

03:35

So we applied Bayes' theorem for two words and obtained 80 and 60. Now here's where things get complicated: what if we want to apply it to both words at the same time? We want the probability of an email being spam if it contains both the word 'buy' and the word 'cheap'. Well, we can do the same thing: count how many emails contain the word 'buy', how many contain the word 'cheap', and then look at the overlap. There are actually 12 spam emails that contain the words 'buy' and 'cheap', so that's some good data. Now let's look among the non-spam emails: there are these five that contain the word 'buy' and these ten that contain the word 'cheap', but there are none that contain both words. That's okay; we'll do the same thing as before. We have 12 spam emails and zero non-spam emails that contain the words 'buy' and 'cheap'. So, easiest quiz in the world: if an email contains the words 'buy' and 'cheap', what is the probability it's spam: 40, 60, 80, or 100? This should be easy, because there are twelve spam emails that contain both words and zero non-spam emails that do, and that is a 100%/0% split. So the answer is 100%, and we are done, right?

04:43

Well, maybe you're being skeptical like me. That seems like a little too much; any classifier that tells you something with 100% certainty is too strong. So where lies the problem? The problem lies here: we had 12 emails that contained both words 'buy' and 'cheap', and that's not bad, but here we had zero. Among the non-spam emails there are zero that contain the words 'buy' and 'cheap', and that's just unfortunate: in our data the two words don't appear together, but it's perfectly possible that they could. We can't restrict ourselves to not having a classifier with the words 'buy' and 'cheap' just because in our small dataset the words don't appear together. So what could we do? One solution could be to collect more data: go through a lot more emails until we find the words 'buy' and 'cheap' together, and then apply Bayes' theorem to those. But what if we just can't? What if we can't collect more data and have to make do with the data that we have? In that situation we have to sort of imagine how many emails would contain the words 'buy' and 'cheap': we try to come up with a sensible guess for the number of emails that would contain both words, even if we found none.

06:06

So let's look at a slightly larger dataset. Say we have a hundred emails (this is a different set from the first one), and say that five contain the word 'buy', ten contain the word 'cheap', and they don't overlap. What do you think would be a sensible number of emails containing both 'buy' and 'cheap'? Let's think: 5 out of 100 is 5%, so 5% of the emails contain the word 'buy', and 10 out of 100 is 10%, so 10% of the emails contain the word 'cheap'. In an ideal world, how many emails would contain the words 'buy' and 'cheap'? Well, what is ten percent of five percent? It's zero point five percent. So why don't we just assume that 0.5% of the emails contain the words 'buy' and 'cheap'? We can imagine that there is half an email that contains the words 'buy' and 'cheap', and since all we're doing is math, it doesn't really matter that there's half an email; it will work out in our formulas.

07:11

What we did is an assumption: we assumed that the words 'buy' and 'cheap' are independent. They may not be. It could be that containing the word 'buy' makes it easier to contain the word 'cheap', because you're talking about a product and they say "buy cheap something"; or it could be the opposite, that if one appears, it forces the other one not to appear, or to be less likely to appear. So it's quite a strong assumption; as a matter of fact, many people would say it's a naive assumption, because assuming that two variables are independent when they may not be is very naive. However, that's what our algorithm is based on, because it turns out that if we make this assumption, things still work well, and it makes our math much, much easier. Now we don't have to collect thousands of emails; we can collect these 100, and from the number of appearances of 'buy' and the number of appearances of 'cheap', we can cook up the number of appearances of 'buy' and 'cheap' together.

08:06

So let's do that: let's go back to our data. We had 25 spam emails; 20 of them had the word 'buy', which is 4/5, and 15 of them had the word 'cheap', which is 3/5. The product of these is 12 divided by 25, so we could assume that on average 12 emails out of 25 would contain the words 'buy' and 'cheap'. To find the actual number we multiply by 25, and we get that 12 emails have the words 'buy' and 'cheap'. That was kind of lucky: we actually did find 12. We're not going to be that lucky in the other case, but we can still do it. We have 75 emails; five of them have the word 'buy', that's 1/15, and ten of them have the word 'cheap', that's 2/15. The product of these two fractions, again assuming the words are independent, is 2 divided by 225; that's the fraction of emails that contain the words 'buy' and 'cheap'. To find the actual number we multiply by 75, and we get 2/3. So here we have 2/3 of an email containing the words 'buy' and 'cheap', and that's fine; let's work with that.

09:19

We go back to our data: on the left we have 12 emails that contain the words 'buy' and 'cheap', and on the right we have 2/3 of an email that contains the words 'buy' and 'cheap'. And we can do math with these, because now the quiz says: if an email contains the words 'buy' and 'cheap', what is the probability that it is spam? What is the split between 12 and 2/3? We take the spam ones, that's 12, and divide by the total number of emails that contain 'buy' and 'cheap', which is 12 plus 2/3, because there are 12 that are spam and 2/3 that are not. So we find the ratio between these; and by the way, if you've seen the formula for Bayes' theorem, there's a ratio in it, and it's precisely this one. What do we do with this fraction? We put it in lowest terms: it's 36 over 38, or 94.737 percent, because the split is 94.737 versus 5.263. Therefore our final answer is that the words 'buy' and 'cheap' give us a probability of 94.737 percent of being spam: if we have an email with both of those words, it is 94.737 percent likely to be spam.

10:32

And that is precisely the naive Bayes classifier: it is basically a combination of Bayes' theorem and the naive assumption that two events are independent when they may not be. That naive assumption makes the math much, much easier. So let's do a little summary. What we're really doing is filling out this table, and some places of the table we can't fill out from the data, so we fill them out from other places in the table. Let's look at the spam and non-spam emails: the totals were 25 spam emails and 75 non-spam emails in our dataset. In the next row we count how many of them have the word 'buy': 20 of the 25 spam emails have the word 'buy', that's 4/5; and five of the 75 non-spam emails have the word 'buy', that's 1/15, because it's five divided by 75. Now we fill in the next row: 15 of the spam emails contain the word 'cheap', that's 3/5, because it's 15 divided by 25; and ten of the 75 non-spam emails contain the word 'cheap', that's 2/15, because it's 10 divided by 75.

11:40

We would love to fill in the last row, for the words 'buy' and 'cheap' together, with data, but unfortunately our dataset is not big enough to handle an event as sparse as the words 'buy' and 'cheap' appearing together; and you can imagine that with more words it would be even harder. So we have to cook up this row from the previous ones. We make the naive assumption that the words 'buy' and 'cheap' are independent, so that one doesn't push the other one to appear or stop it from appearing. Under this assumption, the product of the two entries is the probability of the words 'buy' and 'cheap' appearing: that's 12 divided by 25, the product of 4/5 and 3/5. Now, if this is the probability of 'buy' and 'cheap' appearing, how many emails contain 'buy' and 'cheap'? We just multiply by the total number, which is 25, and 25 times 12 over 25 is 12. So we conclude that 12 emails should contain the words 'buy' and 'cheap': whether in reality there were 12 or 14 or 10 or none, logically, under this assumption, there should be 12.

12:53

Now let's look at the other two boxes. Again we assume that the words 'buy' and 'cheap' are independent of each other, so the product of these two entries, which is 2 divided by 225, is the probability of the words 'buy' and 'cheap' appearing in an email that is not spam. How many non-spam emails contain the words 'buy' and 'cheap'? The probability times the total number: 2 over 225 times 75, which is two thirds. So we have twelve spam emails, and two thirds of a non-spam email, containing the words 'buy' and 'cheap'. Now we have to normalize: we have to see what percentage are spam among the total. The total is twelve plus two thirds, which is all of our emails that contain the words 'buy' and 'cheap'. We divide twelve, the spam ones, by the total, twelve plus two thirds, and we get 36 over 38, which is 94.737.
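
A sketch of the table being filled out in this part of the video, in Python (the layout is illustrative; the numbers are the video's):

# Rows 'buy' and 'cheap' come from counting; the last row is "cooked up"
# as the product of the rows above it (the naive assumption).
table = {
    "total": {"spam": 25, "ham": 75},
    "buy":   {"spam": 20 / 25, "ham": 5 / 75},
    "cheap": {"spam": 15 / 25, "ham": 10 / 75},
}
table["buy and cheap"] = {
    col: table["buy"][col] * table["cheap"][col] for col in ("spam", "ham")
}

# Expected email counts for the sparse event, as computed in the video:
print(table["buy and cheap"]["spam"] * table["total"]["spam"])  # 12.0
print(table["buy and cheap"]["ham"] * table["total"]["ham"])    # 0.666... (two thirds)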

13:53

Notice that naive Bayes extends: the idea is that this works for many, many more properties, because the point is that if we have 50 properties and we can't check when they all appear at the same time, we can check when each one appears and then multiply things. So let's add an extra row to this table. Say we looked at the word 'work', and we're wondering if the word 'work' helps our classifier. Let's study how often it appears: say it appears five times in our spam emails and 30 times in our non-spam emails. So it doesn't look like it's going to help us much; it looks almost like a word that's more correlated with not-spam. But let's study it anyway. 5 out of 25 is 1/5, so one fifth of the spam emails contain the word 'work'; and 6/15 of the non-spam emails contain the word 'work', because 30 divided by 75 is 6 over 15.

14:42

Again we make the naive assumption that the words 'buy', 'cheap', and 'work' are all independent. Then the probability that the three of them appear in a spam email is the product of these three numbers, which is 12 divided by 125. If we want to estimate the number of spam emails that contain those three words, we multiply the probability by the total, and we get twelve divided by five: a little over two emails will be spam and contain the words 'buy', 'cheap', and 'work'. Now let's do the same over here. We assume again that the three words are independent of each other and take the product of the probabilities; that is the probability that the words 'buy', 'cheap', and 'work' all appear in an email when the email is not spam. To find the number of non-spam emails that contain the words 'buy', 'cheap', and 'work', we multiply that probability by the total number of emails, and we get 4/15 of an email, because 75 times 12 divided by 3375 is 4/15.

15:49

So, in summary: of the emails that contain the words 'buy', 'cheap', and 'work', 12/5 are spam and 4/15 are ham. How many are spam, divided by the total? We take twelve over five, the number of spam, divided by the total, which is 12 over 5 plus 4 over 15, and that is 36 over 40 in lowest terms, or 90%.

16:15

That's how we combine the three words. Notice that 90 is less than 94.737: the word 'work' actually decreases the probability that an email is spam, because, as you can see, 'work' appears a lot more in non-spam emails. That makes sense, because it's not a word one would correlate with spam. Some of these properties may increase the probability and some of them may decrease it, but the point is that naive Bayes helps us combine a bunch of different features into a model that calculates the probability that something is spam. These features get combined in a nice way, because we don't have to wait until we find an email with all of the features; we can cook up the probabilities without having emails that satisfy all of them.

17:00

If you like formulas, this is really what happened in the background. This is the formula of Bayes' theorem. The letter S stands for spam; the letter H stands for ham, which is actually what non-spam emails are called; and the letter B stands for 'buy'. The probability of S given B (when you see that vertical bar, it's a conditional probability) is what the left side says: the probability of spam given that the word 'buy' appears, and it's a ratio, because most probabilities are ratios. On the top we have the probability of B given S, that is, out of the spam emails, how many contain the word 'buy': that was 20 out of 25. Then the probability of S is the probability that an email is spam regardless of the words it contains: that's 25/100, because, if we remember, there were 25 spam emails out of 100 total. In the bottom goes everything, the total: the same thing, 20 over 25 times 25 over 100, plus the ham part: the probability of the word 'buy' appearing if the email is ham, which is 5 over 75 (out of 75 ham emails, five have the word 'buy'), times the probability of an email being ham, which is 75 over 100. If you do that whole formula, you get 80%; and the interesting thing is that if you look at what we did, it was exactly that.

18:20

Then what happens with naive Bayes is that we make the assumption that the probability of the word 'buy' and the word 'cheap' appearing together is the product of the probability of the word 'buy' appearing and the probability of the word 'cheap' appearing. Again, this is not guaranteed: the words 'buy' and 'cheap' may be correlated or inversely correlated; maybe one implies the other, maybe one stops the other from appearing. But we naively assume that the probability of some event B intersection event C is the product of the probabilities of B and C. Again, it's a naive assumption, but we make it because it makes our math much easier.

18:57

The full formula for naive Bayes (this is for two events, but you can generalize it to many more) is the probability of spam given that the words 'buy' and 'cheap' appear. If we look at all the probabilities: it's a ratio, and on the top we know all the quantities: 20 out of 25 for the probability of 'buy' given spam; 15 over 25 for the probability of 'cheap' given spam (if you remember correctly, 15 spam emails contain the word 'cheap'); and again 25 over 100 for the probability that an email is spam. In the bottom we have the same thing, plus 5 over 75, the probability that a ham email contains the word 'buy', times 10 over 75, the probability that a ham email contains the word 'cheap', times the probability that an email is ham, which is 75 over 100. You do this math, and you get 94.737. I challenge you: if it doesn't look super clear, look at this slide and go back to what we did with naive Bayes, and convince yourself that it is exactly what we did.
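
For reference, here are the two formulas described verbally above, written out in LaTeX notation (S = spam, H = ham, B = contains 'buy', C = contains 'cheap'):

% Bayes' theorem for a single word, with the video's counts:
P(S \mid B) = \frac{P(B \mid S)\,P(S)}{P(B \mid S)\,P(S) + P(B \mid H)\,P(H)}
            = \frac{\frac{20}{25} \cdot \frac{25}{100}}{\frac{20}{25} \cdot \frac{25}{100} + \frac{5}{75} \cdot \frac{75}{100}} = 0.8

% Naive Bayes for two words: the joint likelihoods are replaced by products.
P(S \mid B \cap C) \approx \frac{P(B \mid S)\,P(C \mid S)\,P(S)}{P(B \mid S)\,P(C \mid S)\,P(S) + P(B \mid H)\,P(C \mid H)\,P(H)} \approx 0.94737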

19:58

Because this whole video was nothing different from calculating probabilities by dividing one thing by another. So thank you very much; that's it for naive Bayes. As usual, if you liked it, please subscribe for more videos coming up; hit like, share the video with your friends, and feel free to comment with questions or suggestions for this or any other videos you'd like to see. My Twitter handle is "Luis likes math". Thank you very much for your attention, and see you in the next video.


Related Tags
Machine Learning, Naive Bayes, Spam Detector, Probability, Data Science, Email Analysis, Statistical Model, Bayes Theorem, Conditional Probability, Feature Independence