Advent Of Cyber: Day 15 - Building Email Spam Detector with ML

Cybrites
15 Dec 2023 (31:54)

Summary

TLDR: This video introduces Sakib, who is taking part in this year's Advent of Cyber event, which runs from 1 December to 24 December and offers the chance to learn something new every day and win great prizes. The video covers the Day 15 task on machine learning, titled "Jingle Bell spam: machine learning saves the day", and takes on building a spam email detector. A machine learning model is trained on a dataset, its effectiveness is evaluated, and finally its performance is checked against a set of test emails. Ways to improve the model are also suggested, such as changing the proportion of test data or increasing the size of the dataset.

Takeaways

  • 📅 This session is part of Advent of Cyber, which runs from 1 December to 24 December.
  • 🎓 Learning objectives are provided every day, and participants get the chance to learn something new and win great prizes.
  • 📈 The topics covered this year include machine learning, malware analysis, penetration testing, digital forensics, and incident response.
  • 📝 Today's Day 15 task is about building a spam email detector with machine learning.
  • 🏢 Recently, Best Festival Company employees have been receiving a large number of spam emails that end up in their mailboxes.
  • 👩‍💻 McSkidy has been tasked with training a machine learning model on a sample dataset to build a spam email detector.
  • 🔍 The task explores the steps of a machine learning pipeline, splitting the dataset into training and testing data, and evaluating the model.
  • 📚 There is some theory to read, but bear with it because it is an important topic.
  • 🔢 Two libraries, NumPy and Pandas, are used for numerical computation and for data structures.
  • 📝 Jupyter Notebook makes it easy to work on machine learning projects.
  • 🔑 Data pre-processing is the technique of converting data into a clean, organised format suitable for a machine learning model.
  • 📈 CountVectorizer is used to convert the text into a numeric format that the machine learning model can work with.
  • ✂️ The dataset is split into training and testing data so the model's performance can be tested.
  • 🤖 Naive Bayes, a statistical method, is used to decide whether a new email is spam.
  • 📊 Model evaluation uses metrics such as accuracy, precision, recall, and F1 score to assess the model's performance.
  • 📧 The trained model is run against a set of test emails and the predictions are checked.
  • 🔧 To improve the model's performance, you can change the proportion of test data or increase the size of the dataset.

Q & A

  • When does Advent of Cyber take place?

    -Advent of Cyber provides daily learning objectives from 1 December to 24 December.

  • What is the topic of today's task?

    -Today's task is about machine learning and is titled "Jingle Bell spam: machine learning saves the day".

  • Which algorithm is used to detect spam emails?

    -A Naive Bayes based algorithm is used to detect spam emails.

  • Why is the dataset split into training and testing data?

    -Splitting the dataset lets the model learn from the training data and then have its performance on unseen data evaluated with the testing data.

  • Why is the data pre-processed before training?

    -Pre-processing is the step that converts raw data into a format the machine learning model can understand, and it ensures the quality of the data.

  • How does the Naive Bayes classifier decide whether a new email is spam?

    -The Naive Bayes classifier looks at the words in each email, calculates how frequently each word appears in spam and in ham emails, and uses those frequencies to decide whether a new email is spam.

  • Which metrics are used to evaluate the model's performance?

    -Metrics such as accuracy, precision, recall, and F1 score are used to evaluate the model's performance.

  • How often is the model correct when it predicts spam on the test dataset?

    -When the model predicts spam on the test dataset, it is correct 90% of the time.

  • How often is the model correct when it predicts ham (non-spam email) on the test dataset?

    -When the model predicts ham on the test dataset, it is correct 99% of the time.

  • What is the first step in the machine learning pipeline?

    -The first step in the machine learning pipeline is data collection.

  • Which data pre-processing technique is used to create new features or modify existing ones?

    -Feature engineering is the technique of creating new features or modifying existing ones to improve the model's performance.

  • How many of the test emails were marked as spam?

    -Two or three of the emails in the test set were marked as spam.

  • What is the secret code contained in one of the test emails detected as spam?

    -The secret code contained in the detected spam email is "I hate best Festival".

Outlines

00:00

🎓 Introduction to Advent of Cyber and the learning objectives

Sakib joins this year's Advent of Cyber, an event that offers something new to learn and prizes to win every day from 1 December to 24 December, plus a certificate at the end. This year covers a wide range of topics including machine learning, malware analysis, penetration testing, digital forensics, and incident response. The Day 15 task focuses on machine learning and is titled "Jingle Bell spam: machine learning saves the day", about detecting spam emails.

05:02

📚 Steps of the machine learning pipeline

Sakib works through the machine learning project in Jupyter Notebook, explaining the process of loading the dataset, pre-processing the data, extracting features, splitting the data into training and testing sets, applying and evaluating the machine learning model, and finally deployment. In the pre-processing step, CountVectorizer is used to convert the text data into a numeric format, turning the dataset into a numeric matrix.

10:04

🔍 Splitting the dataset and choosing the model

The dataset is split into training and testing subsets and a Naive Bayes classifier is trained. After training, the model's performance is evaluated on the test dataset and metrics such as precision and recall are calculated. The model's predictive power is then tested with test emails and the results are analysed.

15:06

🛠️ Evaluating and testing the model

The trained Naive Bayes model is applied to the test dataset and the results are evaluated. The model predicts spam and ham accurately, with high precision and recall. One of the test emails is marked as spam and contains a secret code.

20:08

📈 Improving and deploying the model

To improve the model's performance, suggestions include changing the proportion of test data and increasing the size of the dataset. It is also emphasised that the model should be monitored in a real environment and that user feedback should be collected to identify its weaknesses and keep improving it.

25:09

🏆 Wrapping up the task and looking ahead

Sakib encourages participants to enjoy the task and to check out the other modules as well. He also suggests that the model's performance can be improved by adjusting the proportion of test data or the size of the dataset, and that using more data reduces the chance of false positives. He closes by thanking the viewers and saying he looks forward to seeing them again.

Keywords

💡Advent of Cyber

Advent of Cyber is an event that provides daily cyber security learning objectives from 1 December to 24 December. It offers the chance to learn something new and also to win great prizes. The video focuses on the event's Day 15 machine learning task.

💡Machine learning

Machine learning is the process by which a computer learns from data and acquires the ability to perform a task. In the video, machine learning is applied to building a spam email detector, and the process is explained in detail. It is used to convert the email bodies into numeric data and to classify emails as spam or ham.

💡Spam email detector

A spam email detector is software used to filter out spam emails. In the video, a machine learning model is used to build a spam email detector to deal with spam that is no longer being filtered. This is the main topic of the video and one of the important applications of machine learning in cyber security.

💡Dataset

A dataset is a collection of data used to train a machine learning model. In the video, a dataset collected from different sources is provided and used to train the model. It has two columns, the classification and the message body, and it is used to build the spam email detector.

💡Data pre-processing

Data pre-processing is the process of converting raw data into a clean, organised format, and it is an important step in machine learning. In the video, the email text is pre-processed by cleaning it, converting it to lowercase, and removing stopwords. This gives the machine learning model better data to work with and improves its performance.

💡Feature engineering

Feature engineering is the process of creating new features or modifying existing ones to improve a model's performance. In the video, a feature engineering technique called count vectorization is used to convert the text data into numeric data, which allows the machine learning model to understand the text and detect spam emails effectively.

💡Count vectorization

Count vectorization is a technique for converting text data into numeric data. In the video, it is used to convert the email bodies into a matrix of token counts, so the machine learning model can treat the text as numbers when classifying emails as spam or ham.

💡Train/test split

The train/test split is the pipeline step in which the dataset is divided into training and testing subsets. In the video, 20% of the dataset is randomly split off for testing and the remaining 80% is used to train the model. This way the model is tested on unseen data and its predictive power can be evaluated.

💡Naive Bayes classifier

The Naive Bayes classifier is a machine learning algorithm based on a probabilistic method and is commonly used for text classification. In the video, a Naive Bayes classifier is trained on the email text data and used to detect spam. The algorithm calculates the probability of each word appearing in spam emails and uses those probabilities to classify emails.
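
A rough, self-contained illustration of the word-frequency idea described above (this is not the sklearn code used in the video; the tiny word counts and priors are invented for the example):

    import math

    # Hypothetical word counts observed in a labelled training set (invented numbers)
    spam_counts = {"free": 30, "win": 25, "offer": 20, "toy": 2}
    ham_counts  = {"free": 3,  "win": 2,  "offer": 1,  "toy": 10}
    spam_total, ham_total = 500, 2000      # total words seen in each class
    p_spam, p_ham = 0.2, 0.8               # assumed class priors

    def class_score(words, counts, total, prior, vocab_size=10_000):
        # Naive Bayes: log prior plus sum of log word probabilities (add-one smoothing)
        score = math.log(prior)
        for w in words:
            score += math.log((counts.get(w, 0) + 1) / (total + vocab_size))
        return score

    email = ["win", "a", "free", "toy", "now"]
    spam_score = class_score(email, spam_counts, spam_total, p_spam)
    ham_score  = class_score(email, ham_counts,  ham_total,  p_ham)
    print("spam" if spam_score > ham_score else "ham")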

💡Model evaluation

Model evaluation is the process of measuring the performance of a trained machine learning model. In the video, the model makes predictions on the test dataset and metrics such as accuracy, precision, recall, and F1 score are calculated. This shows whether the model can detect spam effectively and whether there is room for improvement.

💡Model deployment

Model deployment is the process of putting a trained and evaluated machine learning model into a real environment. In the video, the spam detector model is tested and could then be deployed to an actual mail system. Deployment is the final step that shows the model can solve a real-world problem.

Highlights

Sakib is excited to be part of the Advent of Cyber by TryHackMe, which offers daily learning objectives and prizes from December 1st to 24th.

The event covers various topics including machine learning, malware analysis, penetration testing, digital forensics, and incident response.

Day 15 focuses on machine learning with the task titled 'Jingle Bell Spam Machine Learning Saves the Day'.

The task involves building a spam email detector using machine learning to address an influx of spam emails at a company following a merger.

A sample dataset is provided for training the machine learning model, emphasizing the application of machine learning in cybersecurity.

The learning objectives include understanding the machine learning pipeline, classification, training models, data splitting, and model evaluation.

The lab environment uses Jupyter Notebook, which is introduced as a tool for easy machine learning project work.

The dataset consists of two columns: classification (spam or ham) and message (email body).
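
For reference, loading and inspecting a dataset like this with pandas looks roughly as follows; the file name and column names are assumptions based on what is shown in the lab, and the actual notebook cells may differ slightly:

    import pandas as pd

    # Load the labelled emails (assumed columns: Classification, Message)
    data = pd.read_csv("emails_dataset.csv")
    df = pd.DataFrame(data)

    print(df.head())   # first five rows
    print(df.tail())   # last five rows
    print(df.shape)    # (number of emails, number of columns)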

Data pre-processing involves cleaning and structuring the data, with techniques such as removing punctuation and stopwords.
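
The notebook leans on CountVectorizer for most of this, so the following is only a sketch of the cleaning techniques mentioned (lowercasing, stripping punctuation, dropping stopwords); the stopword list is a tiny made-up sample, and df with its Message column are assumed from the sketch above:

    import string

    STOPWORDS = {"and", "or", "is", "this", "that", "the", "a", "to"}  # illustrative only

    def clean_text(text: str) -> str:
        text = text.lower()                                                 # normalise case
        text = text.translate(str.maketrans("", "", string.punctuation))   # strip punctuation
        return " ".join(w for w in text.split() if w not in STOPWORDS)     # drop stopwords

    df["Message"] = df["Message"].apply(clean_text)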

CountVectorizer is used to convert text data into a numeric format that machine learning models can understand.
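
The CountVectorizer step is presumably along these lines (continuing from the pandas sketch above):

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    # Learn the vocabulary from the email bodies and build the token-count matrix
    X = vectorizer.fit_transform(df["Message"])
    print(X)  # sparse (email_index, word_index) -> count entries, like the notebook output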

The data is split into training and testing subsets, with 20% of the dataset reserved for testing.
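
Holding out 20% of the data for testing, as described, is a single call in scikit-learn; the variable and column names mirror the walkthrough, and random_state is added here only to make the sketch reproducible:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, df["Classification"], test_size=0.20, random_state=42
    )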

Naive Bayes is chosen as the text classification model for training the spam detector.

The Naive Bayes algorithm calculates the probability of an email being spam based on the frequency of certain words.
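
In the notebook this training step presumably uses scikit-learn's MultinomialNB, roughly as follows (continuing from the split sketch above):

    from sklearn.naive_bayes import MultinomialNB

    clf = MultinomialNB()        # Naive Bayes classifier for count features
    clf.fit(X_train, y_train)    # learn word-frequency statistics from the training split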

The model is evaluated using metrics like accuracy, precision, recall, and F1 score on the test dataset.
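
The metrics described here map onto scikit-learn's classification_report; a sketch:

    from sklearn.metrics import classification_report

    y_pred = clf.predict(X_test)                  # predict on the held-out 20%
    print(classification_report(y_test, y_pred))  # precision, recall, F1 and support per class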

The task includes a practical test where the trained model predicts whether random email texts are spam or ham.

The model's effectiveness is tested on a set of emails provided in a CSV file.
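
Running the trained model against the provided test_emails.csv file looks roughly like this; the Messages column name is assumed from the lab, and the fitted vectorizer must be reused with transform (not fit_transform) so the new messages share the training vocabulary:

    test_data = pd.read_csv("test_emails.csv")
    X_new = vectorizer.transform(test_data["Messages"])

    predictions = clf.predict(X_new)
    results = pd.DataFrame({"Messages": test_data["Messages"],
                            "Prediction": predictions})
    print(results)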

Continuous monitoring and feedback are essential for improving the model's performance in a real-world environment.

To enhance the model, one can adjust the percentage of test data, increase the dataset size, or use different machine learning algorithms.
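
As a quick experiment along those lines, one could re-split with a larger test fraction or swap in a different classifier and compare the reports; this sketch continues from the variables defined in the earlier sketches:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Hold out 30% instead of 20% and try an alternative text classifier
    X_train2, X_test2, y_train2, y_test2 = train_test_split(
        X, df["Classification"], test_size=0.30, random_state=42
    )
    alt_clf = LogisticRegression(max_iter=1000).fit(X_train2, y_train2)
    print(classification_report(y_test2, alt_clf.predict(X_test2)))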

The task concludes with a congratulatory message and an invitation to explore other modules for further learning.

Transcripts

play00:01

hello everyone my name is sakib and I am

play00:05

super excited to be part of this year's

play00:07

Advent of Cyber by TryHackMe Advent

play00:10

of cyber brings you with amazing

play00:13

learning objectives daily from 1st

play00:16

December to 24th December and each day

play00:19

we get a chance to learn something new

play00:21

and also get a chance to win amazing

play00:24

prizes and at the end once we complete

play00:27

all the tasks we will also get the CER

play00:30

certificate this year we are covering

play00:32

different topics from machine learning

play00:35

malware analysis penetration testing and

play00:38

digital forensics and incident response

play00:41

today we are going to cover day 15 task

play00:44

which is on machine

play00:46

learning so let's and the title of

play00:49

today's task is Jingle Bell spam machine

play00:53

learning saves the day before starting

play00:56

I'll start the lab because it will take

play00:58

around 3 to 5 minutes minutes to load so

play01:03

let's give it give it some time to load

play01:06

and read the task out okay so over the

play01:09

past few weeks best Festival company

play01:12

employes have been receiving an

play01:13

extensive number of spam emails these

play01:16

emails are trying to lure users into the

play01:18

Trap of clicking on links and providing

play01:22

credentials spam emails are somehow

play01:25

ending up in mail mailing box this is

play01:29

interesting it looks like the spam

play01:31

detector in place before being the

play01:34

merger has been disabled or damaged

play01:37

deliberately suspicion is on McGreedy who is

play01:41

not so happy with the merger problem

play01:45

statement McSkidy has been tasked with

play01:48

building a spam email detector using

play01:50

machine learning she has been provided

play01:52

with a sample data set collected from

play01:55

different sources to train the machine

play01:57

learning model so machine learning

play02:00

has so many applications in cyber

play02:03

security domain and building spam

play02:06

detector is one of them and in today's

play02:10

task we will have some Theory to read

play02:14

which may get you boring but uh bear

play02:16

with me it's very interesting topic and

play02:19

we have uh the task is divided into some

play02:22

some steps so that it's easy to

play02:25

understand and

play02:26

absorb learning objectives um in this

play02:29

task we will explore different steps in

play02:31

generic machine learning pipeline

play02:33

machine learning classification and

play02:35

training models how to split the data set

play02:39

into training and testing data how to

play02:41

prepare the machine learning model how

play02:43

to evaluate the model's

play02:46

Effectiveness lab connections these are

play02:48

some instructions to connect with the

play02:49

lab all you have to do is press the

play02:52

green start machine button that what we

play02:55

have already done the lab will start on

play02:58

the right side in a split screen which

play03:01

we are seeing right now and it will take

play03:03

around 3 to 5 minutes to

play03:05

load overview of jupyter notebook in

play03:08

this room we will be using jupyter

play03:12

notebook as our lab because it's very

play03:15

easy to work on machine learning

play03:18

projects uh while we are working on each

play03:21

command we are running each command uh

play03:24

in front of us

play03:26

so Jupyter Notebook is also covered in

play03:30

task two so I hope that you have already

play03:33

have some exposure to Jupyter Notebook

play03:36

so um it's important to recall that we

play03:39

will need to run the code from the cells

play03:42

using the Run button uh there are two

play03:45

ways to run the code on present in the

play03:49

cells so this is the layout of Jupyter

play03:51

Notebook on the right side of the screen

play03:53

uh it contains this is the notebook and

play03:55

it contains the cells and in cell we

play03:57

have the code uh and on the left side we

play04:01

have the files that we will be working

play04:03

on so there are two ways one we can

play04:06

press the play button or uh we can also

play04:10

have the shortcut shift enter to execute

play04:13

the

play04:15

commands Okay um exploring machine

play04:17

learning pipeline machine learning

play04:20

pipeline refers to a series of steps

play04:23

involved in building and deploying

play04:25

machine learning model these steps

play04:27

ensure that the data flows effectively

play04:30

from its raw form to predictions and

play00:33

insights a

play04:35

typical pipeline would include

play04:38

collecting data from different sources

play04:40

in different forms uh pre-processing it

play04:43

and Performing feature extraction from

play04:45

the data splitting the data into testing

play04:48

and training data and then applying

play04:50

machine learning model for predictions

play04:53

and once it's done uh we can then deploy

play04:57

it in a testing or deploying uh

play04:59

production in

play05:01

mind okay so it looks like the lab has

play05:06

almost started so this is the Jupyter

play05:09

notebook and we have three files one The

play05:13

Notebook itself and the two uh data data

play05:17

sets one emails uncore data set and if

play05:19

we double click we can see that it it

play05:22

has two columns some this is something

play05:24

that we'll explore in in a bit okay so

play05:28

the the zero step would be importing the

play05:31

required Library so what we are going to

play05:33

do okay first I I'll click here so that

play05:35

we have some extra space okay so the

play05:39

first thing will be importing the

play05:41

required libraries right now we need two

play05:45

libraries one numpy and pandas numpy

play05:48

deals with the numeric computation in

play05:51

Python and pandas provide high level

play05:54

data structure and the methods designed

play05:56

to make data analysis fast and easy in

play05:59

Python

play06:00

so what I'll do I'll press shift enter

play06:04

and it will run those commands so if

play06:07

there is an asterisk it means the command

play06:10

is still loading it's taking time but if

play06:12

when it's it's it's gone it means that

play06:15

uh the the commands within the cell has

play06:18

been executed

play06:20

successfully so the first step would be

play06:23

data collection let's read it out data

play06:26

collection is a process of gathering Raw

play06:29

data from various sources to be used for

play06:32

machine learning this data from uh data

play06:36

can originate from various sources such

play06:38

as database text files apis on

play06:40

repositories sensors Etc and here we

play06:44

have a CSV file what we'll do we'll use

play06:48

this uh this data set for our analysis

play06:51

and let's load that so we have the data

play06:54

we need to import that uh uh load that

play06:58

so what I'll do I'll press shift enter

play07:00

again and the data has been loaded

play07:03

successfully uh in the variable data so

play07:06

let's check the data out I'll press the

play07:09

shift enter again and um the head

play07:13

command prints only the top five values

play07:17

and if I want the last five values I'll

play07:20

add tail uh data. tail uh to print that

play07:24

out so the data data set looks like it

play07:28

has two columns classification and the

play07:30

message message contains email body and

play07:35

classification spam or ham so if a if if

play07:39

a message or the email is Spam it will

play07:43

have the classification as spam

play07:46

and so on okay uh data frames provide a

play07:51

structured tabular representation of the

play07:53

data that's intuitive and easy to read

play07:56

using the command below we'll convert the

play07:58

data

play07:59

into Data

play08:01

frame okay uh it will make the data easy

play08:05

to an analyze and

play08:07

manipulate let's run this command this

play08:11

will convert the data set uh into Data

play08:13

frame and print it out uh it has

play08:16

converted now with this command and it

play08:19

looks like we have now we have a total

play08:22

of 4451 rows which are records or emails

play08:28

and two columns which are classification

play08:32

and message let's move on to the next

play08:36

step so uh in this task we have divided

play08:39

the task into small steps so that it's

play08:42

easy for us to understand what we are

play08:45

doing actually so the the second step is

play08:47

data

play08:49

pre-processing let me uh read it out

play08:52

data pre-processing refers to the

play08:54

techniques used to convert raw data into

play08:57

clean organized understandable and

play09:00

structured format suitable for machine

play09:02

learning given that the raw data is

play09:05

often messy inconsistent incomplete

play09:08

pre-processing is an essential step to

play09:10

ensure that the data we are feeding into

play09:12

the machine learning model is relevant

play09:14

and of high quality here are some of the

play09:17

techniques used for data process uh

play09:20

pre-processing so there are so many

play09:21

techniques that once we have the data we

play09:24

need to clean it for example in in in

play09:28

case of emails uh emails we need to uh

play09:31

first remove all the punctuations all

play09:34

the comma full stop Etc and um the words

play09:38

like and or is this that because they do

play09:43

not add any value uh in machine learning

play09:47

understanding towards the classification

play09:49

of email being spam or ham because they

play09:52

are common and they cannot make any any

play09:55

impact or any difference so we have we

play09:57

have to remove them we have to remove D

play09:59

values Etc so this is the process of uh

play10:03

pre-processing uh feature extraction we

play10:06

can extract the features uh data uh text

play10:09

preprocessing Tok tokenization these are

play10:13

different ways we can apply in in

play10:16

different uh situations feature uh

play10:19

engineering creating new features or

play10:21

modifying existing ones to improve model

play10:26

performance utilizing count vectorizers in

play10:29

this task the pre-processing thing we

play10:32

will be using is Count vectorizer which

play10:35

is which can be used to convert num uh

play10:38

text into

play10:40

numbers uh why because machine learning

play10:43

model understands numbers not text this

play10:46

means that the text needs to be

play10:48

transformed into numeric format count

play10:51

vectorizer is a class provided by the sklearn

play10:54

library in Python it achieves achieve

play10:57

this by converting the text text into

play10:59

a token count

play11:02

Matrix it is used to prepare the data

play11:05

for the machine learning models to use

play11:07

and make predictions on okay so it it will

play11:12

it will be more clear when we get um

play11:15

execute these commands and understand

play11:17

the output so what I'll do I'll execute

play11:20

these commands and then break it down

play11:21

for

play11:22

you okay so what we are going to do

play11:25

first we are going to

play11:27

import

play11:30

count vectorizer from

play11:33

sklearn and then assigning it to a

play11:36

variable

play11:37

vectorizer and then what we are going to

play11:40

do uh the data frame which contains DF

play11:44

that we we had defined earlier which

play11:46

contains the column message we are going

play11:48

to transform using the fitore transform

play11:52

function to transform into numeric

play11:55

values so message is has the text and

play12:00

now if you look if you execute

play12:02

this it will be saved in the variable X

play12:06

and let's PR print it out and see the

play12:09

output okay this is the output which

play12:12

looks kind of confusing but uh let me

play12:14

explain we have the emails the email at

play12:18

zero index in the email at zero index

play12:22

the word at the index 6653 has occurred

play12:27

once similarly in the email in the email

play12:32

um document or email zero index the word

play12:37

in the index 6733 has occurred three

play12:41

times so it is calculating the count of

play12:46

the number of occurrence of of any word

play12:49

for example uh in spam word spam the

play12:52

word uh congratulations may have

play12:55

occurred six times so it would increase

play12:57

the probability so what this has done it

play13:01

has converted our email into numeric

play13:04

Matrix so that we can perform machine

play13:07

learning um um predictions and model on

play13:11

we can train on this data set now so now

play13:15

we have the data set in in in uh numeric

play13:19

representation that's that's very

play13:21

interesting and very important topic

play13:23

very important step uh the third step so

play13:26

the third step is now that we have the

play13:29

data set of 4451 I guess U numeric

play13:33

Matrix matrices what we need to do in

play13:37

the next step we need to divide that

play13:40

data set into two

play13:41

parts why because we want to train our

play13:45

model on one subset and then test our

play13:49

model on the second subset it's

play13:52

important to test the model's

play13:54

performance on unseen data by splitting

play13:57

the data we we can train our model on

play14:00

the subset on one subset and test its

play14:03

performance on another um this is a very

play14:07

clear explanation or uh image like we

play14:12

have the data set we are going to split

play14:14

that into two parts so what we are going

play14:17

to do uh these are the commands let me

play14:20

again run this command and then I'll

play14:23

break it down one by one what we are

play14:25

going to do we are going to

play14:30

first import the function Trainor test

play14:34

split I'll press shift

play14:37

enter and then what I'm going to do uh

play14:41

there are two variables capital x

play14:44

contains the numeric representation of

play14:47

the data set or the email body and now

play14:50

Y contains the data frame of

play14:52

classification spam or ham so and we are

play14:56

going to use this and we we will be

play14:59

providing three

play15:02

variables capital x the numeric uh

play15:06

representation of email body Y the the

play15:10

classification data set data frame and

play15:13

third the test size that is that we have

play15:16

set for uh 20% or 0.20 so what it it will

play15:20

do it will split randomly 20% of the

play15:24

data set um for test testing purpose and

play15:30

the remaining will be something that we

play15:32

will use and the variables are X_

play15:36

train and y_train these are the two

play15:39

variables we will be training over test

play15:42

over over model on and when we are going

play15:45

to test we will use X test and Y test

play15:50

and this has been explained here as well

play15:53

X train the subset for the features to

play15:55

be used for training y train the

play15:58

corresponding labels for X train set and

play16:02

uh similarly for others two as well now

play16:05

we have splitted over let me let me run

play16:11

this so we have successfully splited our

play16:15

data into two parts and the variables um

play16:18

the results are stored

play16:20

here next step is like this is the next

play16:24

step so we are moving one step at a

play16:27

time

play16:29

model training now that we have the data

play16:32

set ready The Next Step would be to

play16:33

choose the text classification model and

play16:36

use it to train on the given data set

play16:39

some commonly used text classification

play16:41

models are explained below uh Naive Bayes

play16:45

classification SVM logistic regression

play16:48

decision trees these are all the uh

play16:51

classification models that are used uh

play16:55

to train on the text based uh data set

play16:59

and in this task we will be using Naive

play17:02

Bayes so let Let's uh dive in straight to

play17:05

the Naive Bayes Naive Bayes is a

play17:09

statistical method that uses the

play17:11

probability of certain words appearing

play17:13

in spam and in ham emails to determine

play17:18

whether a new email is Spam or not let's

play17:22

break it down how this actually works so

play17:26

how how Naive Bayes classification works

play17:28

let's say we have a bunch of emails some

play17:31

labeled as spam and some as

play17:34

ham just like this data set the Naive Bayes

play17:38

algorithm learns from these emails it

play17:41

looks at the words in each email and

play17:44

calculates how frequently each word

play17:47

appears in spam or ham emails for

play17:51

instance words like free win offer

play17:54

Lottery may appear more in spam emails

play17:58

so if we take a look at uh this data set

play18:01

we we can see that for example uh in

play18:03

spam there's a word a award maybe award

play18:07

is is used frequently in multiple spam

play18:10

so it will remember

play18:11

that the Naive Bayes algorithm calculates

play18:14

the probability of the emails being

play18:17

spamed based on the words it

play18:20

contain when a model is trained with

play18:23

Naive Bayes and gets a new email that says

play18:26

for example win a free toy now then it

play18:30

thinks that this is where it's training

play18:33

itself win often appears in spam so

play18:36

this increases the chance of email being

play18:39

spamm free is also very common in spam

play18:43

further increases the spam

play18:46

probability may be neutral often

play18:49

appearing in both spam and H after

play18:53

considering all the words it calculates

play18:55

the overall probability of the email

play18:58

being spam or

play18:59

ham if the calculated probability of

play19:03

spam is higher than that of ham the

play19:07

algorithm classifies the email as spam

play19:09

otherwise it's classified as H okay

play19:13

let's use Naive Bayes to train the model as

play19:18

explained below so um this is this is

play19:21

how Naive Bayes works let's execute these

play19:24

commands and then I'll break down step

play19:28

by so in the next step what we are going

play19:30

to do we are going to import

play19:34

MultinomialNB which is the function in

play19:38

the sklearn.naive_bayes uh library

play19:42

so what we are going to do we are going

play19:44

to import this and then assigning it to

play19:47

the variable classification so class clf

play19:50

would uh would remember or would would

play19:53

be used as the Naive Bayes model it will be

play19:56

used to train H in the next step if you

play19:59

look at we we are trying to use this to

play20:03

fit or train on these two variables

play20:07

which contain the numeric representation

play20:10

of the emails email body and uh the

play20:15

classification of either spam or ham so

play20:18

what it's doing it's it's working on

play20:20

looking at this data set and uh trying

play20:24

to remember the numeric representation

play20:25

of this data set and trying to remember

play20:28

what words what's the frequency of which

play20:31

word and which word word were more

play20:33

frequent in spam so that it remembers

play20:36

what's next so okay uh let me execute

play20:40

this it it's trying to remember based on

play20:44

what we are producing what we are U

play20:47

giving it as a as a as an

play20:49

input Next Step would be F fifth

play20:53

step model

play20:57

evaluation

play20:58

so in model evaluation we will evaluate

play21:02

if our model has predicted correctly so

play21:05

what we are going to do we are going now

play21:08

we are going to provide it the test data

play21:13

set so after training it's essential to

play21:17

evaluate the model's performance on the

play21:19

test set to check its predictive power

play21:22

this will give you the metrics such as

play21:25

accuracy precision and recall let's let

play21:28

me execute these commands and then I'll

play21:32

uh we look at the output and we'll uh

play21:34

try to break it down one by

play21:38

one so I'll execute these commands shift

play21:43

enter okay so let me break it down we

play21:47

are importing classification_

play21:50

report from sklearn.metrics and then we are

play21:56

providing our classification with to

play22:00

asking it to predict on X_test data

play22:04

set which is the data set that we had

play22:07

splitted before and then we are printing

play22:09

the classification report on y test and

play22:13

Y predict so what we what it has PR

play22:16

predicted and we are uh printing that

play22:19

out so this is the output uh let me

play22:22

understand read it out so it says that

play22:26

for for him the model predicts it's not

play22:29

a Spam it's correct n 99% of the time

play22:34

for for for the spam if the model says

play22:37

that it's spam the model is correct 90%

play22:42

of the time similarly for the for the

play22:44

ham for the recall the model captures

play22:47

99% of non spam or ham messages and

play22:51

similarly for uh the spam it captures

play22:55

96% of actual spam F1 is the balance

play23:00

score between these two uh of these two

play23:03

support is like we have

play23:08

789 ham messages in the in the test data

play23:11

set and 102 total spam messages in the

play23:16

test data set and these are the

play23:18

averages okay so this is the output

play23:22

means we have tried to predict our model

play23:25

predict on on the test dat data set what

play23:28

we have done we have first uh trained

play23:30

our model uh which is saved on clf uh

play23:34

classification and it remembers then it

play23:37

it tries to predict on the test data set

play23:40

that we have provided next so

play23:43

the each one is explained below okay so

play23:48

now it's time to test our model we have

play23:51

uh we have evaluated it's time to test

play23:55

our

play23:56

model so how how we are going to do we

play23:59

are going to provide it with with with

play24:02

some random email text that we may we

play24:06

think that it may be a Spam or ham we

play24:08

need to test first we will need to

play24:11

transform them transform the message and

play24:14

then provide that message to our

play24:17

classifier and then we'll see let's

play24:20

print out if the message what what's the

play24:22

prediction so what's next step number

play24:26

six okay I'll press shift enter and it

play24:30

says it's it's a Spam why because it's

play24:34

looking at it it says that today's offer

play24:37

claim your this and that so it has

play24:40

calculated based on the probability of

play24:43

the words like offer claim worth

play24:46

discount vouchers Etc which appear more

play24:50

often in spam than in normal messages

play24:53

there there's a chance of false positive

play24:55

but uh let's leave that out for now

play24:58

okay so that's it that's our task and

play25:03

what's next McSkidy is happy that a workable

play25:06

spam detector model has been developed

play25:09

she has provided us with some test

play25:11

emails in the file test emails. CSV and

play25:15

wants us to run the prepared model

play25:18

against these emails to test our model

play25:22

results so we have if look we have a

play25:25

test emails uh file which is only

play25:33

messages okay so we are going to it says

play25:37

that update the following code to

play25:39

include the test email files and run the

play25:41

train train model against these emails

play25:44

so what we are going to do we are going

play25:46

to

play25:48

add the

play25:54

test

play25:56

test

play26:00

emails

play26:02

dot CSV

play26:12

and okay the test file has been loaded

play26:16

and we need to change this as well it

play26:20

would be

play26:22

testore data because that's the variable

play26:25

where we are putting this CSV file so

play26:29

let's load this and print the top five

play26:32

values so uh this is we we only have uh

play26:35

the messages which contains email okay

play26:39

uh we are going to transform our

play26:42

messages and into numeric and um Matrix

play26:46

and then uh we will apply the prediction

play26:49

so what we are going to do we we are

play26:52

using the vectorizer to transform the

play26:56

test data

play26:58

and and the column messages and then

play27:03

applying the prediction asking our train

play27:06

model to predict on this me uh this uh

play27:11

data set so let's run

play27:13

this and let's try our result

play27:18

out excellent so what we have done uh

play27:21

what we have got we have got the

play27:23

prediction on those messages uh those

play27:26

email board so this is the message and

play27:30

this is the spam and excellent so we

play27:34

have G okay so that's it we have got the

play27:38

result now it's time for the conclusion

play27:41

that's it from the task and from the

play27:43

Practical point of view we have to

play27:45

consider the following points to ensure

play27:47

the effectiveness and reli reliability

play27:49

of the model there are so many uh things

play27:52

to do uh some of them are mentioned here

play27:54

like continuously monitor the models

play27:57

performance on a test data or in a

play27:59

real environment collect feedback from

play28:02

users and regarding regarding false

play28:04

positives use this feedback to understand

play28:06

the model's weaknesses and the areas to

play28:08

improve deploy the model into production

play28:11

so there there couple of steps that we

play28:12

need to do on what's next and uh next

play28:16

after after we have uh predicted the

play28:19

model and prepared it and next time is

play28:22

to deploy it in the in the production ex

play28:26

so let's let's solve this so what is the

play28:31

key first step in the machine learning

play28:33

pipeline that's should that should be

play28:35

very simple it's data

play28:40

collection

play28:42

submit okay uh which data pre-processing

play28:45

feature is used to create new features

play28:48

or modify existing one to improve the

play28:51

models performance so let's scroll down

play28:54

scroll up and the these are the model

play28:59

classifications and I think if we read

play29:02

them out it's feature engineering

play29:04

because it's used to create new features

play29:07

or modify existing ones to improve model

play29:10

performance so this should be the second

play29:22

answer excellent during the data splitting

play29:26

step 20% of the data set was split for

play29:28

testing what is the percentage weightage

play29:31

average for precision of spam detection

play29:34

so let's find

play29:39

out the weightage average

play29:43

that's

play29:46

98

play29:47

[Music]

play29:49

0.98 so um next question is how many of

play29:53

the test emails are marked as spam um I

play29:57

think I think it's two or three if I'm

play29:59

not wrong

play30:03

it's okay it's one 2 3

play30:13

and one of the emails that is detected

play30:17

as spam contains a secret code what's

play30:20

the code so these are the two spam

play30:25

emails let's look at this and one of the

play30:29

emails should contain the flag secret

play30:32

code okay this is the secret code

play30:36

um it's it says I hate best

play30:41

Festival let's

play30:43

[Music]

play30:47

change I

play30:54

hate

play30:56

best

play31:00

excellent so uh that's it if you enjoyed

play31:02

this room please check out other phishing

play31:04

modules as well complete so we have

play31:06

completed the task congratulations and

play31:09

just as a final uh this is not it what

play31:12

we need to do we need to improve our uh

play31:15

models performance how we can do there

play31:17

are a couple of steps we can change the

play31:20

percentage of the test data uh from 20%

play31:24

to let's say 30% to see how our

play31:26

prediction model works or we can also

play31:29

increase the size of the data set

play31:31

because we are in a testing environment

play31:33

we can use uh limited uh data set but

play31:37

the more size of the data set we'll have

play31:39

uh there is the more chance of less uh

play31:43

false positives let's it okay I hope you

play31:46

enjoyed this uh little task and um see

play31:51

you around thank you so much
