Advent of Cyber: Day 15 - Building an Email Spam Detector with ML
Summary
TLDR: In this video, Sakib takes part in this year's Advent of Cyber event, which runs from December 1st to 24th and offers the chance to learn something new every day and win great prizes. He covers the Day 15 task on machine learning, titled "Jingle Bell Spam", and walks through building a spam email detector: training a machine learning model on a dataset, evaluating its effectiveness, and finally validating the model's performance against a set of test emails. He also suggests ways to improve the model, such as changing the test-data split ratio or increasing the size of the dataset.
Takeaways
- 📅 This video covers Advent of Cyber, which runs from December 1st to 24th.
- 🎓 Learning objectives are provided daily, and participants get the chance to learn something new and win great prizes.
- 📈 The event covers a range of topics, including machine learning, malware analysis, penetration testing, digital forensics, and incident response.
- 📝 The Day 15 task is about building a spam email detector using machine learning.
- 🏢 Employees of the Best Festival Company have recently been receiving a large number of spam emails in their mailboxes.
- 👩‍💻 McSkidy has been tasked with training a machine learning model on a sample dataset to build a spam email detector.
- 🔍 We explore the steps of the machine learning pipeline, split the dataset into training and test data, and learn how to evaluate the model.
- 📚 There is some theory to read, but bear with it; it is an important topic.
- 🔢 Two libraries, NumPy and Pandas, handle numeric computation and data structures.
- 📝 Jupyter Notebook makes it easy to work on machine learning projects.
- 🔑 Data pre-processing is the technique of converting raw data into a clean, organized format suitable for a machine learning model.
- 📈 CountVectorizer converts text into a numeric format that machine learning models can use.
- ✂️ The dataset is split into training and test data to test the model's performance.
- 🤖 Naive Bayes, a statistical method, is used to determine whether a new email is spam.
- 📊 Model evaluation uses metrics such as accuracy, precision, recall, and F1 score.
- 📧 The trained model is run against a set of test emails and the predictions are checked.
- 🔧 The model's performance can be improved by changing the test-data split ratio or increasing the dataset size.
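The takeaways above trace one end-to-end scikit-learn workflow. As a minimal sketch with toy data invented here (not the room's dataset), the whole pipeline fits in a few lines:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the room's emails dataset (message text + spam/ham label).
messages = [
    "win a free prize now", "claim your free lottery offer",
    "meeting at noon tomorrow", "lunch with the team today",
    "free discount voucher inside", "project status update attached",
    "congratulations you won a prize", "see you at the office",
]
labels = ["spam", "spam", "ham", "ham", "spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()               # text -> token-count matrix
X = vectorizer.fit_transform(messages)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

clf = MultinomialNB().fit(X_train, y_train)  # train the Naive Bayes model
accuracy = clf.score(X_test, y_test)         # fraction of correct predictions
```

With only eight samples this is illustrative, not meaningful; the room's dataset has thousands of rows, which is what makes the evaluation metrics trustworthy.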
Q & A
When does Advent of Cyber take place?
-Advent of Cyber provides daily learning objectives from December 1st to 24th.
What is the topic of today's task?
-Today's task is about machine learning and is titled "Jingle Bell Spam: Machine Learning Saves the Day".
What algorithm is used to detect spam emails?
-A Naive Bayes-based algorithm is used to detect spam emails.
Why is the dataset split into training and test data?
-Splitting the dataset lets the model learn from the training data and then be evaluated on unseen data using the test data.
Why is the data pre-processed before training?
-Pre-processing converts raw data into a format a machine learning model can understand and ensures the data is of high quality.
How does a Naive Bayes classifier decide whether a new email is spam?
-It looks at the words in each email, calculates how frequently each word appears in spam and in ham emails, and uses those probabilities to decide whether a new email is spam.
What metrics are used to evaluate the model's performance?
-Metrics such as accuracy, precision, recall, and F1 score are used to evaluate the model's performance.
How often does the model correctly predict spam on the test dataset?
-When the model predicts spam on the test dataset, it is correct 90% of the time.
How often does the model correctly predict ham (non-spam email) on the test dataset?
-When the model predicts ham on the test dataset, it is correct 99% of the time.
What is the first step in the machine learning pipeline?
-The first step in the machine learning pipeline is data collection.
What is feature engineering in data pre-processing?
-Feature engineering is the technique of creating new features or modifying existing ones to improve model performance.
How many of the test emails were marked as spam?
-Three of the test emails were marked as spam.
What is the secret code contained in a spam email detected among the test emails?
-The secret code contained in a detected spam email is "I hate best Festival".
Outlines
🎓 Introduction to Advent of Cyber and the learning objectives
Sakib introduces this year's Advent of Cyber, which runs from December 1st to 24th with new learning content and prizes every day, plus a certificate for completing all the tasks. This year covers a wide range of topics: machine learning, malware analysis, penetration testing, digital forensics, and incident response. The Day 15 task focuses on machine learning, titled "Jingle Bell Spam: Machine Learning Saves the Day", and is about detecting spam emails.
📚 Steps of the machine learning pipeline
Sakib works through the machine learning project in Jupyter Notebook, explaining the process of loading the dataset, pre-processing the data, extracting features, splitting the data into training and test sets, applying and evaluating a machine learning model, and deploying it. In the pre-processing step, CountVectorizer is used to convert the text data into a numeric matrix.
🔍 Splitting the dataset and choosing a model
The dataset is split into training and test sets, and a Naive Bayes classifier is trained. After training, the model's performance is evaluated on the test dataset using metrics such as accuracy, precision, and recall. The model's predictive power is then tested on sample emails and the results are analyzed.
🛠️ Evaluating and testing the model
The trained Naive Bayes model is applied to the test dataset and the results are evaluated: it predicts spam versus ham accurately, with high precision and recall. One of the test emails is flagged as spam and contains a secret code.
📈 Improving and deploying the model
To improve the model's performance, suggestions include changing the test-data split ratio and increasing the dataset size. It is also emphasized that monitoring the model in a real environment and collecting user feedback to identify and fix its weaknesses is important.
🏆 Wrapping up and looking ahead
Sakib encourages participants to enjoy the task and to check out the other modules. He also notes that adjusting the test-data ratio and dataset size, and training on more data, can reduce the chance of false positives. He closes by thanking viewers and looking forward to seeing them again.
Keywords
💡Advent of Cyber
💡Machine learning
💡Spam email detector
💡Dataset
💡Data pre-processing
💡Feature engineering
💡Count vectorization
💡Train/test split
💡Naive Bayes classifier
💡Model evaluation
💡Model deployment
Highlights
Sakib is excited to be part of the Advent of Cyber by TryHackMe, which offers daily learning objectives and prizes from December 1st to 24th.
The event covers various topics including machine learning, malware analysis, penetration testing, digital forensics, and incident response.
Day 15 focuses on machine learning with the task titled 'Jingle Bell Spam: Machine Learning Saves the Day'.
The task involves building a spam email detector using machine learning to address an influx of spam emails at a company following a merger.
A sample dataset is provided for training the machine learning model, emphasizing the application of machine learning in cybersecurity.
The learning objectives include understanding the machine learning pipeline, classification, training models, data splitting, and model evaluation.
The lab environment uses Jupyter Notebook, which is introduced as a tool for easy machine learning project work.
The dataset consists of two columns: classification (spam or ham) and message (email body).
Data pre-processing involves cleaning and structuring the data, with techniques such as removing punctuation and stopwords.
CountVectorizer is used to convert text data into a numeric format that machine learning models can understand.
The data is split into training and testing subsets, with 20% of the dataset reserved for testing.
Naive Bayes is chosen as the text classification model for training the spam detector.
The Naive Bayes algorithm calculates the probability of an email being spam based on the frequency of certain words.
The model is evaluated using metrics like accuracy, precision, recall, and F1 score on the test dataset.
The task includes a practical test where the trained model predicts whether random email texts are spam or ham.
The model's effectiveness is tested on a set of emails provided in a CSV file.
Continuous monitoring and feedback are essential for improving the model's performance in a real-world environment.
To enhance the model, one can adjust the percentage of test data, increase the dataset size, or use different machine learning algorithms.
The task concludes with a congratulatory message and an invitation to explore other modules for further learning.
Transcripts
Hello everyone, my name is Sakib and I am super excited to be part of this year's Advent of Cyber by TryHackMe. Advent of Cyber brings you amazing learning objectives daily from 1st December to 24th December. Each day we get a chance to learn something new and also a chance to win amazing prizes, and at the end, once we complete all the tasks, we will also get a certificate. This year we are covering different topics, from machine learning and malware analysis to penetration testing, digital forensics, and incident response. Today we are going to cover the Day 15 task, which is on machine learning. The title of today's task is "Jingle Bell Spam: Machine Learning Saves the Day". Before starting, I'll start the lab, because it will take around 3 to 5 minutes to load, so let's give it some time to load
and read the task in the meantime. Okay, so: over the past few weeks, Best Festival Company employees have been receiving an extensive number of spam emails. These emails are trying to lure users into the trap of clicking on links and providing credentials. Spam emails are somehow ending up in the mailbox. This is interesting: it looks like the spam detector that was in place before the merger has been disabled or damaged deliberately, and suspicion is on McGreedy, who is not so happy with the merger. Problem statement: McSkidy has been tasked with building a spam email detector using machine learning. She has been provided with a sample dataset, collected from different sources, to train the machine learning model. Machine learning has so many applications in the cyber security domain, and building a spam detector is one of them. In today's task we have some theory to read, which may get boring, but bear with me, it's a very interesting topic, and the task is divided into small steps so that it's easy to understand and
absorb. Learning objectives: in this task we will explore the different steps in a generic machine learning pipeline; machine learning classification and training models; how to split the dataset into training and test data; how to prepare the machine learning model; and how to evaluate the model's effectiveness. Lab connection: these are some instructions to connect to the lab. All you have to do is press the green Start Machine button, which we have already done. The lab will start on the right side in a split screen, which we are seeing right now, and it will take around 3 to 5 minutes to
load. Overview of Jupyter Notebook: in this room we will be using Jupyter Notebook as our lab, because it's very easy to work on machine learning projects with it; while we are working, every command we run is right in front of us. Jupyter Notebook is also covered in Task 2, so I hope you already have some exposure to it. It's important to recall that we will need to run the code from the cells using the Run button. This is the layout of Jupyter Notebook: on the right side of the screen is the notebook, which contains the cells, and in the cells we have the code; on the left side we have the files that we will be working on. There are two ways to run the code in the cells: we can press the play button, or we can use the shortcut Shift+Enter to execute the
commands. Okay: exploring the machine learning pipeline. A machine learning pipeline refers to a series of steps involved in building and deploying a machine learning model. These steps ensure that the data flows effectively from its raw form to predictions and insight. A typical pipeline would include collecting data from different sources in different forms, pre-processing it and performing feature extraction on the data, splitting the data into testing and training sets, then applying a machine learning model for predictions, and once that's done, deploying it with a testing or production environment in mind.

Okay, it looks like the lab has almost started. This is the Jupyter notebook, and we have three files: the notebook itself and two datasets, one of which is emails_dataset. If we double-click it, we can see that it has two columns; this is something we'll explore in a bit. The zeroth step is importing the required libraries. First I'll click here so that we have some extra space. Right now we need two libraries, NumPy and Pandas: NumPy deals with numeric computation in Python, and Pandas provides high-level data structures and methods designed to make data analysis fast and easy in Python. I'll press Shift+Enter and it will run those commands. If there is an asterisk, it means the command is still running; when it's gone, it means the commands within the cell have been executed
successfully. The first step is data collection. Let's read it out: data collection is the process of gathering raw data from various sources to be used for machine learning. This data can originate from sources such as databases, text files, APIs, online repositories, sensors, etc. Here we have a CSV file, and we'll use this dataset for our analysis, so let's load it. I'll press Shift+Enter again, and the data has been loaded successfully into the variable data. Let's check the data out: I'll press Shift+Enter again, and the head command prints only the top five rows; if I wanted the last five rows, I'd use data.tail to print those out. The dataset has two columns, Classification and Message: Message contains the email body, and Classification is spam or ham, so if a message (email) is spam, it will have the classification spam, and so on. DataFrames provide a structured, tabular representation of the data that's intuitive and easy to read; using the command below, we'll convert the data into a DataFrame, which makes the data easy to analyze and manipulate. Let's run this command: it converts the dataset into a DataFrame and prints it out. It looks like we now have a total of 4,451 rows, which are records (emails), and two columns, Classification and Message. Let's move on to the next step. In this task, the work is divided into small steps so that it's easy to understand what we are actually doing. The second step is data
pre-processing. Let me read it out: data pre-processing refers to the techniques used to convert raw data into a clean, organized, understandable, and structured format suitable for machine learning. Given that raw data is often messy, inconsistent, and incomplete, pre-processing is an essential step to ensure that the data we are feeding into the machine learning model is relevant and of high quality. Here are some of the techniques used for data pre-processing. There are many: once we have the data, we need to clean it. For example, in the case of emails, we first remove all the punctuation (commas, full stops, etc.) and stopwords like "and", "or", "is", "this", "that", because they do not add any value towards classifying an email as spam or ham; they are common and don't make any difference, so we remove them, along with duplicate values and so on. That's the cleaning part of pre-processing. There's also feature extraction, text pre-processing, and tokenization, which we apply in different situations, and feature engineering, which means creating new features or modifying existing ones to improve model performance. Utilizing CountVectorizer: in this task, the pre-processing tool we will be using is CountVectorizer, which converts text into numbers. Why? Because a machine learning model understands numbers, not text, which means the text needs to be transformed into numeric format. CountVectorizer is a class provided by the scikit-learn library in Python; it achieves this by converting the text into a token-count matrix. It is used to prepare the data for the machine learning models to use and predict on. It will be clearer once we execute these commands and look at the output, so let me execute them and then break it down for
you. Okay, so first we import CountVectorizer from sklearn and assign it to a variable, vectorizer. Then we take the DataFrame df that we defined earlier, which contains the Message column, and transform it using the fit_transform function into numeric values. Message holds the text; if you execute this, the result is saved in the variable X. Let's print it out and see the output. Okay, this output looks kind of confusing, but let me explain: for the email at index zero, the word at index 6653 has occurred once; similarly, in the same document (the email at index zero), the word at index 6733 has occurred three times. It is counting the number of occurrences of each word. For example, across spam emails the word "congratulations" might occur six times, which would increase the spam probability. So what this has done is convert our emails into a numeric matrix, so that we can train a machine learning model on this dataset. Now we have the dataset in a numeric representation. That's a very interesting and very important step.
The third step: now that we have the dataset as 4,451 numeric vectors, the next step is to divide it into two parts. Why? Because we want to train our model on one subset and then test it on the second subset. It's important to test the model's performance on unseen data: by splitting the data, we can train our model on one subset and test its performance on another. There's a very clear image of this: we have the dataset and we are going to split it into two parts. These are the commands; let me run them and then break it down. First we import the function train_test_split; I'll press Shift+Enter. There are two variables: capital X contains the numeric representation of the dataset (the email bodies), and y contains the DataFrame of classifications, spam or ham. We will be providing three arguments: X, the numeric representation of the email bodies; y, the classification DataFrame; and third, the test size, which we have set to 20%, or 0.20. It will randomly split off 20% of the dataset for testing purposes, and the remainder is what we train on. The variables X_train and y_train are the two we will be training our model on, and when we test, we will use X_test and y_test. This is explained here as well: X_train is the subset of features used for training, y_train the corresponding labels for the X_train set, and similarly for the other two. Let me run this: we have successfully split our data into two parts, and the results are stored here. On to the next step; we are moving one step at a time.
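The split just described can be sketched with toy stand-in data (the lists below are invented here, standing in for the vectorized emails and their labels):

```python
from sklearn.model_selection import train_test_split

# Stand-ins for the vectorized email bodies (X) and their labels (y).
X = [[i] for i in range(100)]                             # 100 "emails"
y = ["spam" if i % 5 == 0 else "ham" for i in range(100)]

# Hold out 20% of the rows for testing, matching test_size=0.20 in the task.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
```

With 100 rows and test_size=0.20, exactly 20 rows end up in the test set and 80 in the training set; random_state just makes the random split repeatable.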
Model training. Now that we have the dataset ready, the next step is to choose a text classification model and use it to train on the given dataset. Some commonly used text classification models are explained below: Naive Bayes classification, SVM, logistic regression, and decision trees. These are all classification models used on text-based datasets, and in this task we will be using Naive Bayes, so let's dive straight in. Naive Bayes is a statistical method that uses the probability of certain words appearing in spam and ham emails to determine whether a new email is spam or not. Let's break down how this actually works. Say we have a bunch of emails, some labeled as spam and some as ham, just like this dataset. The Naive Bayes algorithm learns from these emails: it looks at the words in each email and calculates how frequently each word appears in spam or ham emails. For instance, words like "free", "win", "offer", "lottery" may appear more in spam emails. If we take a look at this dataset, we can see that in spam there's a word like "award"; maybe "award" is used frequently across multiple spam messages, so the model will remember that. The Naive Bayes algorithm calculates the probability of an email being spam based on the words it contains. When a model trained with Naive Bayes gets a new email that says, for example, "win a free toy now", it reasons from its training: "win" often appears in spam, so this increases the chance of the email being spam; "free" is also very common in spam, further increasing the spam probability; "toy" may be neutral, often appearing in both spam and ham. After considering all the words, it calculates the overall probability of the email being spam or ham. If the calculated probability of spam is higher than that of ham, the algorithm classifies the email as spam; otherwise it's classified as ham.
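That word-frequency reasoning can be sketched in a few lines. This is a deliberately simplified toy (made-up word counts, naive per-class smoothing, equal priors), not what scikit-learn's MultinomialNB does internally, which works with log-probabilities, but the intuition is the same:

```python
from collections import Counter

# How often each word appeared in (hypothetical) training emails, per class.
spam_counts = Counter("win free prize free offer win free".split())
ham_counts = Counter("meeting lunch team project meeting office".split())

def class_score(words, counts, prior=0.5):
    """Prior times the product of (toy) Laplace-smoothed per-word probabilities."""
    total = sum(counts.values())
    vocab = len(counts)  # simplified: smooth over this class's vocabulary only
    score = prior
    for w in words:
        score *= (counts[w] + 1) / (total + vocab)
    return score

new_email = "win a free toy now".split()
spam_score = class_score(new_email, spam_counts)
ham_score = class_score(new_email, ham_counts)
label = "spam" if spam_score > ham_score else "ham"
```

Because "win" and "free" were frequent in the spam counts, the spam score comes out higher and the toy classifier labels the email spam, exactly the reasoning described above.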
Let's use Naive Bayes to train the model as explained below. So that's how Naive Bayes works; let me execute these commands and then break them down step by step. In the next step we import MultinomialNB, which lives in sklearn.naive_bayes, and assign it to the variable clf, the classifier: clf will be our Naive Bayes model and will be used for training. Next, we use it to fit (train) on the two variables that contain the numeric representation of the email bodies and the classification, spam or ham. It looks at this dataset, learns from the numeric representation, which words occur with what frequency, and which words were more frequent in spam, so that it remembers. Let me execute this: it is learning from what we are giving it as input. Next comes the fifth step: model evaluation.
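The training step just performed can be sketched end-to-end on toy data. The rows below are made-up token counts, as CountVectorizer would produce, with a hypothetical four-word vocabulary:

```python
from sklearn.naive_bayes import MultinomialNB

# Toy token-count rows and their labels.
# Hypothetical vocabulary (columns): ["free", "win", "meeting", "lunch"]
X_train = [
    [2, 1, 0, 0],   # "free free win"         -> spam
    [1, 2, 0, 0],   # "free win win"          -> spam
    [0, 0, 2, 1],   # "meeting meeting lunch" -> ham
    [0, 0, 1, 2],   # "meeting lunch lunch"   -> ham
]
y_train = ["spam", "spam", "ham", "ham"]

clf = MultinomialNB()      # Laplace smoothing (alpha=1.0) by default
clf.fit(X_train, y_train)  # learn per-class word frequencies

prediction = clf.predict([[3, 0, 0, 0]])[0]  # an email that is mostly "free"
```

Since "free" only ever appeared in the spam rows, the model classifies the new count vector as spam.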
In model evaluation, we check whether our model predicts correctly, so now we provide it with the test dataset. After training, it's essential to evaluate the model's performance on the test set to check its predictive power. This gives you metrics such as accuracy, precision, and recall. Let me execute these commands, then we'll look at the output and break it down. I'll execute with Shift+Enter. Okay, let me break it down: we import classification_report from sklearn.metrics, then ask our classifier clf to predict on the X_test dataset, which is the data we split off before, and then we print the classification report on y_test versus y_pred, what the model actually predicted. This is the output; let me read it out. For ham, when the model predicts it's not spam, it is correct 99% of the time; for spam, when the model says it's spam, it is correct 90% of the time. For recall, the model captures 99% of the non-spam (ham) messages, and it captures 96% of the actual spam. F1 is the balance between those two. The support column shows we have 789 ham messages in the test dataset and 102 spam messages, and these are the averages. So this is the output: our model predicted on the test dataset. What we have done: we first trained our model, which is saved in clf, and then it predicted on the test dataset that we provided. Each metric is explained below.
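Precision and recall, as read out above, can be made concrete with hypothetical predictions for ten test emails (1 = spam, 0 = ham; the numbers here are invented for illustration, not the room's results):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical results for ten test emails (1 = spam, 0 = ham).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # four emails really are spam
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # the model flagged four as spam

# Precision: of the emails flagged as spam, how many truly were spam? (3 of 4)
precision = precision_score(y_true, y_pred)
# Recall: of the emails that truly were spam, how many were caught? (3 of 4)
recall = recall_score(y_true, y_pred)
```

Both come out to 0.75 here; in the room's report, 0.90 precision for spam means 90% of the emails the model flagged as spam really were spam, and 0.96 recall means it caught 96% of the actual spam.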
Now it's time to test our model: we have evaluated it, so let's put it to work. How? We are going to provide it with some random email text that we think may be spam or ham. First we need to transform the message, then provide it to our classifier, and then we'll print out the prediction. So, step number six. I'll press Shift+Enter, and it says it's spam. Why? Because, looking at it, the text says "today's offer", "claim" this and that; the model has calculated, based on the probability of words like "offer", "claim", "worth", "discount", "vouchers", which appear more often in spam than in normal messages. There's a chance of a false positive, but let's leave that for now.
Okay, so that's our task. What's next? McSkidy is happy that a workable spam detector model has been developed. She has provided us with some test emails in the file test_emails.csv and wants us to run the prepared model against these emails to check our results. If we look, we have a test emails file which contains only messages. It says to update the following code to include the test emails file and run the trained model against those emails. So we add test_emails.csv; okay, the test file has been loaded, and we need to change this as well: it should be test_data, because that's the variable we are loading this CSV file into. Let's load it and print the top five values; we only have the Message column, which contains the emails. Now we transform our messages into a numeric matrix and then apply the prediction: we use the vectorizer to transform the test data's Message column, and then ask our trained model to predict on this dataset. Let's run this and try the result out. Excellent: we have got predictions for those messages (those email bodies); here is the message, and here is spam or ham. So that's it, we have got the result; now it's time for the conclusion.
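The step just performed, running the trained model on brand-new emails, hinges on one detail: the new messages go through transform (not fit_transform), so they are mapped into the same vocabulary the model was trained on. A sketch with made-up training and test messages:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train on a toy labeled set (hypothetical data).
train_msgs = ["free prize win now", "claim free offer",
              "team meeting today", "lunch at noon"]
train_labels = ["spam", "spam", "ham", "ham"]
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_msgs)   # fit vocabulary + vectorize
clf = MultinomialNB().fit(X_train, train_labels)

# New, unlabeled emails: transform() reuses the SAME fitted vocabulary.
test_msgs = ["win a free prize", "meeting at noon today"]
X_new = vectorizer.transform(test_msgs)          # NOT fit_transform here
predictions = clf.predict(X_new)
```

Calling fit_transform on the new emails instead would build a different vocabulary with different column indices, and the model's learned word frequencies would no longer line up with the input columns.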
That's it from the task. From the practical point of view, we have to consider the following points to ensure the effectiveness and reliability of the model. There are many things to do; some of them are mentioned here: continuously monitor the model's performance on test data or in a real environment; collect feedback from users, particularly regarding false positives; use this feedback to identify the model's weaknesses and the areas to improve; and deploy the model into production. So there are a couple of steps that come next: after we have prepared and validated the model, the next step is to deploy it in production.
So let's solve the questions. What is the key first step in the machine learning pipeline? That should be very simple: it's data collection. Submit. Okay. Which data pre-processing feature is used to create new features or modify existing ones to improve the model's performance? Let's scroll up: these are the techniques, and if we read them out, it's feature engineering, because it's used to create new features or modify existing ones to improve model performance, so that should be the second answer. Excellent. During the data splitting step, 20% of the dataset was split for testing; what is the weighted average precision of spam detection? Let's find out: the weighted average is 0.98. The next question is: how many of the test emails are marked as spam? I think it's two or three, if I'm not wrong; okay, it's one, two, three: three. And one of the emails that is detected as spam contains a secret code; what's the code? These are the spam emails; let's look at them, and one of them should contain the flag, the secret code. Okay, this is the secret code: it says "I hate best Festival". Let me enter that.
Excellent. So that's it; if you enjoyed this room, please check out the other phishing modules as well. We have completed the task, congratulations. Just as a final note, this is not the end: we still need to improve our model's performance. How? There are a couple of options: we can change the percentage of the test data from 20% to, say, 30% to see how our prediction model performs, or we can also increase the size of the dataset. Because we are in a testing environment we can use a limited dataset, but the larger the dataset, the better the chance of fewer false positives. That's it. I hope you enjoyed this little task, and see you around. Thank you so much.