Building a Plagiarism Detector Using Machine Learning | Plagiarism Detection with Python

Artificial intelligence
28 Jul 2024 · 54:04

Summary

TL;DR: This video outlines the development of a plagiarism detector using natural language processing. It covers what plagiarism is, showcases a user interface, and details the project's setup from scratch. Key aspects include libraries such as NLTK and scikit-learn, data preprocessing, model training with several classifiers, and evaluation metrics. The video also walks through deploying the model with Flask, building a user interface, and checking the model's behavior with test inputs, concluding with suggestions for expanding the dataset and developing the project further.

Takeaways

  • 📝 The script outlines a project for creating a plagiarism detector using natural language processing (NLP) techniques.
  • 🔍 It defines plagiarism as the unauthorized use of someone else's work, ideas, or intellectual property without proper attribution or permission.
  • 🛠️ The project uses the NLTK library for NLP tasks, and scikit-learn classifiers such as Logistic Regression, Random Forest, and Naive Bayes for classification.
  • 📊 The script demonstrates the use of TF-IDF vectorizer for feature extraction from textual data, which is crucial for any NLP project.
  • 📚 It explains the preprocessing steps for text data, including the removal of punctuation, lowercasing, and elimination of stop words.
  • 📈 The project involves training a model and evaluating it using metrics like accuracy score, precision, recall, F1 score, and confusion matrix.
  • 📝 The script provides insights into handling data distribution and the importance of having a balanced dataset for training the model.
  • 💻 The tutorial covers deploying the model using a Flask web framework, creating a user interface for input, and displaying the detection results.
  • 🔧 The importance of matching the scikit-learn version used in training with the one in the deployment environment to avoid mismatch errors is highlighted.
  • 🌐 The script describes creating an HTML form for user input and using CSS for styling the web interface to make it more attractive.
  • 🔑 The final takeaway emphasizes the need for user support through likes, comments, and subscriptions for the channel, indicating the educational nature of the content.

Q & A

  • What is the main objective of the project described in the script?

    -The main objective of the project is to create a plagiarism detector using natural language processing techniques.

  • What is plagiarism according to the script?

    -Plagiarism is the act of using someone else's work, ideas, or intellectual property without proper attribution or permission.

  • What are the steps involved in creating the plagiarism detection model?

    -The steps include understanding plagiarism, creating a user interface, importing necessary libraries, loading and preprocessing the dataset, feature extraction, model training, evaluation, and deployment.

  • Which libraries and tools are mentioned for natural language processing tasks?

    -The libraries and tools mentioned include NLTK for natural language processing tasks, pandas for loading the dataset, and scikit-learn for machine learning classifiers and metrics.

  • What is the importance of removing stop words in an NLP project?

    -Removing stop words is important because they carry little meaning on their own; leaving them in adds noise that can reduce the effectiveness of text analysis or processing.

  • What is the role of TF-IDF vectorizer in the plagiarism detection project?

    -The TF-IDF vectorizer is used for feature extraction, converting textual data into a numerical format that can be understood by machine learning models.

  • How is the model evaluated in the script?

    -The model is evaluated using accuracy score, classification report for precision, recall, and F1 score, and confusion matrix to check for misclassifications.

  • What are the different machine learning classifiers used in the project?

    -The classifiers used include Logistic Regression, Random Forest Classifier, Multinomial Naive Bayes, and Support Vector Classifier.

  • Why is it necessary to save the trained model and vectorizer?

    -It is necessary to save the trained model and vectorizer to avoid retraining for new inputs and to facilitate easy deployment and integration into production systems.

  • What is the purpose of creating a user interface for the plagiarism detector?

    -The purpose of creating a user interface is to allow users to input text and receive feedback on whether the text is plagiarized, making the model accessible and user-friendly.

  • How is the Flask framework used in the deployment of the plagiarism detector?

    -The Flask framework is used to create a web application that receives user input, processes it through the plagiarism detection model, and displays the results on a webpage.

Outlines

00:00

🔍 Introduction to Plagiarism Detection Project

The speaker introduces a plagiarism detection project using natural language processing (NLP). They explain the concept of plagiarism and demonstrate the project's user interface, showing how it detects copied text. The speaker also outlines the project's technical requirements, including the NLTK library, machine learning classifiers, and evaluation metrics. They mention the need for feature extraction using TF-IDF vectorizer and the importance of cleaning text data.

05:02

📚 Understanding the Dataset and Preprocessing

The speaker discusses the structure of the plagiarism dataset, which includes source text, plagiarized text, and labels indicating plagiarism status. They emphasize the need for data preprocessing, such as removing punctuation, converting text to lowercase, and eliminating stopwords, to prepare the data for the NLP model. The dataset is acknowledged to be small and potentially imperfect, and the speaker offers to share it for further enhancement.

10:06

🤖 Feature Extraction and Model Training

The speaker explains the process of converting text data into a numerical format using TF-IDF vectorization, which is essential for machine learning models. They describe the model training process, starting with logistic regression, and then moving on to random forest and other classifiers. The importance of model evaluation using accuracy scores, classification reports, and confusion matrices is highlighted.

15:06

🏗️ Model Selection and Deployment

The speaker compares the performance of different machine learning models, such as logistic regression, random forest, naive Bayes, and support vector machines, to select the best model for plagiarism detection. They discuss the process of saving the trained model and vectorizer using the pickle library for deployment, ensuring that the model can be reused without retraining.

20:08

🛠️ Setting Up the Flask Application

The speaker provides a step-by-step guide to setting up a Flask application for the plagiarism detection system. They explain the need for creating a virtual environment, installing Flask and the correct version of scikit-learn, and structuring the project files. The speaker also details the creation of HTML templates for the user interface and the initial setup of the Flask app.

25:10

🎨 Designing the User Interface

The speaker focuses on designing an attractive user interface for the plagiarism detector using HTML and CSS. They describe adding a title, form elements for text input, and a submit button. The speaker also discusses the importance of styling the page using CSS to make it visually appealing and user-friendly.

30:11

🔌 Integrating the Backend with the Frontend

The speaker explains how to integrate the backend logic with the frontend interface using Flask routes and templates. They demonstrate how to handle form submissions, process user input, perform predictions using the loaded model, and display the results on the webpage. The speaker also emphasizes the importance of matching the scikit-learn version between the development and production environments to avoid errors.

35:13

📝 Finalizing the Plagiarism Detection System

The speaker wraps up the project by discussing the final steps, including creating a function to handle user input, vectorize the text, and return the plagiarism detection result. They also mention the need to test the system with various texts to ensure its accuracy and reliability. The speaker encourages viewers to support the channel through likes, comments, and subscriptions.

Keywords

💡Plagiarism

Plagiarism refers to the act of using someone else's work, ideas, or intellectual property without proper attribution or permission. In the context of the video, it is the main subject of the project being discussed, which is a plagiarism detector. The script mentions that plagiarism can be detected when a user inputs text that has been copied from another source without proper citation.

💡Natural Language Processing (NLP)

Natural Language Processing is a subfield of artificial intelligence that focuses on the interaction between computers and humans using natural language. The video script discusses an NLP project, which is a plagiarism detector that processes and analyzes text to determine if it is original or copied.

💡User Interface

A user interface (UI) is the point of interaction between a user and a program or device. In the script, the creation of a user interface for the plagiarism detector is mentioned, which allows users to input text and receive feedback on whether the text is plagiarized.

💡Machine Learning Classifiers

Machine learning classifiers are algorithms that help in classifying data into different categories. The script discusses several classifiers such as logistic regression, random forest, and support vector machines, which are used to train the plagiarism detector model to differentiate between plagiarized and non-plagiarized text.
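
The classifiers named above can be compared with a few lines of scikit-learn; this is an illustrative sketch on a tiny invented corpus, not the video's actual dataset or results.

```python
# Comparing the classifiers mentioned in the video on a toy corpus.
# The texts and labels below are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

texts = ["copied passage about butterflies", "an original note on yoga",
         "copied article on rainforests", "an original travel diary entry"]
labels = [1, 0, 1, 0]  # 1 = plagiarized, 0 = not plagiarized

X = TfidfVectorizer().fit_transform(texts)
for clf in (LogisticRegression(), RandomForestClassifier(),
            MultinomialNB(), SVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, "training accuracy:", clf.score(X, labels))
```

All four classifiers share the same `fit`/`predict` interface, which is what makes this kind of side-by-side comparison cheap in scikit-learn.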

💡TF-IDF Vectorizer

TF-IDF stands for Term Frequency-Inverse Document Frequency and is a numerical statistic that reflects how important a word is to a document in a collection or corpus. The script mentions using a TF-IDF vectorizer for feature extraction in the plagiarism detection project, which converts text data into numerical format that can be understood by machine learning models.

💡Feature Extraction

Feature extraction is the process of obtaining useful information from raw data. In the context of the video, feature extraction is crucial for converting textual data into a format that can be analyzed by the machine learning model, with the TF-IDF vectorizer playing a key role in this process.

💡Binary Classification

Binary classification refers to a classification problem where the output has exactly two classes. The script notes that the plagiarism detection model solves a binary classification problem, the two classes being plagiarized and not plagiarized.

💡Confusion Matrix

A confusion matrix is a table used to describe the performance of a classification model. The script discusses the importance of a confusion matrix in evaluating the model's performance, showing how many instances were correctly or incorrectly classified.

💡Accuracy Score

Accuracy score is a measure of a model's performance, defined as the ratio of correctly predicted observations to the total observations. The script mentions using the accuracy score to evaluate the plagiarism detector model's performance on the test data.
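
The three evaluation tools the video relies on can be tried on made-up predictions; the labels below are invented purely to show the API:

```python
# Accuracy, confusion matrix, and classification report on toy predictions
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]   # actual labels
y_pred = [0, 1, 1, 1, 0, 0]   # model predictions (invented)

print(accuracy_score(y_true, y_pred))        # 4 of 6 correct -> ~0.667
print(confusion_matrix(y_true, y_pred))      # rows = actual, columns = predicted
print(classification_report(y_true, y_pred)) # precision / recall / F1 per class
```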

💡Flask Framework

Flask is a lightweight web framework for Python that allows for the creation of web applications. The script discusses using the Flask framework to create a simple user interface for the plagiarism detector, allowing users to input text and receive results through a web browser.
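
The Flask wiring described above can be sketched as below. This is a self-contained sketch, not the exact app from the video: the real app loads a pickled model and vectorizer and renders an HTML template file, while here an inline template and a placeholder rule stand in for them.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Inline stand-in for the video's templates/index.html
PAGE = """
<h1>Plagiarism Detection App</h1>
<form method="post">
  <textarea name="text" rows="6" cols="60"></textarea><br>
  <button type="submit">Detect</button>
</form>
<p>{{ result }}</p>
"""

@app.route("/", methods=["GET", "POST"])
def index():
    result = ""
    if request.method == "POST":
        text = request.form["text"]
        # Real app: vector = tfidf_vectorizer.transform([text])
        #           pred = model.predict(vector)[0]
        # Placeholder rule so this sketch runs without the trained model:
        pred = 1 if "butterfly" in text.lower() else 0
        result = "Plagiarism Detected" if pred else "No Plagiarism Detected"
    return render_template_string(PAGE, result=result)

# To serve locally: app.run(debug=True)
```

Swapping the placeholder rule for the loaded model and vectorizer gives the flow the video implements.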

💡Pickle Library

The pickle library in Python is used for serializing and de-serializing Python object structures. In the script, the pickle library is mentioned for saving the trained model and the TF-IDF vectorizer, and later for loading them in the Flask application for deployment.
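
A sketch of the save-and-reload cycle described above; the file names `model.pkl` and `vectorizer.pkl` are assumptions, and the two-sentence corpus is illustrative only.

```python
# Saving and re-loading a trained model and vectorizer with pickle.
# File names model.pkl / vectorizer.pkl are assumptions for illustration.
import os
import pickle
import tempfile
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["copied passage here", "an original sentence"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, [1, 0])

folder = tempfile.mkdtemp()
with open(os.path.join(folder, "model.pkl"), "wb") as f:
    pickle.dump(model, f)
with open(os.path.join(folder, "vectorizer.pkl"), "wb") as f:
    pickle.dump(vectorizer, f)

# Later (e.g. inside the Flask app) the objects come back unchanged
with open(os.path.join(folder, "model.pkl"), "rb") as f:
    loaded_model = pickle.load(f)
with open(os.path.join(folder, "vectorizer.pkl"), "rb") as f:
    loaded_vectorizer = pickle.load(f)

print(loaded_model.predict(loaded_vectorizer.transform(["copied passage here"])))
```

Saving both objects matters: a model without its matching vectorizer cannot turn new input text into the feature space it was trained on.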

Highlights

Introduction to a plagiarism detector project using natural language processing.

Explanation of what constitutes plagiarism in the context of using someone else's work without proper attribution.

Demonstration of a user interface for inputting text to be checked for plagiarism.

Showcasing the output of the plagiarism detector model indicating 'plagiarism detected'.

Use of the TF-IDF vectorizer for feature extraction in NLP tasks.

Importance of cleaning textual data by removing punctuation, converting to lowercase, and eliminating stop words.

Utilization of machine learning classifiers such as Logistic Regression, Random Forest, and Naive Bayes for the plagiarism detection model.

Discussion on the importance of model evaluation using accuracy score, classification report, and confusion matrix.

The process of training the plagiarism detection model using a dataset with labeled examples.

Addressing the need for a larger dataset for a more detailed plagiarism system.

Instructions on how to preprocess text data for NLP projects, including custom Python functions.

Conversion of textual data into numerical format for machine learning models using TF-IDF vectorization.

Splitting the dataset into training and testing sets for model training and evaluation.

Comparison of different machine learning models' performance in detecting plagiarism.

Final selection of the best-performing model for deployment in a plagiarism detection system.

Explanation of how to save and load a trained model and vectorizer using the pickle library.

Development of a function to detect plagiarism in user-provided text for deployment.

Instructions for setting up a Flask application for the web interface of the plagiarism detector.

Importance of matching the scikit-learn version between the training environment and the production environment.

Creation of a simple user interface using HTML and CSS for the plagiarism detector application.

Integration of the backend logic with the frontend interface to create a complete web application.

Final demonstration of the plagiarism detector web application in action.

Call to action for viewers to like, comment, and subscribe for support of the channel.

Transcripts

00:02

So we have another natural language processing project: a plagiarism detector. First we need to understand what plagiarism is, then I'll show a couple of outputs, because we have also created this user interface. Plagiarism is an act in which you use someone else's work, ideas, or intellectual property without proper attribution or permission.

00:50

So we have this user input: "Researchers have discovered a new species of butterfly in the rainforest." Now, I have definitely copied this from someone else's article, so yes, this could be plagiarism. Let's see what the model says. The moment I click on this button, it gives me the output: "Plagiarism Detected". Let me try one more input: "Practicing yoga enhances physical flexibility." Clearly, this may not be plagiarized text, because it could be someone's own idea or creativity. Let's see: "No Plagiarism Detected". So yeah, this is the model. Now let's start the project from scratch.

01:50

So let's do that quickly. First we need the NLTK library, which is used for natural language processing tasks, and within NLTK we have a couple of pretrained resources; with nltk.download you can fetch them directly in your Jupyter notebook. After that you need to import pandas, as pd, for loading the dataset. You also need to import the string module in Python; it will help us clean the text, because we are dealing with textual data and this is an NLP project. And from nltk.corpus we need to import stopwords, because we have to remove stop words, i.e. irrelevant words.

03:04

Basically, for this project we will use a handful of machine learning classifiers: Logistic Regression, Random Forest Classifier, and so on. Let me quickly import LogisticRegression from sklearn.linear_model. From sklearn.model_selection you have to import train_test_split; it will help us split the dataset into different sets, i.e. testing and training. Once you train the model, you need to evaluate it using a couple of metrics, so from sklearn.metrics we first import accuracy_score, which helps us calculate the accuracy, and then classification_report, which gives us precision, recall, and F1 score for each class. We have two classes, plagiarized and not plagiarized, so basically it's a binary classification problem. We also need confusion_matrix, so that we can check how many mistakes our model makes for each class, and how many instances it classifies correctly.

04:28

And the final thing, which is very important for any NLP project, is feature extraction. For feature extraction we can use different tools, but we will use the TF-IDF vectorizer, which I mostly recommend for any NLP project: from sklearn.feature_extraction.text you need to import TfidfVectorizer. I'll explain all of this a bit later, but yes, these are the imports. So let's run it. There is some syntax issue: it should be "from nltk.corpus import stopwords". Let me rerun.
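
Collected into one clean cell, the imports narrated in this part look like the following; the commented line is the one-time NLTK download shown in the video, and the full set of classifiers is the one named across the whole project.

```python
import nltk
# nltk.download('stopwords')   # one-time download of the stop-word lists
import pandas as pd
import string
from nltk.corpus import stopwords

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
```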

05:17

Now we need to import the dataset. I'll give it the name data, use the pandas read_csv function, pass my data file, and show the first few rows. This is the dataset, and you need to understand its structure and format. You can clearly see it has three columns: source text, plagiarized text, and label. Source text and plagiarized text contain the documents of textual data, so these are our input features, and label contains two classes, 0 and 1, where 1 represents plagiarized text and 0 represents not-plagiarized text.

06:06

Let me quickly show you the distribution of the label feature; for a distribution check you can use the value_counts function. You can clearly see we have an equal distribution, so we don't need to balance this dataset. One more thing that is important for everyone using this project: this dataset was created by me, so there might be mistakes, and it's a very small dataset. Let me show you the shape: it's not big data, I think only about 370 records. So if you want a very detailed plagiarism system, you need to enlarge this dataset; you can work on it further. I'll share the data, the entire notebook code, and the development with you.
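
The loading-and-inspection steps just described can be sketched with a toy stand-in for the dataset; the real file is the author's own ~370-row CSV, and the file name below is an assumption.

```python
import pandas as pd

# Toy stand-in with the same three columns as the video's dataset
data = pd.DataFrame({
    "source_text": [
        "Researchers have discovered a new species of butterfly.",
        "Practicing yoga enhances physical flexibility.",
    ],
    "plagiarized_text": [
        "Scientists found a previously unknown butterfly species.",
        "Doing yoga daily improves how flexible the body is.",
    ],
    "label": [1, 0],   # 1 = plagiarized, 0 = not plagiarized
})
# In the video: data = pd.read_csv("dataset.csv")  # file name is an assumption

print(data.head())                    # first few rows
print(data["label"].value_counts())   # class distribution (should be balanced)
print(data.shape)                     # (rows, columns)
```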

07:10

The next thing, which is very important in any natural language processing project, is preprocessing. In this textual data we need to remove a couple of things, so let's do that one by one. You can use different techniques and tools, but let's do it with our own custom Python function. Let me name it preprocess_text; this function will take a text, so let me pass a parameter by the name of text.

07:52

First we will remove punctuation, because punctuation marks are characters that we don't need for any model, so we will remove all punctuation from each text using the string library. But first let me call this preprocess_text function and pass it some text; let me add some punctuation and special characters, something like "This is my Text, to use for dummy Test!".

08:38

The first thing I'd like to do is remove the punctuation: text = text.translate(...), using the translate function, and inside it str.maketrans to build the translation table. You don't need to worry about or memorize all of this, but this is the flow. For now let me just return the text, and you can clearly see we have removed the punctuation marks from the text.

10:01

Second step: let's lowercase the text. As you can see, in this text we have capital letters, so the next step is converting to lowercase. Lowercasing is very simple; you can just call the lower function on the text, and now you can see everything is in lowercase.

10:35

The third and final step, which is again important, is removing stop words. You could skip the earlier parts, but you cannot skip this third step in any NLP project, because in the English language a sentence contains a lot of stop words — words like "is", "from", "my", "this": pronouns and prepositions in English grammar. All these words are stop words, and you need to remove them.

11:12

For stop-word removal we already imported the stopwords class from nltk.corpus. You can call stopwords.words and specify the language, because internally it supports, I think, five or six languages, like German, French, and English. I'm dealing with the English language, so I'll pass 'english' here. I need unique stop words, so I call the set function in Python, which gets all the unique stop words, and I'll store that in a variable named stop_words, because I'll use this variable on my data to remove all the stop words from each text.

12:04

Now what you can do is use a list comprehension, or you could use loops, but I mostly use a loop in a single line. First I'll extract each word from the text, but we need to tokenize the text, and for that you can use the split function. Then I check that the word is not located in the stop words I have defined. So basically, when a text is passed here, it first has all its punctuation removed, then it is lowercased; here I define my stop words, and here I run a loop over the text: first I split it, tokenizing it, then it takes the words one by one. For each word I check whether it is in the stop words; if it is a stop word, it is skipped and does not pass through, while a word like "text", which is not a stop word, passes through, and I store it again. But this returns a list, and I don't want a list, I want a string, so I join the whole thing back into a string. Let me run it: now you can see we keep only the important words, like "text use dummy test" — all the stop words have been removed.
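
The preprocessing function built up in this part can be written out as below. One assumption: to keep the sketch self-contained (no NLTK download), a small hard-coded stop-word set stands in for the video's set(stopwords.words('english')).

```python
import string

# Stand-in for set(stopwords.words('english')); the real set is much larger
STOP_WORDS = {"this", "is", "my", "to", "for", "a", "the", "of"}

def preprocess_text(text):
    # 1. remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 2. lowercase
    text = text.lower()
    # 3. drop stop words, then join the surviving tokens back into a string
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)
    return text

print(preprocess_text("This is my Text, to use for dummy Test!"))
# → text use dummy test
```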

14:00

Now let's apply this function to our dataset. Let me add a cell here. I have two input columns, source text and plagiarized text, so on data['source_text'] I call the apply function and simply pass my custom preprocess_text function. (Why is it not working? Ah, it should be spelled preprocess_text; let me run it again.) I'll store the cleaned text back in the same column, our input feature, and let me copy-paste the same for the plagiarized text column and run it.

15:06

Now if I show the dataset, you can clearly see the first record at index zero: "researchers discover new species", and the plagiarized text, "scientists found previously unknown". If I check the original text, "researchers have discovered...", you can see it had stop words like "a" and "have", but after cleaning the data and removing all the stop words, those stop words are gone; we are left with the important words, "researchers discovered new", and so on. Now we know these two features are the input features, so we can combine the two of them, and label is our target.

16:03

We have to convert this textual data into numerical format, because any machine learning model can only understand numerical data, not textual data. So here we have to extract the features and convert the textual data into numerical format, and for that, as I already mentioned, we'll use the TF-IDF vectorizer. I'll create an object of this TfidfVectorizer, which I have already imported. Now I can simply convert all my text to numerical format by calling the fit_transform function. Here I have to pass both features, so I take data['source_text'] and concatenate it, with a space, with data['plagiarized_text']. And yeah, in a single line of code we have converted it, and I'll store the result in X. So basically we have now converted all the textual data into numerical format using the TF-IDF vectorizer. Now let's create y, our target, and we know that it's the label column. Now we have the input features and the target feature.
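
Put together, the feature-extraction step reads as follows, again with a toy two-row frame standing in for the real cleaned dataset:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the cleaned dataset
data = pd.DataFrame({
    "source_text": ["researchers discovered new butterfly species",
                    "practicing yoga enhances flexibility"],
    "plagiarized_text": ["scientists found unknown butterfly species",
                         "yoga improves physical flexibility"],
    "label": [1, 0],
})

tfidf_vectorizer = TfidfVectorizer()
# Concatenate the two text columns with a space, as narrated, then vectorize
X = tfidf_vectorizer.fit_transform(
    data["source_text"] + " " + data["plagiarized_text"])
y = data["label"]
print(X.shape)   # rows = documents, columns = vocabulary size
```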

17:41

We only need to do a train-test split, and for that we already have a function. First let me create four variables, X_train, X_test, y_train, y_test, and I'll use the train_test_split function, passing my input features and target feature. Here we have to specify the test size: I want to keep only 20% of the data for testing and the remaining 80% for the training set. I also want shuffling, so you can pass random_state; I mostly use 42, but you can use any number, it doesn't matter. Now you have the train-test split as well.

18:27

Now let's train our first model, logistic regression, which I think we already imported. We just need to create a LogisticRegression object, then train it by calling the fit method on the training set: the input features and the corresponding target feature. Now let's test the model, but for that we need to make predictions on the test data. These things are very basic; you have learned a lot of this in my projects, because I've done a lot of projects on this channel, so I'm not going to repeat all of it again. But yes, here we are training the model, and here we are doing the prediction, or testing: you can call the predict method and pass your test input features.

19:24

Now let's calculate the accuracy: print the accuracy of the model using the accuracy_score method we already imported. You pass it two things, your y_test and the predictions, so basically we are comparing the two. I'll also print the confusion matrix and the classification report.
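
The split-train-predict-evaluate sequence narrated here can be sketched end to end; the ten-sentence corpus below is invented for illustration, so the printed scores say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Tiny invented corpus; the real project uses ~370 labeled pairs
texts = ["copied species article text", "original yoga thought",
         "copied butterfly article", "original creative writing",
         "copied rainforest report", "original personal note",
         "copied research passage", "original travel diary",
         "copied news story", "original poem draft"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)   # 80/20 split, as in the video

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```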

play20:07

information I have a very beautiful

play20:10

playlist and very knowledgeable playlist

play20:13

Cy learn tutorial you can watch that you

play20:16

will understand a lot about all the

play20:18

stuffs that we use in any machine

play20:20

learning project using Psy lar and

play20:23

finally I'll calculate confusion Matrix

play20:26

so this is confusion Matrix and let me

play20:28

run

play20:29

Now the model is giving good accuracy — honestly, quite good. This is the classification report; let me give you a quick overview (I've explained the mathematics behind each metric — precision, recall, F1 score — in detail in my scikit-learn playlist on this channel, Artificial Intelligence). We have two classes: 0 for not plagiarized and 1 for plagiarized. For class 0 the precision is 0.79, which is good, the recall is 0.86, and the F1 score is 0.87. For now, just remember: the higher the precision, recall, and F1 score for a class, the better your model is doing on it. The same goes for the other class — higher values like 0.90, or this 0.86, mean your model is doing a great job. This is the average accuracy, and you can also calculate the average precision across both classes, and likewise the average recall and F1 score.

play21:36

Now, the confusion matrix is very important. Let me zoom out and rerun so you can clearly see a proper output and understand it. In the confusion matrix we have a 2×2 array because we have two classes. The diagonal holds the correct predictions: this cell is the correct predictions for one class (say 0), and this cell is the correct predictions for the other class (say 1). The off-diagonal cells — above and below the diagonal — are the model's mistakes: here five misclassifications against one class, and here the mistakes against the other. I can't explain it fully here — if I plotted it you would understand better, but I'm not going to do that, because I've already covered the theory and the mathematics behind it. As a general overview: if the diagonal cells have large values, your model is doing its job; if the off-diagonal cells are zeros — say a zero here and a zero here — it means your model didn't misclassify anything, didn't make any mistakes. Here, though, you can see an eight, so the model made some mistakes, and here again a five.

play23:20

Okay, now let me copy-paste this code, because we are applying the same method to a random forest. For the random forest, from sklearn.ensemble you import RandomForestClassifier; then I simply paste the same code and swap in the RandomForestClassifier. One thing: the random forest takes a couple of parameters, like n_estimators — I'll pass 100 — and it also takes random_state, which I'll again set to 42. The rest of the code stays exactly the same; just run the cell. The random forest model's accuracy is 0.79 — decent, but logistic regression was better than this model.
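The random-forest swap can be sketched as follows; the tiny numeric matrix is a stand-in for the real vectorized text features, not the video's data:

```python
from sklearn.ensemble import RandomForestClassifier

# Tiny numeric stand-in for the TF-IDF feature matrix
X_train = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]]
y_train = [0, 1, 0, 1]

# n_estimators = number of trees; random_state=42 makes the runs reproducible
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.predict([[0.95, 0.05]])[0])
```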

play24:35

Let's try Naive Bayes, because whenever you deal with a classification problem on textual data using classical machine learning, two models stand out: support vector machines and Naive Bayes. They tend to perform better than the other models because they are especially well suited to text. So let me paste the same code; we need to import Naive Bayes — again from scikit-learn, this time from the naive_bayes module you import the MultinomialNB class. You might be wondering why I'm not explaining this model: I already have, and you can explore it on this channel. This model performs better than the previous two — you can see the accuracy is 0.86, and the precision, recall, and F1 score for both classes are strong.
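A minimal MultinomialNB sketch on toy text (my own example data, not the video's dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["copied source text", "original idea", "copied article text", "own original thought"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

nb = MultinomialNB()  # works on non-negative counts / TF-IDF weights
nb.fit(X, labels)
print(nb.predict(vec.transform(["copied text"]))[0])
```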

play25:45

This model is good, but let's try a support vector machine. You import SVC — meaning support vector classifier — from sklearn.svm, then copy the previous code and swap in SVC. Now, SVC has two types, linear and non-linear, and we are dealing with linearity here, because we are drawing a linear boundary between our two classes, 0 (not plagiarized) and 1 (plagiarized). So you have to specify the kernel parameter — since we're dealing with linearity I'll use 'linear' — and again pass random_state=42. Let me run it. This model performs better than all the previous models, because we are achieving 0.87. So finally we have to select this model for deployment, for production.
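The linear SVC step can be sketched like this (again on a toy numeric stand-in for the TF-IDF features):

```python
from sklearn.svm import SVC

X_train = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.8], [0.8, 0.2]]
y_train = [0, 1, 0, 1]

# kernel="linear": a straight decision boundary between the two classes
svc = SVC(kernel="linear", random_state=42)
svc.fit(X_train, y_train)
print(svc.predict([[0.9, 0.1]])[0])
```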

play27:16

Let's save this model. Saving the model is very important, because I'm not going to rerun all the previous training code again and again: we have to save this model and the vectorizer for production, for new input. For that you use the pickle library. In pickle you call the dump method and pass it two things: first, the model you are going to save; second, an open(...) call, which itself takes two things — the file name with a .pkl extension, and the mode 'wb', which means write-binary. You have to do the same thing for the vectorizer. You might be thinking: why do we need the vectorizer? Because the user will not enter numerical data — as I already showed you in the introductory part of this video, in the website's user interface the user types textual data, so there we also need the trained TF-IDF vectorizer. Save it too, under its own name. I'm not going to run this, because I already saved both previously — here you can see model.pkl

play28:38

and tfidf_vectorizer.pkl. Now we need to load these two — why? Because we need to use the model and the TF-IDF vectorizer that we saved. For loading you use pickle again, but this time instead of dump you call the load method, and pass it open(...) with the model's file name and the mode 'rb' — read-binary, because this time we are reading, not saving. The same goes for the TF-IDF vectorizer. (I think I didn't run the previous cell, so let me copy the pickle code and paste it here.) So now we have the model and the TF-IDF vectorizer loaded.
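The save-and-load round trip can be sketched with pickle alone; the two dicts below are stand-ins for the trained model and fitted vectorizer so the sketch is self-contained:

```python
import pickle

model = {"name": "svc-stand-in"}               # stand-in for the trained model
tfidf_vectorizer = {"name": "tfidf-stand-in"}  # stand-in for the fitted vectorizer

# dump(obj, file): save in write-binary ("wb") mode
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer, f)

# load(file): read back in read-binary ("rb") mode
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open("tfidf_vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)
print(loaded_model["name"])
```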

play29:30

The final thing in this Jupyter notebook is to create a function — a detection system — and then we will work on the deployment part. I'll create a function that takes the user's input text; in this function I'll do three things: first vectorize the text, then make a prediction with the model, and then return the final output. This is the user input — 'Researchers have discovered a new species of butterfly in the Amazon rainforest' — and I have called the function detect and passed it the input text,

play30:22

so the function receives this input text here. Let's vectorize it: we already loaded the trained TF-IDF vectorizer, and this time we don't call fit_transform — we are not training, just doing a transformation — so you use the transform method, and keep in mind you have to pass the user's input text inside a list. That gives us the vectorized text. Now let's do the prediction and store it in a result variable: call model.predict and pass the vectorized text. I hope you get the flow and logic of this function. Finally we return the result — but predict returns either 0 or 1, so I'll add a single line of logic: return 'Plagiarism Detected' if the result is equal to 1. Don't be confused by the [0] index here: the 0-or-1 answer returned by the model comes back inside a list-like array, so the [0] just unwraps it. So if it equals 1, the function returns 'Plagiarism Detected'; else — meaning it is 0 — we return 'No Plagiarism'. I hope you get the idea. Let me run this — invalid syntax; okay, we don't need a colon here.
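The detection function described above can be sketched end to end; here a tiny logistic-regression model trained inline stands in for the one loaded from the .pkl files (the toy texts and labels are my assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy model/vectorizer standing in for the ones loaded with pickle
texts = ["copied source article text", "my own original idea",
         "copied article", "original thought"]
labels = [1, 0, 1, 0]
tfidf_vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(tfidf_vectorizer.fit_transform(texts), labels)

def detect(input_text):
    # transform (not fit_transform) — the vectorizer is already trained;
    # the text must be passed inside a list
    vectorized_text = tfidf_vectorizer.transform([input_text])
    result = model.predict(vectorized_text)
    # predict returns an array, so result[0] unwraps the single 0/1 prediction
    return "Plagiarism Detected" if result[0] == 1 else "No Plagiarism"

print(detect("copied source article"))
```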

play32:33

Now let me run this, and you can see 'Plagiarism Detected', because it was a plagiarized text. I have another text — I don't think I showed you this one — that has no plagiarism: 'Playing musical instruments enhances creativity.' You can clearly see anyone could write this using his or her own ideas. So let's see: again I pass this input text to the function — no plagiarism. This one about practicing yoga I have already shown, and again we have no plagiarism at all.

play33:15

So that was the Jupyter notebook code; I hope you followed everything from scratch. Now we need to work on the simple user interface using the Flask framework in Python, plus the backend code. But first, one thing: for any project, some students say they face mismatch issues in PyCharm during the deployment part when loading the model. You need to install the same scikit-learn version in the PyCharm environment that you used in the Jupyter notebook, because you trained the model and the TF-IDF vectorizer with that version — so you have to install exactly that version here.

play34:10

So we have PyCharm open. A few things here are important for all the students who mostly get errors related to installation. First, you need to create a virtual environment — I hope you have already learned how to do that in PyCharm or any IDE like VS Code for Python. Now, in this project folder you must have the model.pkl file that we saved from the Jupyter notebook, and tfidf_vectorizer.pkl — you must have these two files. You don't need the Jupyter notebook or the dataset here; you can keep them, no worries, or remove them, but those two files are a must. Then you have to create an app.py file, which is empty for now — in it we will write the Flask framework backend code. Now, Flask (like Django) needs templates for the user interface, for the HTML files, so you have to create a templates directory with exactly that spelling, no mistakes — some students create 'template', but you have to add the plural 's' as well. Within templates you need to create index.html; currently it's empty.

play35:33

So first we will create the user interface. This is the basic boilerplate of any HTML file; the title is 'Plagiarism Detector'. Now, in the body, for now I'll just use an H1 tag, and here I will say: 'I think you are just watching this video and not liking and subscribing to this channel yet.' We will print this on our website, so let's do that from the backend using the Flask framework.

play36:17

Now, again, you need to go to your terminal, because you need to install a couple of things. You can see I'm here and my environment is already created. Run pip install flask — I already installed it, so I'm not going to install it again; otherwise just hit enter and it will be installed. After that you need to install scikit-learn pinned with == to the version you used in the Jupyter notebook while training the model: I used version 1.3.2, so use that same version here as well. Once you do that, you will not get any mismatch error.

play36:57

Now, first we need to import a couple of things from the Flask framework. You need to import Flask (with a capital F), which will help us create the app object; then render_template, which will help us communicate with the user-interface HTML files — it works like a channel of communication, as I'll show a bit later (there's also redirect); and also request, because we have to read the form to take the user's input. Then you create the app object — this is built-in syntax, you will not be able to change it — and you have to run this app inside the Python main guard. If you are a Python programmer you might have seen if __name__ == "__main__": — this is the Python main — and there you run the app, specifying debug=True. This is basic boilerplate syntax you will never change. Between the app object and the Python main you have to implement all the logic: the routes, getting data from the user interface, and passing data back to the user interface to be displayed.

play38:20

In that middle part we will do the logic. First we need to load the two saved files, so I have to import the pickle library as well. First I load the model with pickle's load function, passing open(...) with the model file in read-binary mode, and I will do the same for the vectorizer — let me replicate the line (by pressing Ctrl+D), change it to the vectorizer's .pkl name, and give each a proper variable name. So I have successfully loaded the model and the TF-IDF vectorizer. (Some of my viewers are English speakers — that's why I use a combination of both languages.)

play39:37

Now let's communicate with the index.html file — I mean the user interface. For that you use a route, and for the first route I'll just pass the empty root path, because by default I want to open index.html directly when I run this app. Here you have to give a function with any name, and with the help of return and render_template you can go to that HTML file — as I already mentioned, render_template is used for communication between backend and frontend. So let me right-click and run the app; in the console you will get a URL — just click on it and your app will pop up in your Chrome browser.
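The app.py structure described so far can be sketched like this. It's a minimal sketch, not the video's exact file: the try/except around the pickle loads and the commented-out app.run are only there so the sketch runs (and can be tested) even without the .pkl files or a live server:

```python
import pickle
from flask import Flask, render_template, request  # request is used by the /detect route later

app = Flask(__name__)

# Load the trained model and fitted TF-IDF vectorizer saved from the notebook.
# The try/except is only so this sketch still runs when the .pkl files are absent.
try:
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    with open("tfidf_vectorizer.pkl", "rb") as f:
        tfidf_vectorizer = pickle.load(f)
except FileNotFoundError:
    model = tfidf_vectorizer = None

@app.route("/")  # empty root path: open index.html directly
def home():
    return render_template("index.html")  # looks inside the templates/ folder

if __name__ == "__main__":
    # app.run(debug=True)  # uncomment to start the development server
    pass
```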

play40:22

You can clearly see it — I had zoomed in a lot, so let me zoom out. 'I think you are just watching this video and not liking and subscribing to this channel yet' — let me fix the wording; basically this is a message for all the users who are watching this video. So that's how you create the user interface and the communication, but we will not be displaying this message.

play40:58

What we can do instead: I will use a container here — a div with the class 'container', because I will also apply my own CSS to make it attractive and appealing. Within this container I will use an H1 tag with the title 'Plagiarism Detector'. So if I go to my page and reload it, we now have 'Plagiarism Detector'.

play41:32

After that I will use a form. Its action will be 'detect' — this part is again important for web development; I'll explain it a bit later — and its method will be post. Within the form I will place a textarea where the user will input the text. This textarea takes a couple of attributes: name, which will be 'text' (again important, but I'll explain it a bit later); placeholder, for when you want to show hint text beforehand so the user understands — 'Enter text here...'; and one last attribute, required, which means the field is mandatory.

play42:30

So in the form we have passed two things, action and method. In action we passed the URL path 'detect', because with the help of this action the backend will receive the form — when we implement the backend code you will understand this. And we have two methods, GET and POST; POST is for sending and receiving data between backend and frontend, so here we are passing post, and this form is just for getting input from the user. So let me refresh — this is our textarea. We pass name='text' because when the user inputs text here, I will get that text in the backend — I mean in app.py — with the help of this name.

play43:28

Now, within the same form you need to pass a button. I give it the type 'submit' and the label 'Check for Plagiarism'. Let me quickly show you: so we have this button, but this page is not designed yet — it's not appealing, not attractive — and we have to make it attractive and appealing.

play44:01

Now, for that you can either use Bootstrap and its classes, or use your own CSS — inline CSS, you can call it. On this container and this entire body I'll do a couple of pieces of styling, inside a style block, just by targeting the classes. I have done this CSS already, so let me quickly explain it: within the style block, on the entire body of my page I have set properties like the background color and display: flex — a lot of properties and their corresponding values — and we have done some CSS for the container: I want width 50%, this background, this padding, and I also added a box shadow. Just let me quickly reload — now we have something attractive; you can clearly see the box shadow. Now we have to work on the textarea and the button. I have already done that CSS too, so let me paste it here: this is the CSS for the button, the textarea, and the result (this is the button and this is the textarea — I didn't use the result class yet, but we will see it a bit later). So just let me refresh — yeah, this is the form.

play45:52

When we click 'Check for Plagiarism', that's where we will display the result. But first we need to get this text in app.py. So in app.py I will create another route: app.route with 'detect', matching the 'detect' we put in the form's action, and here you have to give methods equal to either GET or, let's say, just POST. Then you have to define a function — say predict — and within this prediction function we will do the work; or we can give the function a proper name, like detect_plagiarism.

play46:53

Okay, now the first thing we have to do: get the data from the form. It's very simple — we have the request object from Flask; on it you access form and pass the name of that input field, which was 'text'. That fetches the text, and we store it in input_text. Secondly, we have to apply the same two steps as before: tfidf_vectorizer.transform on this input text — I'll give it the name vectorized_text — and finally the prediction, model.predict on the vectorized text. Then we turn the result into a message: 'Plagiarism Detected' if the result equals 1, else 'No Plagiarism'. And finally we send this result to our index.html to be displayed there: you use render_template, go to index.html, and pass result=result.

play49:02

Now let me explain the flow. In this route we first receive the form; then inside the function we get the text the user typed, storing it in input_text; then we do the vectorization and the prediction, create the message according to the 0-or-1 result, and pass that message — either one string or the other — as the result. So finally we have to print this result on index.html.
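The /detect route's flow can be sketched end to end. This is a runnable sketch, not the real app.py: the two stub classes stand in for the pickled model and vectorizer, and render_template_string stands in for the real render_template("index.html", ...) call, so the whole request cycle can be exercised with Flask's test client:

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

class StubModel:  # stand-in for the pickled model so the sketch runs standalone
    def predict(self, X):
        return [1 if "copied" in X[0] else 0]

class StubVectorizer:  # stand-in for the pickled TF-IDF vectorizer
    def transform(self, texts):
        return texts

model, tfidf_vectorizer = StubModel(), StubVectorizer()

@app.route("/detect", methods=["POST"])
def detect_plagiarism():
    input_text = request.form["text"]           # "text" = the textarea's name attribute
    vectorized_text = tfidf_vectorizer.transform([input_text])
    result = model.predict(vectorized_text)     # array-like of 0/1
    result = "Plagiarism Detected" if result[0] == 1 else "No Plagiarism"
    # The real app returns render_template("index.html", result=result)
    return render_template_string("{{ result }}", result=result)

client = app.test_client()
resp = client.post("/detect", data={"text": "copied paragraph"})
print(resp.get_data(as_text=True))
```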

play49:42

On index.html, outside the form, I will use a template block — when you embed Python-like code in HTML, you use this template syntax. First I check if result — whether we actually received a result or not — and then I print it with the {{ result }} syntax. Again, the syntax is built in; you cannot change it. And finally I end the if. This is sometimes called the Jinja template, and {{ ... }} is the printing — like print in Python, but for printing a variable in the page from Flask. (And note: Django and the Jinja template engine are different things.)
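The Jinja snippet's behavior can be checked from Python with render_template_string — the template text below mirrors the {% if %} / {{ }} block just described (the div and its class are the ones used in the page):

```python
from flask import Flask, render_template_string

# The Jinja snippet from index.html: {% if result %} guards, {{ result }} prints
TEMPLATE = """
{% if result %}
<div class="result">{{ result }}</div>
{% endif %}
"""

app = Flask(__name__)
with app.app_context():
    shown = render_template_string(TEMPLATE, result="No Plagiarism")
    hidden = render_template_string(TEMPLATE)  # no result passed: the div is skipped
print("No Plagiarism" in shown)
```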

play50:35

Now I have to run the app again, because I have made a lot of changes. Just click on the URL and try 'hello friend' for now — 'No Plagiarism' — so it's working. Let's make this output a little more attractive: I go to index.html and use this code, because I already have a result class where I've done some CSS — you can see it, like padding and margin-top. Let me refresh and use that earlier text again — the model is working quite amazingly, it has great results: 'No Plagiarism'.

play51:42

Let me also try texts from my dataset — what was it called — using the source_text column. I'll grab one: 'Volunteering fosters community spirit' — I don't know what the rest of it says, but let's see what the model thinks: 'No Plagiarism'. Let's try one more, a random text at index 50 — honeybees communicating through dance, I think; this one could go either way — and the model says it's plagiarized. Let me try one more, at another index: 'The Sahara Desert, the largest hot desert in the world.' Now let's see what the model says: 'Plagiarism Detected' — because this could well be someone else's idea, someone else's hard work, someone else's article text. So yeah, this is the project.

play53:15

And finally, one thing: your support in the form of likes, comments, and subscribing to this channel would be really appreciated. Thank you so much for watching this video. I have shared the entire code with you — the Jupyter notebook code, the app.py code, the dataset, index.html, model.pkl, and tfidf_vectorizer.pkl. If you want to use my model.pkl and TF-IDF vectorizer, then in PyCharm you have to create a virtual environment and install the same scikit-learn version, 1.3.2. So thank you so much for watching, and we will see each other in another amazing upcoming NLP project. Thank you — and finally, again, subscribe to this channel and like this video.