Data Preparation (PART 1) - Building a Netflix Recommendation System
Summary
TLDR: The video is a tutorial on preparing a movie dataset for a recommendation application. It stresses the importance of data preparation, which is said to take up to 90% of a data scientist's time. The project is split into five parts; this first part starts by importing pandas and NumPy and loading the dataset. The speaker then checks for missing values, replaces them with 'unknown', and cleans the data by removing pipe symbols, converting movie titles to lowercase, and stripping an unwanted trailing character from each title. The cleaned data is saved for use in the later tutorials.
Takeaways
- The project is split into five parts; this first part prepares the first dataset for movie recommendation.
- Data preparation is emphasized as the most time-consuming part, taking up to 90% of a data scientist's or machine learning engineer's work.
- The dataset can be downloaded from a provided link or loaded directly in the notebook.
- The notebook uses the Python libraries pandas and NumPy for data manipulation.
- The initial step is to load the dataset and view the first five rows to understand the data structure.
- The dataset contains various features like director name, critic reviews, movie duration, and Facebook likes for directors and actors.
- The speaker checks the shape of the dataset, lists its columns, and selects the features that matter for the analysis.
- The data is cleaned by replacing missing values with 'unknown' and removing the pipe symbols that separate genres.
- Movie titles are converted to lowercase to keep the text consistent.
- The cleaned and prepared data is saved as 'data_1.csv' for future use in the recommendation system.
Q & A
What is the main focus of the tutorial described in the transcript?
-The main focus of the tutorial is to guide users through the process of data preparation for building a movie recommendation application.
How much time is typically spent on data preparation in a data science project according to the speaker?
-The speaker mentions that data scientists or machine learning engineers spend almost 90% of their time on data preparation.
What is the first step the speaker takes in preparing the dataset?
-The first step in preparing the dataset is to import the necessary libraries, specifically pandas and NumPy.
What does the speaker suggest doing with missing values in the dataset?
-The speaker suggests replacing missing values with the word 'unknown'.
What specific data is the speaker using from the 'movie_metadata.csv' file?
-The speaker is using the director name, the three actor names, the genres, and the movie title from the 'movie_metadata.csv' file.
How does the speaker handle the pipe '|' symbol found in the genre data?
-The speaker replaces each pipe '|' symbol so that only a space is left between the genre names, to avoid issues when working with the data in Python.
Why is it important to convert all text data to lower case according to the speaker?
-Converting all text data to lower case is a good practice for consistency, especially when working with text data in Python.
What does the speaker do to ensure consistency in movie titles?
-The speaker strips the unwanted trailing character found at the end of every movie title, using a negative-index slice inside a lambda function.
How does the speaker save the prepared data for later use?
-The speaker saves the prepared data as a CSV file named 'data_1.csv'.
What is the next step after saving the first prepared dataset according to the tutorial?
-The next step is to prepare the second dataset, which will be covered in the next tutorial.
Outlines
Introduction to Data Preparation for Movie Recommendation
The speaker introduces a tutorial on preparing a dataset for a movie recommendation application. They stress that data preparation is the most time-consuming part of the work, taking up to 90% of a data scientist's or machine learning engineer's time. The project is split into five parts, and this first part covers preparing the first dataset. The dataset is 'movie_metadata.csv', which can be downloaded from a provided link or loaded directly in the notebook. The speaker imports pandas and NumPy, loads the dataset with pandas, and views the first five rows to understand its structure and content.
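The loading and first-look steps described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's notebook: the two sample rows are hypothetical placeholders, and the snake_case column names are assumptions based on the features named in the video.

```python
import io

import pandas as pd

# Tiny stand-in for movie_metadata.csv so the sketch runs without the download.
# The rows are made up and the column names are assumptions.
SAMPLE_CSV = """director_name,actor_1_name,genres,movie_title,title_year
Director A,Actor One,Action|Adventure,Movie One,2009
Director B,Actor Two,Drama,Movie Two,2007
"""


def load_movies(source):
    """Read the movie metadata from a file path or a file-like object."""
    return pd.read_csv(source)


# In practice this would be load_movies("movie_metadata.csv").
data = load_movies(io.StringIO(SAMPLE_CSV))

print(data.head())         # first five rows (here only two exist)
print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # all column names
```

With the real file, `data.shape` is where the speaker reads off the row and column counts before deciding which columns to keep.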
Exploring and Selecting Data Features
In this segment, the speaker explores the dataset to understand its features and selects the relevant ones for the recommendation model. The dataset includes features such as color, director name, number of critic reviews, movie duration, and Facebook likes for directors and actors. The speaker keeps only a subset (director name, the three actor names, genres, and movie title), partly so that the code runs quickly during recording. They then check for missing values and replace them with the word 'unknown' so that no gaps remain in the selected columns.
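A minimal sketch of the feature selection and missing-value handling described above, assuming the snake_case column names shown here (the video only names the features verbally) and using hypothetical rows in place of the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names are assumptions.
data = pd.DataFrame({
    "director_name": ["Director A", np.nan, "Director C"],
    "actor_1_name": ["Actor 1", "Actor 2", np.nan],
    "actor_2_name": ["Actor 3", np.nan, "Actor 4"],
    "actor_3_name": [np.nan, "Actor 5", "Actor 6"],
    "genres": ["Action|Adventure", "Drama", "Comedy|Romance"],
    "movie_title": ["Movie One", "Movie Two", "Movie Three"],
    "duration": [120, 95, 101],  # an extra column that is not selected
})

FEATURES = ["director_name", "actor_1_name", "actor_2_name",
            "actor_3_name", "genres", "movie_title"]
PEOPLE = ["director_name", "actor_1_name", "actor_2_name", "actor_3_name"]

# Keep only the selected features.
data = data[FEATURES].copy()

# Count missing values per column before and after the replacement.
missing_before = data.isna().sum()
data[PEOPLE] = data[PEOPLE].fillna("unknown")
missing_after = data.isna().sum()

print(missing_before)
print(missing_after)
```

After the replacement, `isna()` reports no missing values, because the string 'unknown' is an ordinary value as far as pandas is concerned.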
Cleaning and Preprocessing the Data
The speaker continues with the data cleaning process, removing unwanted symbols and converting the movie titles to lowercase for consistency. They target the pipe symbol '|' in the 'genres' column, replacing it so that only a space separates the genre names, to avoid potential issues during data processing in Python. The speaker also shows that every movie title ends with an unwanted trailing character and strips it with a negative-index slice.
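The three cleaning steps above could look something like this. It is a sketch under assumptions: the rows are hypothetical, the column names are assumed, and a non-breaking space ("\xa0") stands in for the unwanted trailing character, whose exact identity the video does not state.

```python
import pandas as pd

# Hypothetical rows; "\xa0" is an assumed stand-in for the trailing character.
data = pd.DataFrame({
    "genres": ["Action|Adventure|Fantasy", "Drama"],
    "movie_title": ["Movie One\xa0", "The Second Movie\xa0"],
})

# regex=False treats '|' as a literal pipe rather than regex alternation.
data["genres"] = data["genres"].str.replace("|", " ", regex=False)

# Lowercase every title for consistency.
data["movie_title"] = data["movie_title"].str.lower()

# Drop the last character of every title with a negative-index slice.
data["movie_title"] = data["movie_title"].apply(lambda x: x[:-1])

print(data)
```

The slice removes the last character unconditionally, so it is only safe when every title really carries the extra character, as the speaker verifies by spot-checking several rows. If some titles might not, `str.rstrip()` would be a more cautious choice, since it removes only trailing whitespace.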
Saving the Prepared Data for Future Use
Finally, the speaker saves the cleaned and preprocessed data as a new CSV file named 'data_1.csv'. They explain that this prepared dataset will be used in subsequent tutorials to build the movie recommendation system. The speaker concludes the first part by indicating that the next tutorial will cover the preparation of a second dataset, since the project draws on several datasets.
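Saving the prepared frame might be sketched as below. The rows are hypothetical, the temporary directory is used only so the example cleans up after itself (the video saves into the working directory), and `index=False` is an assumption of this sketch rather than something the video confirms.

```python
import os
import tempfile

import pandas as pd

# Hypothetical prepared rows standing in for the cleaned dataset.
data = pd.DataFrame({
    "director_name": ["director a", "unknown"],
    "genres": ["Action Adventure", "Drama"],
    "movie_title": ["movie one", "movie two"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "data_1.csv")

    # index=False keeps the row index out of the file (a choice made here).
    data.to_csv(out_path, index=False)

    # Read the file back to confirm that it round-trips.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

Reading the file back is a cheap check that the later tutorials will see the same columns and values.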
Keywords
Data Preparation
Machine Learning Engineer
Recommendation System
Pandas
CSV
Data Scientist
Feature Selection
Missing Values
Normalization
Data Cleaning
Movie Metadata
Highlights
Introduction to a step-by-step tutorial on preparing a movie recommendation dataset.
Emphasis on the importance of data preparation in data science and machine learning, often accounting for 90% of the work.
Overview of the five-part process for data preparation.
Explanation of how to obtain the dataset and the provided link for downloading.
Demonstration of loading the dataset using pandas.
Instruction on how to view the first five rows of the dataset to understand its structure.
Details on the dataset's features, such as director name, critic reviews, and Facebook likes.
Importance of selecting only the necessary features for the recommendation model.
Procedure to check the shape of the dataset to understand the number of rows and columns.
Explanation of how to handle missing values by replacing them with 'unknown'.
Technique to remove pipe symbols from genre data to avoid issues in data processing.
Conversion of all text data to lowercase to maintain consistency.
Guidance on observing and cleaning data to remove unwanted symbols or characters.
Process of saving the cleaned dataset as a CSV file for future use.
Anticipation of the next tutorial covering the preparation of the second dataset.
Transcripts
all right so let's get started um I'm
going to walk you through step by step
how we're going to get the data set how
we're going to prepare the data set um
to the point that we can use to make
this recommendation okay and then um I
mean use it to build this application to
make the recommendation of the movies
all right so um we're going to do this
in um in in five parts okay that is the
data preparation remember that um most
of the work that you do as a data
scientist or as a machine learning
engineer is I mean you spend almost 90 %
of your time preparing the data I mean
the deployment and everything is just is
just a simple thing okay um or building
the model is just a simple thing but the
most important thing is to get the data
right to prepare it rightly okay and
that is what I'm going to walk you
through in this um in this particular
tutorial we're going to do um it in four
parts so this is this is one all right
and then we go to two then we go to
three right and then we go to four um
yeah I'm still human right and then we
go to um we go to Five okay so th that's
that's what we're going to do um in this
um I mean we it's going to be a
different I mean five different videos
all right I'm not going to put all of
them in this in this um in this in this
tutorial so we're going to do it for the
first the first preparation okay for
this
tutorial okay now um we going to use
data set from different different
sources and I've given you a link over
here where you can just click on the
link and then you get the data set okay
um alternatively I'm going to load the
data set for you you can see that I've
loaded it here okay movie_metadata.csv
that's what I'm going to use um you can
download it from this link right if you
click on this link it's just going to
get um the data over there right I also
give you directly if you don't want to
go and download you can just use it you
don't need to I mean even go there right
so um the first thing first we need to
import the libraries that we're going to
use okay so I'm I hope you're ready I
mean you're ready for us to go through
all right um I've imported pandas
here and then um NumPy all right so let
me run this all good so if you see the
tick here it shows that everything is fine
all right now the next thing that I'm
going to do is to load the data set okay
so you can see over here I'm using um
pd.read_csv then I'm reading the
movie_metadata right so this is the
path right that I just got for you I'm
sure by now you know how to do all these
things right just copy the path and then
you put it over here okay then I want to
see the first five um row right that is
the head of the data set right so let me
run that good so this is this is what we
have in this particular um first part of
the data that we're going to um use okay
let me actually click on this link so
that you know where the data resides
all right just in case you want to
you want to um download so the data is
here just click on this download button
and you get it all right so um let me
just close it all right so that that's
what I have over here now um you can see
that over here we have um the color we
have the director name that's the
director who is directing the movie we
have the number of critic I mean
number of critic for reviews the
duration of the movie the director
Facebook likes uh we have the actor
three Facebook likes actor two name actor
one Facebook likes gross um that is the
gross sales that they have okay and then
we have the genres over here then we
have the actor one name movie title uh
number of voted users um cast total
Facebook likes and all those um I mean a
lot of features over here right a lot of
features as as you can see okay so this
this is the first data that we need okay
and uh I'm going to I'm going to walk
you through what we need to do with this
data okay so um the next thing here is
to see the uh the number of rows and the
number of columns that we have that we
going to um deal with okay so um that is
the shape if I run it you see that we
have around
5,043 uh rows and then we have 28 columns
um over there okay then um the next
thing is to check these number of
columns that we have right so we said
that we have around 28 columns if I run
this you be able to see all the columns
that we have right this are all the
columns we're not going to use all of
them I mean we going to select those
that are important to us right um if you
want if you want you can use all of them
but I just want this code to run
otherwise um it's going to take a bit of
time and uh I mean a bit of memory here
I don't want to um be pausing it as as I
am recording right but if I mean on your
own free time you can just use as many
features that you want right but I'm
going to cut I mean I'm going to just
select some of the some of the features
to do this okay um the next thing that
I'm going to do over here is that I'm
going to show you um the I mean we have
we have data up to 2016 okay we have the
data up to 2016 um later on I'm going to
show you how you can get the data from
2017 2018 2019 2020 I'm going to show
you how you can do it we will do that in
the um the I mean when we doing the
processing in in the second the second
um tutorial right I'm going to show you
how you can get the other data but this
one is up to 2016 okay if I run this
code over
here okay now you can see um over here
right you can see it's you can see the
date is up to 2016 okay it's up to 2016
all the way from um 1916 right
1916 uh all the way to 2016 that's the
data that we have so you can see that
it's not a small data right all the way
from 1916 to um 2016 okay okay so over
here I'm just using matplotlib and then the
data that is the data that we loaded up
here right the data that we loaded up
here is what um you can see over here
right and then the title year right the
title year is one of the columns um over
here right let me go over here show you
that um title year title year title year
come on come on come on come on come on
where are you where are you oh let me
see let me see yeah it's here okay title
year okay so that's that's that's you
can see that is 2009 2007 2015 2012
that's that's what I'm selecting okay
and then um the value counts right to
count how many of them right if there
are any um four I mean if there are any
missing values right I'm not going to
drop them so that's why over here you
can see the NaN over here right you can
see that there's NaN over here then I sort
it from I mean from the highest to the
smallest right so that that's why you
can see something like that and then um
I plotted it finally using a bar plot okay
the figure size is just how big and how
tall you want it to be right so 15
height and then 16 wide that's that's
that's basically what is here now um we
know that we have data up to 2016 that's
the only thing I wanted to show you over
here now the next thing that I want to
do um let me get rid of this
going push this one a bit right now um I
told you that I'm going to select some
of the features right I'm not going to
use all the features in this data set to
build um I mean to finally um do the
recommendation right uh as I said I want
it to run as fast as possible so that I
can walk you through the steps but in
your own free time you can just leave
all the features it's not going to cause
any harm it's actually going to give you
a good prediction in fact more than even
what I'm going to show you because the
more the data the the I mean the
the more the model is going to learn
right the better the model is going to
learn all right now what I'm going to
select over here is is this I'm going to
select the director name the actor name
I mean actor one name actor two name
actor three name and then the genres
right and then the movie title these are
the things that I'm going to select to I
mean moving forward this is what I'm
going to use okay so um actually let me
run this one and then um here get rid of
it now let me show you the data finally
finally okay so now you see that we have
this one director name actor one actor
two actor three and then genres and
finally we have movie title okay that's
the only thing that I need over here
okay now what I'm going to do is um if
you if you check
um over here you you you can realize
that there are some missing values in
there right if uh maybe let me do this
one so that it will be clear it will be
better over here if I do data dot um isna
right dot um maybe let me first do isna and
see what happens over here okay I'm
going to have some false false false you
see there's some true over there if
there is true there it means that
there's a missing value there right you
can see there there's some true there
what will even make it
more um more understandable is this okay
now you see that director name there are
around 104 missing um director names
right there are around um seven actor one I
mean actor one names which are missing and
then um about 13 of the actor two names
which are missing 23 of the actor three
names which are missing and in genres and
movie title there's nothing missing in
there okay so if there's anything
missing what I'm going to do is that I
want to replace all the missing um
values with unknown right with the
letter unknown I mean with the word
unknown right that's what I'm going to
do so I'll go into all the columns that are
having missing values that's director name
actor one actor two actor three right
that's what I'm doing over here okay and
then I'm going to put um I mean in place
of the missing value I'm going to put
unknown over there so you can see that
I'm using unknown over here okay unknown
over here and that's that's that's
that's what I want to do so if I see any
I mean any NaN I'm just going to put
unknown over there all right so that's
that's basically what I want to um do
now um let me run that so that um that
get done now over here what I'm going to
do is this I'm going to show you the
data and now you're going to see that um
this what we have right we we don't have
the NaN if if I show you again you see that
um let me bring this code that we used
to check um if there are any missing
values now if I do that again let me do
this and then come here if I run this
code now you see that there are no
missing values right there are no missing values
because Python doesn't understand unknown
as a missing value it just considers it
as a value right the only thing that
Python understands to be missing is NaN
all right so um that's that's that's
fine that's what I want to do over here
now the next thing that I want to do is
um this I want to strip this thing if if
you if you if you observe let's go up
here right if you observe the genres
right you observe the genres you see
that we have action right we have
action over here now in between action
and then Adventure you see we have
action adventure we have fantasy and
then we have um sci-fi okay now in
between action and adventure you see
that we have this thing over here we
have this pipe um symbol over here the
same thing you see that we have it over
here you see that we have another one
over here right now what I want to do is
that I want to strip off that right
because that becomes a symbol and you're
working with python I mean you're
working with data that has this kind of
symbols it causes a lot of problems in
Python right so I want to strip it off
wherever I see it I just want to take
them off right in order of the data
that's what I want to do over here all
right now let's go down here and you can
see what I'm doing so um if I pick if I
pick the genres right if I pick the
genres I'm going to I'm going to see
this one is actually a string it's a
string right it's a string um so I'm
using .str right representing string
and I'm going to use this replace
function and then replace this with
nothing okay I'm just going to replace
this um let me it's actually like this
I'm going to replace this pipe sign with
nothing all right so whenever I see a
pipe sign I replace it with nothing
that's basically what is going on over
here so let me run
that all right that's fine that's fine
now um let me show you the data
again okay now let's go to the genres
column uh where is it now you see that
we see that we have nothing there right
it's replaced with nothing just the
space right just the space and then we
don't have this um that pipe sign over
there again that's that's fine that's
what I want to do okay now um the next
thing that I want to do is convert
everything to a lower case okay if
you're working with text it's a good
practice that um all your data is
consistent right not that some are in
capital letters some are in small
letters it's actually um if you're
working with data using especially using
Python to work with data I mean work
with text in particular it is a good
thing to do all right so I'm going to
convert everything to lower case right
that's that's what I'm going to do so
the movie title um if you go to the
movie title I'm going to pick everything
here right every movie title over here
and then convert it to lower case all
right so let me run that that's fine uh
then let me show you what is there right
now if we go to the movie title right
you see that everything is lower case
right unlike here you see that they were
um starting with capital letters and
then um so on and so forth but if you
see here everything everything is in
lower case right there's no caps over
there that's that's fine that's what we
want to do um now it's a good uh I mean
a good practice to observe your data to
make sure that there are no other
symbols over there remember that we've
dealt with this pipe sign right we we
were able to remove um this pipe sign
from there right now let me show you
some of the data over here right um I'm
going to the title the movie title
column right in the movie title column
I'm just going to pick one data over
there right any any data at all so in
this case I'm going to pick a data at
column 8 right so let me run that so you
see you see that um this is the data
that is The Avengers Age of Ultron this
is the movie title over there right at
column 8 um here it's not showing
because it's um there's a break let okay
let me show you the first one which is
the zero right let me um change with
this instead of eight let me put zero
there and then uh let me run it now you
see that this is Avatar right that's that's
what is there let me go to the you see
that is there Avatar that's that's what is
there now if you see um this right there
is this this this symbol over there
which I don't like right this is this
symbol over there which I don't like
okay I need to remove this because this
is a terminating it's a null terminating
character I don't I don't want that
because it's going to cause a lot of
confusion later on when you're working
with it okay so I want to strip that off
right it's it's present in all of the
movie titles right it's so it's it's a
good practice to um take time to observe
your data and see what is there if I
search for something else maybe 24 a
data point that's row number 24 let me
see you see that's there you see that
it's still at the end over there okay
it's present in all of them right let me
search for something around maybe um
say 67 right data point over there and
see you see that is there right you see
that is there so I mean I want to strip
that right I want to strip that it's
present in every every data point if I
search for say 21 data point at
21 you see that is there right so that's
that's what I'm going to do in the next
um cell okay so what I'm going to do
over here is that if I pick the movie
title okay what I'm going to do is that
I'm going to strip everything at the end
this is a negative indexing right
starting from the end so whatever is
there at the end there I want to strip
it okay so I'm using Lambda function for
that okay Lambda function for that so
the x is actually representing the data
right just represent I mean the title
right it's going to represent all the
title and then the last I mean the from
negative one is actually going to give
you that right it's going to strip
everything every every every character
at the end all right there's a negative
index and I'm sure if you've um
gone through the Python tutorial you
understand negative indexing all right
now once um this part is done the next
thing that you need to do is to check to
see if everything is done right so let
me run this one okay now I'm going to
check the data point at um at at at I
mean at at row number one and see what
is there now you see that at the end
it's not there right if I check for
let's see some of them okay 21 it was
there right so let me see at 21 21 come
on
okay now you see that it's not there The
Amazing Spider-Man that's that's all you
see that over here was Amazing
Spider-Man and something else now it's
just the Amazing Spider-Man that's the
only thing that I want I want these
things over there okay now we are good
to go for that okay um what I'm going to
do is that I'm going to save this data
that I've prepared so far this is what
I've done right this is what I've done
and I'm going to save this data um that
I've prepared right this is just one of
the data that we're going to use so I'm
going to save it and then um I'm going
to use data.to_csv that is to save it
we're going to use it later on okay um
the name that I'm giving is um data_1.csv
right you can just give it any name
that you want right you can just give it
any name that you want so I'm going to
run this one over here and then um Let
me refresh it over now you see that it's
there right data_1.csv I've saved it so
I've prepared the first data set and
I've saved it right we are good to go
for the to prepare the second data set
right I told you we using different
different data sets to do this right so
now um the next tutorial we're going to
see how we prepare the second data right
so see you in the next tutorial
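The release-year check described in the transcript (count the films per year, keep the missing years in the count, then draw a bar plot) could be sketched as follows. The column name `title_year` and the sample values are assumptions, and the plotting helper is left uncalled so that the example runs without a display; it needs matplotlib when used.

```python
import numpy as np
import pandas as pd

# Hypothetical release years, including one missing value.
data = pd.DataFrame(
    {"title_year": [2009, 2007, 2015, 2012, 2009, np.nan, 2015, 2009]}
)

# dropna=False keeps NaN as its own category; value_counts already
# returns the counts sorted from most to least frequent.
year_counts = data["title_year"].value_counts(dropna=False)
print(year_counts)


def plot_year_counts(counts):
    """Draw the counts as a bar plot, 16 wide by 15 tall."""
    return counts.plot(kind="bar", figsize=(16, 15))
```

Calling `plot_year_counts(year_counts)` in a notebook would show one bar per year, which is how the speaker confirms that the data runs up to 2016.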