Data Preparation (PART 1) - Building a Netflix Recommendation System

Data Mentor
12 Oct 2023 18:10

Summary

TL;DR: The video is a tutorial on preparing a movie dataset for a recommendation application. It emphasizes the importance of data preparation, which the speaker says takes up almost 90% of a data scientist's time. The series is divided into five parts; this first part starts by importing pandas and NumPy and loading the dataset. It then walks through selecting the relevant columns, checking for missing values and replacing them with 'unknown', and cleaning the data by removing the pipe symbol from the genres, converting movie titles to lowercase, and stripping a trailing character from each title. The cleaned data is saved for use in subsequent tutorials.

Takeaways

  • The series has five parts; this first part prepares the first dataset for movie recommendation.
  • Data preparation is emphasized as the most time-consuming part, taking up almost 90% of a data scientist's or machine learning engineer's work.
  • The dataset can be downloaded from a link the speaker provides, or used directly from the copy already loaded in the notebook.
  • The script uses the Python libraries pandas and NumPy for data manipulation.
  • The initial step is to load the dataset and view the first five rows to understand the data structure.
  • The dataset contains various features such as director name, critic reviews, movie duration, and Facebook likes for directors and actors.
  • The script demonstrates how to check the shape of the dataset, list its columns, and select the features that matter for the analysis.
  • The data is cleaned by replacing missing values with 'unknown' and removing unnecessary symbols such as the pipe sign.
  • The script converts the movie titles to lowercase to maintain consistency.
  • The cleaned and prepared data is saved as 'data_1.csv' for future use in the recommendation system.

Q & A

  • What is the main focus of the tutorial described in the transcript?

    -The main focus of the tutorial is to guide users through the process of data preparation for building a movie recommendation application.

  • How much time is typically spent on data preparation in a data science project according to the speaker?

    -The speaker mentions that data scientists or machine learning engineers spend almost 90% of their time on data preparation.

  • What is the first step the speaker takes in preparing the dataset?

    -The first step in preparing the dataset is to import the necessary libraries, specifically pandas and NumPy.

  • What does the speaker suggest doing with missing values in the dataset?

    -The speaker suggests replacing missing values with the word 'unknown'.

  • What specific data is the speaker using from the 'movie_metadata.csv' file?

    -The speaker is using the director name, the three actor names, the genres, and the movie title from the 'movie_metadata.csv' file.

  • How does the speaker handle the pipe '|' symbol found in the genre data?

    -The speaker removes the pipe '|' symbol, leaving the genre names separated only by a space, to avoid issues when working with the data in Python.

  • Why is it important to convert all text data to lower case according to the speaker?

    -Converting all text data to lower case is a good practice for consistency, especially when working with text data in Python.

  • What does the speaker do to ensure consistency in movie titles?

    -The speaker strips the unwanted trailing character found at the end of every movie title to ensure consistency.

  • How does the speaker save the prepared data for later use?

    -The speaker saves the prepared data as a CSV file named 'data_1.csv'.

  • What is the next step after saving the first prepared dataset according to the tutorial?

    -The next step is to prepare the second dataset, which will be covered in the next tutorial.

Outlines

00:00

Introduction to Data Preparation for Movie Recommendation

The speaker begins by introducing a tutorial on preparing a dataset for a movie recommendation application. They stress that data preparation is the most time-consuming part of the work, often taking up almost 90% of a data scientist's or machine learning engineer's time. The series is structured into five parts, and this first video covers the first round of data preparation. The dataset used is 'movie_metadata.csv', which can be downloaded from a link the speaker provides or used directly from the copy loaded in the notebook. The speaker imports pandas and NumPy, loads the dataset, and views the first five rows to understand its structure and content.
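The loading and first-look steps described above can be sketched as follows. This is a minimal sketch, not the speaker's exact notebook: the in-memory `SAMPLE_CSV` text and its made-up rows are hypothetical stand-ins so the example runs without the real file, and the column names are assumptions based on the features named in the video.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file, with made-up rows.
# The real file is described in the video as having 28 columns.
SAMPLE_CSV = (
    "director_name,actor_1_name,genres,movie_title,title_year\n"
    "Director A,Actor One,Action|Adventure,Movie One,2009\n"
    "Director B,Actor Two,Comedy,Movie Two,2015\n"
    ",Actor Three,Drama|Romance,Movie Three,\n"
)

# With the real file on disk this would be pd.read_csv("movie_metadata.csv").
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(data.head())          # first five rows (only three exist here)
print(data.shape)           # (number of rows, number of columns)
print(list(data.columns))   # every column name
```

Empty fields in the CSV text are read as NaN, which is how missing values appear in the later steps.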

05:02

πŸ” Exploring and Selecting Data Features

In this segment, the speaker discusses the importance of exploring the dataset to understand its features and selecting the relevant ones for the recommendation model. They mention that the dataset includes various features such as color, director name, critic reviews, movie duration, and Facebook likes for directors and actors. The speaker decides to focus on a subset of these features, including director name, actor names, and movie title, as they are more relevant for the recommendation system. They also address the issue of missing values, choosing to replace them with the word 'unknown' to maintain consistency in the dataset.
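A small sketch of the feature selection and missing-value check might look like this. The rows are hypothetical, and the column names (`director_name`, `actor_1_name`, and so on) are assumed spellings of the features the speaker lists.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; np.nan marks the missing entries.
data = pd.DataFrame({
    "color": ["Color", "Color", "Color"],
    "director_name": ["Director A", np.nan, "Director C"],
    "actor_1_name": ["Actor One", "Actor Two", "Actor Three"],
    "actor_2_name": ["Actor Four", np.nan, "Actor Five"],
    "actor_3_name": [np.nan, np.nan, "Actor Six"],
    "genres": ["Action|Adventure", "Comedy", "Drama"],
    "movie_title": ["Movie One", "Movie Two", "Movie Three"],
    "duration": [120, 95, 101],
})

# Keep only the columns used for the recommendation model.
features = ["director_name", "actor_1_name", "actor_2_name",
            "actor_3_name", "genres", "movie_title"]
data = data.loc[:, features]

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = data.isna().sum()
print(missing_per_column)
```

Counting per column is easier to read than the raw True/False table, which is the reason the speaker moves from one to the other.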

10:03

πŸ› οΈ Cleaning and Preprocessing the Data

The speaker continues with the data cleaning process, focusing on removing unwanted symbols and converting the movie titles to lowercase to maintain consistency. They specifically target the pipe symbol '|' found in the 'genres' column, removing it so that the genre names are separated only by a space, to avoid potential issues during later processing in Python. The speaker also demonstrates how to strip the unwanted trailing character from every movie title to ensure clean data for analysis.
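The three cleaning steps can be sketched roughly as below. The rows are hypothetical, and the non-breaking space `"\xa0"` is only an assumed example of the trailing character; the transcript does not say which character it is.

```python
import pandas as pd

data = pd.DataFrame({
    "genres": ["Action|Adventure|Sci-Fi", "Comedy"],
    # "\xa0" is an assumed stand-in for the unwanted trailing character.
    "movie_title": ["Movie One\xa0", "Movie Two\xa0"],
})

# Replace each pipe with a space. regex=False treats "|" as a literal
# character instead of the regular-expression alternation operator.
data["genres"] = data["genres"].str.replace("|", " ", regex=False)

# Lowercase every title for consistency.
data["movie_title"] = data["movie_title"].str.lower()

# Negative slicing drops the last character of every title.
data["movie_title"] = data["movie_title"].str[:-1]

print(data)
```

Slicing with `[:-1]` removes the last character unconditionally, so it is only safe when every title really ends with the unwanted character, as the speaker reports. A call such as `str.rstrip("\xa0")` would be a more defensive alternative. Passing `regex=False` also matters in older pandas releases, where `str.replace` treated the pattern as a regular expression by default.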

15:05

Saving the Prepared Data for Future Use

Finally, the speaker demonstrates how to save the cleaned and preprocessed data as a new CSV file named 'data_1.csv'. They explain that this prepared dataset will be used in subsequent tutorials to build the movie recommendation system. The speaker concludes the first part of the tutorial by indicating that the next tutorial will cover the preparation of a second dataset, suggesting a multi-part approach to the overall project.
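Saving the prepared frame could look like the sketch below. The temporary directory is used only to keep the example self-contained, and `index=False` is an assumption: the transcript does not show which arguments the speaker passes to `to_csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned rows standing in for the prepared dataset.
data = pd.DataFrame({
    "director_name": ["director a", "unknown"],
    "genres": ["Action Adventure", "Comedy"],
    "movie_title": ["movie one", "movie two"],
})

# Write into a temporary directory so the sketch leaves no stray files.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "data_1.csv")

# index=False (an assumption) keeps the row index out of the file.
data.to_csv(out_path, index=False)

# Read the file back to confirm that the round trip preserved the data.
reloaded = pd.read_csv(out_path)
print(reloaded.equals(data))
```

Leaving the index out means the file reloads with the same columns it was saved with, rather than gaining an extra unnamed column.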

Keywords

Data Preparation

Data preparation is a critical phase in data science and machine learning projects, involving cleaning, organizing, and transforming raw data into a format suitable for analysis. In the video, the speaker emphasizes that data preparation is the most time-consuming part of the process, often accounting for up to 90% of a data scientist's work. The script describes how to get a dataset and prepare it for building a movie recommendation application, highlighting the importance of this step.

Machine Learning Engineer

A machine learning engineer is a professional who applies machine learning techniques to build systems and software solutions. They are involved in designing, training, and deploying machine learning models. The script mentions machine learning engineers in the context of spending a significant amount of time on data preparation, which is a foundational task for their role.

Recommendation System

A recommendation system is an algorithm that suggests items or products to users based on their preferences. In the script, the speaker is guiding viewers on how to prepare data to build an application for movie recommendations, which is a type of recommendation system. The system would use the prepared data to suggest movies that users might like.

Pandas

Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures and functions needed to work with structured data. In the script, the speaker imports Pandas to handle the dataset, demonstrating its use for loading, viewing, and manipulating the data.

CSV

CSV stands for Comma-Separated Values, a file format that stores tabular data in plain text. The script loads its dataset from 'movie_metadata.csv', a CSV file containing information about movies. CSV files are commonly used for data exchange between different applications.

Data Scientist

A data scientist is a professional who analyzes, processes, and interprets complex digital data from various sources, using scientific methods, processes, algorithms, and systems to extract insights and support decision-making. The video script refers to the data scientist's role in spending a significant amount of time on data preparation.

Feature Selection

Feature selection is the process of reducing the number of input variables in a dataset by selecting a subset of the most relevant features for model construction. In the script, the speaker decides to select certain features like director name, actor names, and movie title for the recommendation model, illustrating the concept of feature selection.

Missing Values

Missing values refer to the absence of data in a dataset. In the script, the speaker checks for missing values in the dataset and decides to replace them with the word 'unknown'. Handling missing values is an important step in data preparation to ensure the quality of the dataset.
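The replacement step might be sketched like this. The rows are hypothetical, and `fillna` is one standard pandas way to do it; the transcript does not show the speaker's exact call.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing names.
data = pd.DataFrame({
    "director_name": ["Director A", np.nan],
    "actor_1_name": [np.nan, "Actor Two"],
    "movie_title": ["Movie One", "Movie Two"],
})

# Fill the gaps in the name columns with the placeholder word.
name_columns = ["director_name", "actor_1_name"]
data[name_columns] = data[name_columns].fillna("unknown")

# "unknown" is an ordinary string, so pandas no longer counts it as missing.
remaining = int(data.isna().sum().sum())
print(remaining)
```

This matches the point made in the video: after the replacement the missing-value check reports nothing, because only NaN is treated as missing and the placeholder is just another value.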

Normalization

Normalization is the process of adjusting values measured on different scales to a notionally common scale. Although not explicitly mentioned in the script, the concept is alluded to when discussing the need to handle different data types and formats, such as converting all text to lower case, which is a form of normalization in text data.

Data Cleaning

Data cleaning involves correcting or removing corrupt, inaccurate, duplicate, or improperly formatted data. In the script, the speaker describes cleaning the dataset by removing pipe symbols and converting all text to lower case, which are common data cleaning tasks.

Movie Metadata

Movie metadata refers to structured information about movies, such as title, director, actors, genre, and release date. The script loads a dataset called 'movie_metadata.csv', which contains various metadata attributes of movies that will be used to build the recommendation system.

Highlights

Introduction to a step-by-step tutorial on preparing a movie recommendation dataset.

Emphasis on the importance of data preparation in data science and machine learning, often accounting for 90% of the work.

Overview of the five-part process for data preparation.

Explanation of how to obtain the dataset and the provided link for downloading.

Demonstration of loading the dataset using pandas.

Instruction on how to view the first five rows of the dataset to understand its structure.

Details on the dataset's features, such as director name, critic reviews, and Facebook likes.

Importance of selecting only the necessary features for the recommendation model.

Procedure to check the shape of the dataset to understand the number of rows and columns.

Explanation of how to handle missing values by replacing them with 'unknown'.

Technique to remove pipe symbols from genre data to avoid issues in data processing.

Conversion of all text data to lowercase to maintain consistency.

Guidance on observing and cleaning data to remove unwanted symbols or characters.

Process of saving the cleaned dataset as a CSV file for future use.

Anticipation of the next tutorial covering the preparation of the second dataset.

Transcripts

play00:00

all right so let's get started um I'm

play00:02

going to walk you through step by step

play00:04

how we're going to get the data set how

play00:06

we're going to prepare the data set um

play00:08

to the point that we can use to make

play00:10

this recommendation okay and then um I

play00:12

mean use it to build this application to

play00:14

make the recommendation of the movies

play00:16

all right so um we're going to do this

play00:18

in um in in five parts okay that is the

play00:22

data preparation remember that um most

play00:24

of the work that you do as a data

play00:25

scientist or as a machine learning

play00:27

engineer is I mean you spend almost 90 %

play00:30

of your time preparing the data I mean

play00:32

the deployment and everything is just is

play00:34

just a simple thing okay um or building

play00:36

the model is just a simple thing but the

play00:38

most important thing is to get the data

play00:40

right to prepare it rightly okay and

play00:42

that is what I'm going to walk you

play00:43

through in this um in this particular

play00:45

tutorial we're going to do um it in four

play00:48

parts so this is this is one all right

play00:50

and then we go to two then we go to

play00:52

three right and then we go to four um

play00:56

yeah I'm still human right and then we

play00:58

go to um we go to Five okay so th that's

play01:01

that's what we're going to do um in this

play01:04

um I mean we it's going to be a

play01:06

different I mean five different videos

play01:08

all right I'm not going to put all of

play01:09

them in this in this um in this in this

play01:12

tutorial so we're going to do it for the

play01:14

first the first preparation okay for

play01:16

this

play01:17

tutorial okay now um we going to use

play01:21

data set from different different

play01:22

sources and I've given you a link over

play01:24

here where you can just click on the

play01:26

link and then you get the data set okay

play01:28

um alternatively I'm going to load the

play01:30

data set for you you can see that I've

play01:31

loaded it here okay movie_metadata.csv

play01:35

that's what I'm going to use um you can

play01:37

download it from this link right if you

play01:39

click on this link it's just going to

play01:40

get um the data over there right I also

play01:43

give you directly if you don't want to

play01:44

go and download you can just use it you

play01:46

don't need to I mean even go there right

play01:49

so um the first thing first we need to

play01:51

import the libraries that we're going to

play01:52

use okay so I'm I hope you're ready I

play01:55

mean you're ready for us to go through

play01:56

all right um we I've imported pandas

play02:00

here and then um NumPy all right so let

play02:02

me run this all good so if you see the

play02:05

tick here it shows that everything is fine

play02:07

all right now the next thing that I'm

play02:08

going to do is to load the data set okay

play02:12

so you can see over here I'm using um

play02:15

pd.read_csv then I'm reading the

play02:18

movie_metadata right so this is the

play02:20

path right that I just got for you I'm

play02:22

sure by now you know how to do all these

play02:24

things right just copy the path and then

play02:25

you put it over here okay then I want to

play02:28

see the first five um row right that is

play02:31

the head of the data set right so let me

play02:32

run that good so this is this is what we

play02:36

have in this particular um first part of

play02:39

the data that we're going to um use okay

play02:42

let me actually click on this link so

play02:44

that you know where the data is reside

play02:46

in all right just in case you want to

play02:48

you want to um download so the data is

play02:50

here just click on this download button

play02:51

and you get it all right so um let me

play02:53

just close it all right so that that's

play02:55

what I have over here now um you can see

play02:59

that over here we have um the color we

play03:01

have the director name that's the

play03:03

director who is directing the movie we

play03:05

have the number of critic movie I mean

play03:07

number of critic for reviews the

play03:08

duration of the movie the director

play03:11

Facebook likes uh we have the actor

play03:14

three Facebook likes actor two name actor

play03:16

one Facebook likes gross um that's the

play03:19

gross sales that they have okay and then

play03:21

we have the genres over here then we

play03:23

have the actor one name movie title uh

play03:26

number of voted users um cast total

play03:29

Facebook likes and all those um I mean a

play03:32

lot of features over here right a lot of

play03:34

features as as you can see okay so this

play03:37

this is the first data that we need okay

play03:40

and uh I'm going to I'm going to walk

play03:43

you through what we need to do with this

play03:45

data okay so um the next thing here is

play03:47

to see the uh the number of rows and the

play03:50

number of columns that we have that we

play03:51

going to um deal with okay so um that is

play03:54

the shape if I run it you see that we

play03:56

have around

play03:58

5,043 uh rows and then we have 28 columns

play04:02

um over there okay then um the next

play04:05

thing is to check these number of

play04:07

columns that we have right so we said

play04:09

that we have around 28 columns if I run

play04:11

this you be able to see all the columns

play04:13

that we have right this are all the

play04:16

columns we're not going to use all of

play04:17

them I mean we going to select those

play04:19

that are important to us right um if you

play04:21

want if you want you can use all of them

play04:23

but I just want this code to run

play04:25

otherwise um it's going to take a bit of

play04:27

time and uh I mean a bit of memory here

play04:30

I don't want to um be pausing it as as I

play04:33

am recording right but if I mean on your

play04:35

own free time you can just use as many

play04:38

features that you want right but I'm

play04:39

going to cut I mean I'm going to just

play04:41

select some of the some of the features

play04:43

to do this okay um the next thing that

play04:46

I'm going to do over here is that I'm

play04:48

going to show you um the I mean we have

play04:51

we have data up to 2016 okay we have the

play04:53

data up to 2016 um later on I'm going to

play04:56

show you how you can get the data from

play04:58

2017 201 18 2019 2020 I'm going to show

play05:02

you how you can do it we will do that in

play05:04

the um the I mean when we doing the

play05:07

processing in in the second the second

play05:10

um tutorial right I'm going to show you

play05:11

how you can get the other data but this

play05:13

one is up to 2016 okay if I run this

play05:16

code over

play05:19

here okay now you can see um over here

play05:22

right you can see it's you can see the

play05:25

date is up to 2016 okay it's up to 2016

play05:28

all the way from um 1916 right

play05:31

1916 uh all the way to 2016 that's the

play05:35

data that we have so you can see that

play05:36

it's not a small data right all the way

play05:38

from 1916 to um 2016 okay okay so over

play05:43

here I'm just using M lab and then the

play05:45

data that is the data that we loaded up

play05:47

here right the data that we loaded up

play05:50

here is what um you can see over here

play05:53

right and then the title here right the

play05:55

title here is one of the columns um over

play05:57

here right let me go over here show you

play06:01

that um Title Here Title Here Title Here

play06:04

Come on come on come on come on come on

play06:06

where are you where are you oh let me

play06:08

see let me see yeah it's here okay title

play06:11

here okay so that's that's that's you

play06:13

can see that is 2009 2007 2015 2012

play06:16

that's that's what I'm selecting okay

play06:18

and then um the value counts right to

play06:20

count how many of them right if there

play06:22

are any um four I mean if there are any

play06:24

missing values right I'm not going to

play06:26

drop them so that's why over here you

play06:27

can see the NaN over here right you can

play06:29

see that there's NaN over here then I sort

play06:32

it from I mean from the highest to the

play06:35

smallest right so that that's why you

play06:37

can see something like that and then um

play06:39

I plotted finally using a bar plot okay

play06:42

the figure size is just how big and how

play06:44

tall you want it to be right so 15

play06:46

height and then 16 wide that's that's

play06:48

that's basically what is here now um we

play06:51

know that we have data up to 2016 that's

play06:53

the only thing I wanted to show you over

play06:54

here now the next thing that I want to

play06:56

do um let me get rid of this

play07:00

going push this one a bit right now um I

play07:03

told you that I'm going to select some

play07:05

of the features right I'm not going to

play07:06

use all the features in this data set to

play07:09

build um I mean to finally um do the

play07:12

recommendation right uh as I said I want

play07:14

it to run as fast as possible so that I

play07:16

can walk you through the steps but in

play07:18

your own free time you can just leave

play07:19

all the features it's not going to cause

play07:21

any harm it's actually going to give you

play07:23

a good prediction in fact more than even

play07:25

what I'm going to show you because the

play07:26

more the data the the I mean the

play07:29

the more the model is going to learn

play07:31

right the better the model is going to

play07:33

learn all right now what I'm going to

play07:35

select over here is is this I'm going to

play07:37

select the director name the actor name

play07:40

I mean actor one name actor two name

play07:42

actor three name and then the genres

play07:45

right and then the movie title these are

play07:46

the things that I'm going to select to I

play07:48

mean moving forward this is what I'm

play07:50

going to use okay so um actually let me

play07:53

run this one and then um here get rid of

play07:57

it now let me show you the data finally

play08:00

finally okay so now you see that we have

play08:02

this one director name actor one actor

play08:04

two actor three and then genres and

play08:06

finally we have movie title okay that's

play08:09

the only thing that I need over here

play08:11

okay now what I'm going to do is um if

play08:14

you if you check

play08:16

um over here you you you can realize

play08:20

that there are some missing values in

play08:22

there right if uh maybe let me do this

play08:24

one so that it will be clear it will be

play08:27

better over here if I do data dot um isna

play08:32

right dot um maybe let me first do isna and

play08:36

see what happens over here okay I'm

play08:38

going to have some false false false you

play08:40

see there's some true over there if

play08:41

there is true there it means that

play08:43

there's a missing value there right you

play08:45

can see there there's some true there

play08:48

what will even make it

play08:49

more um more understandable is this okay

play08:54

now you see that director name there are

play08:55

around 104 missing um director names

play08:58

right there are around um seven actor one I

play09:02

mean actor one which is missing

play09:04

actor one names which are missing and

play09:05

then um about 13 of the actor two names

play09:08

which are missing 23 of the actor three

play09:10

names which are missing and in genres

play09:12

movie title there's nothing missing in

play09:14

there okay so if there's anything

play09:17

missing what I'm going to do is that I

play09:19

want to replace all the missing um

play09:21

values with unknown right with the

play09:23

letter unknown I mean with the word

play09:25

unknown right that's what I'm going to

play09:27

do so I'll go into all the columns that are

play09:29

having missing values that's director name

play09:31

actor one actor two actor three right

play09:34

that's what I'm doing over here okay and

play09:36

then I'm going to put um I mean in place

play09:39

of the missing value I'm going to put

play09:41

unknown over there so you can see that

play09:43

I'm using unknown over here okay unknown

play09:46

over here and that's that's that's

play09:48

that's what I want to do so if I see any

play09:50

I mean any NaN I'm just going to put

play09:52

unknown over there all right so that's

play09:54

that's basically what I want to um do

play09:56

now um let me run that so that um that

play09:59

get done now over here what I'm going to

play10:02

do is this I'm going to show you the

play10:05

data and now you're going to see that um

play10:08

this what we have right we we don't have

play10:10

the NaN if I show you again you see that

play10:14

um let me bring this code that we used

play10:16

to check um if there are any missing

play10:19

values now if I do that again let me do

play10:23

this and then come here if I run this

play10:25

code now you see that there are no

play10:27

missing values right there are no missing

play10:29

because python doesn't understand unknown

play10:32

as a missing value it just considers it

play10:34

as a value right the only thing that

play10:36

python understands to be missing is NaN

play10:40

all right so um that's that's that's

play10:42

fine that's what I want to do over here

play10:44

now the next thing that I want to do is

play10:46

um this I want to strip this thing if if

play10:50

you if you if you observe let's go up

play10:52

here right if you observe the genres

play10:55

right you observe the genres you see

play10:57

that we have action right we have

play10:59

action over here now in between action

play11:02

and then Adventure you see we have

play11:04

action adventure we have fantasy and

play11:06

then we have um sci-fi okay now in

play11:09

between action and adventure you see

play11:10

that we have this thing over here we

play11:13

have this pipe um symbol over here the

play11:15

same thing you see that we have it over

play11:17

here you see that we have another one

play11:19

over here right now what I want to do is

play11:21

that I want to strip off that right

play11:23

because that becomes a symbol and you're

play11:25

working with python I mean you're

play11:27

working with data that has this kind of

play11:29

symbols it causes a lot of problems in

play11:32

Python right so I want to strip it off

play11:33

wherever I see it I just want to take

play11:36

them off right in order of the data

play11:38

that's what I want to do over here all

play11:40

right now let's go down here and you can

play11:43

see what I'm doing so um if I pick if I

play11:46

pick the genres right if I pick the the

play11:49

genres I'm going to I'm going to see

play11:51

this one is actually a string it's a

play11:53

string right it's a string um so I'm

play11:55

using str right representing string

play11:58

and I'm going to use this replace

play11:59

function and then replace this with

play12:02

nothing okay I'm just going to replace

play12:04

this um let me it's actually like this

play12:06

I'm going to replace this pipe sign with

play12:08

nothing all right so whenever I see a

play12:11

pipe sign I replace it with nothing

play12:12

that's basically what is going on over

play12:14

here so let me run

play12:18

that all right that's fine that's fine

play12:20

now um let me show you the data

play12:24

again okay now let's go to the genres

play12:27

column uh where is it now you see that

play12:31

we see that we have nothing there right

play12:33

it's replaced with nothing just the

play12:34

space right just the space and then we

play12:36

don't have this um that pipe sign over

play12:38

there again that's that's fine that's

play12:40

what I want to do okay now um the next

play12:43

thing that I want to do is convert

play12:45

everything to a lower case okay if

play12:47

you're working with text it's a good

play12:48

practice that um all your data is is

play12:51

consistent right know that some are in

play12:53

capital letters some are in small

play12:54

letters it's actually um if you're

play12:56

working with data using especially using

play12:58

python to work with data I mean work

play13:00

with text in particular it is a good

play13:03

thing to do all right so I'm going to

play13:05

convert everything to lower case right

play13:08

that's that's what I'm going to do so

play13:09

the movie title um if you go to the

play13:12

movie title I'm going to pick everything

play13:14

here right every movie title over here

play13:16

and then convert it to lower case all

play13:19

right so let me run that that's fine uh

play13:23

then let me show you what is there right

play13:26

now if we go to the movie title right

play13:28

you see that everything is lower case

play13:30

right unlike here you see that they were

play13:32

um starting with capital letters and

play13:34

then um so on and so forth but if you

play13:36

see here everything everything is in

play13:38

lower case right there's no caps over

play13:40

there that's that's fine that's what we

play13:42

want to do um now it's a good uh I mean

play13:45

a good practice to observe your data to

play13:47

make sure that there are no other

play13:48

symbols over there remember that we've

play13:50

dealt with this pipe sign right we we

play13:53

were able to remove um this pipe sign

play13:55

from there right now let me show you

play13:57

some of the data over here right um I'm

play14:00

going to the title the movie title

play14:02

column right in the movie title column

play14:04

I'm just going to pick one data over

play14:06

there right any any data at all so in

play14:08

this case I'm going to pick a data at

play14:10

column 8 right so let me run that so you

play14:12

see you see that um this is the data

play14:16

that is The Avengers Age of Ultron this

play14:18

is the movie title over there right at

play14:20

column 8 um here it's not showing

play14:22

because it's um there's a break let okay

play14:25

let me show you the first one which is

play14:26

the zero right let me um change with

play14:30

this instead of eight let me put zero

play14:31

there and then uh let me run it now you

play14:34

see that this is Avatar right that's that's

play14:36

what is there let me go to the you see

play14:39

that is there a that's that's what is

play14:41

there now if you see um this right there

play14:45

is this this this symbol over there

play14:47

which I don't like right this is this

play14:50

symbol over there which I don't like

play14:52

okay I need to remove this because this

play14:54

is a terminating it's a now terminating

play14:56

character I don't I don't want that

play14:58

because it's going to cause a lot of

play15:00

confusion later on when you're working

play15:01

with it okay so I want to strip that off

play15:05

right it's it's present in all of the

play15:07

movie titles right it's so it's it's a

play15:10

good practice to um take time to observe

play15:13

your data and see what is there if I

play15:15

search for something else maybe 24 a

play15:17

data point that's row number 24 let me

play15:19

see you see that's there you see that

play15:21

it's still at the end over there okay

play15:24

it's present in all of them right let me

play15:26

search for something around maybe um

play15:29

say 67 right data point over there and

play15:31

see you see that is there right you see

play15:34

that is there so I mean I want to strip

play15:36

that right I want to strip that it's

play15:37

present in every every data point if I

play15:40

search for say 21 data point at

play15:43

21 you see that is there right so that's

play15:46

that's what I'm going to do in the next

play15:49

um cell okay so what I'm going to do

play15:52

over here is that if I pick the movie

play15:54

title okay what I'm going to do is that

play15:56

I'm going to strip everything at the end

play15:58

this is a negative indexing right

play16:00

starting from the end so whatever is

play16:02

there at the end there I want to strip

play16:04

it okay so I'm using Lambda function for

play16:06

that okay Lambda function for that so

play16:09

the x is actually representing the data

play16:12

right just represent I mean the title

play16:14

right it's going to represent all the

play16:16

title and then the last I mean the from

play16:18

negative one is actually going to give

play16:21

you that right it's going to strip

play16:22

everything every every every character

play16:25

at the end all right there's a negative

play16:27

index and and I'm sure if you've um

play16:29

going through the Python tutorial you

play16:31

understand negative indexing all right

play16:34

now once um this part is done the next

play16:36

thing that you need to do is to check to

play16:38

see if everything is done right so let

play16:40

me run this one okay now I'm going to

play16:44

check the data point at um at at at I

play16:47

mean at at row number one and see what

play16:49

is there now you see that at the end

play16:51

it's not there right if I check for

play16:53

let's see some of them okay 21 it was

play16:56

there right so let me see at 21 21 come

play16:59

on

play17:01

okay now you see that it's not there The

play17:03

Amazing Spider-Man that's that's all you

play17:05

see that over here was Amazing

play17:06

Spider-Man and something else now it's

play17:09

just the Amazing Spider-Man that's the

play17:11

only thing that I want I want these

play17:13

things over there okay now we are good

play17:15

to go for that okay um what I'm going to

play17:18

do is that I'm going to save this data

play17:20

that I've prepared so far this is what

play17:22

I've done right this is what I've done

play17:23

and I'm going to save this data um that

play17:26

I've prepared right this is just one of

play17:27

the data that we're going to use so I'm

play17:30

going to save it and then um I'm going

play17:32

to use data.to_csv that is to save it

play17:35

we're going to use it later on okay um

play17:37

the name that I'm giving is um

play17:40

data_1.csv right you can just give it any name

play17:42

that you want right you can just give it

play17:44

any name that you want so I'm going to

play17:45

run this one over here and then um Let

play17:49

me refresh it over now you see that it's

play17:51

there right data_1.csv I've saved it so

play17:53

I've prepared the first data set and

play17:55

I've saved it right we are good to go

play17:56

for the to prepare the second data set

play17:59

right I told you we using different

play18:00

different data sets to do this right so

play18:04

now um the next tutorial we're going to

play18:06

see how we prepare the second data right

play18:08

so see you in the next tutorial

Related Tags
Data Science, Machine Learning, Python Tutorial, Movie Data, Data Preparation, CSV Files, Pandas Library, Recommendation Engine, Data Cleaning, Tutorial Series