Python Pandas Tutorial 5: Handle Missing Data: fillna, dropna, interpolate
Summary
TLDR: This tutorial explores handling missing data in pandas, a Python library. It demonstrates the fillna, interpolate, and dropna methods on a dataset of NYC weather data with missing values. The video walks through converting string dates to a datetime index, replacing missing values with specified values or with forward/backward filling, and using interpolation for estimates. It also covers the axis and limit parameters of fillna, and inserting missing dates by reindexing against a complete date range.
Takeaways
- Handling missing data in pandas is crucial when working with datasets that have incomplete values.
- The tutorial uses New York City's weather data as an example to demonstrate handling missing data.
- A string date column is converted to a datetime column by passing the 'parse_dates' argument to 'read_csv'.
- A column is set as the DataFrame index with 'set_index'; pass 'inplace=True' to modify the original DataFrame instead of getting a new one back.
- Missing values can be handled using methods like 'fillna', 'interpolate', and 'dropna'.
- 'fillna' can replace all NaN values with a specified value or a dictionary of values for specific columns.
- The 'ffill' method in 'fillna' carries forward the previous day's value to fill missing data.
- 'bfill' is another method in 'fillna' that uses the next day's value to fill missing data.
- Interpolation methods like linear and time can provide better estimates for missing values than a fixed fill value.
- The 'dropna' method drops rows with missing values, with the 'how' and 'thresh' parameters controlling which rows are dropped.
- Missing dates can be inserted into the DataFrame using 'date_range' and 'reindex'.
Q & A
What is the main topic of the tutorial?
-The main topic of the tutorial is how to handle missing data in pandas, a Python library for data analysis.
What kind of data does the CSV file in the tutorial contain?
-The CSV file contains New York City's weather data with some missing values. Some cells are empty, and the rows for 2nd and 3rd January are missing entirely.
What are the three methods covered in the tutorial for dealing with missing data in pandas?
-The three methods covered are fillna, interpolate, and dropna.
Why might the tutorial recommend converting a string column to a date column?
-Converting a string column to a date column allows for better data manipulation and analysis, especially when setting the date as an index for a DataFrame.
What does the fillna method do in pandas?
-The fillna method in pandas is used to replace missing values (NaNs) with a specified value or a method for estimation.
How can you specify different fill values for different columns using the fillna method?
-You can specify different fill values for different columns by passing a dictionary to the fillna method, where the keys are the column names and the values are the fill values.
What does the forward fill method do when dealing with missing data?
-The forward fill method carries forward the value from the previous day's non-missing data to fill in the missing values.
What is the purpose of the 'limit' parameter in the fillna method?
-The 'limit' parameter in the fillna method restricts the number of consecutive NaNs to be filled with the specified fill value.
What is interpolation and how is it used in pandas to handle missing data?
-Interpolation is a method used to estimate intermediate values between two known data points. In pandas, the interpolate method can be used to fill missing values with estimated values based on different interpolation methods like linear, quadratic, or time-based.
How can you drop rows with missing data in pandas?
-You can drop rows with missing data in pandas using the dropna method. You can specify parameters like 'how' to determine if rows with any or all missing values should be dropped, and 'thresh' to define the minimum number of non-NA values required to keep a row.
What is the process of re-indexing in pandas and why might you need to do it?
-Re-indexing in pandas is the process of conforming a DataFrame to a new set of labels for its index. You might need to re-index if you want to insert missing dates or align the data with a complete date range.
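The re-indexing answer above can be sketched as follows. This is a minimal illustration, not the tutorial's own notebook: the year, the column name `temperature`, and the values are assumptions made up for the example. It also reflects the point made in the transcript that `reindex` does not accept `inplace`, so the result is assigned back.

```python
import pandas as pd

# Hypothetical data: rows for 2 and 3 January are absent (year and values are assumed).
df = pd.DataFrame(
    {"temperature": [32.0, 30.0, 28.0]},
    index=pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
)

# Build the complete daily range and conform the DataFrame to it.
full_range = pd.date_range("2017-01-01", "2017-01-05")

# reindex returns a new DataFrame, so assign the result back.
df = df.reindex(full_range)
print(df)
```

The inserted rows contain NaN in every column, so they can then be filled with any of the methods covered in the tutorial.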
Outlines
Handling Missing Data in Pandas
This paragraph introduces the tutorial on managing missing data in pandas, a Python library for data analysis. The script uses a CSV file with New York City's weather data, which contains missing values. The primary focus is on three methods to address these missing values: filling with a specific value, interpolation, and dropping any rows with missing data. The tutorial begins with setting up a Python environment, typically using Jupyter Notebook, and importing the pandas library. It then demonstrates how to read a CSV file into a DataFrame and convert a string column into a date column, setting it as the DataFrame's index.
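A minimal sketch of the loading steps described above. The CSV text is an inline stand-in for the tutorial's file, and the column names (`day`, `temperature`, `windspeed`, `event`), the year, and the values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for the weather CSV; names and values are assumed.
csv_text = """day,temperature,windspeed,event
2017-01-01,32,6,Rain
2017-01-04,,9,Sunny
2017-01-05,28,,Snow
"""

# parse_dates converts the listed columns from strings to datetimes while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])
print(type(df["day"][0]))  # a pandas Timestamp, not a str

# set_index returns a new DataFrame; inplace=True would modify df directly instead.
df = df.set_index("day")
print(df)
```

Empty CSV cells are read as NaN, which is what the fill, interpolate, and drop methods operate on.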
Customizing Fill Values for Missing Data
The second paragraph discusses advanced techniques for filling in missing data with more accurate guesses. It explains how to use the `fillna` method with a dictionary to specify different fill values for different columns. The example given replaces missing temperature and wind speed values with zero, but uses 'no event' for the event column. The paragraph also touches on the limitations of using zero as a fill value and introduces the concept of forward filling, which carries the previous day's value to fill missing data points.
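The per-column fill described above might look like the sketch below. The DataFrame is hypothetical (column names, year, and values are assumed); only the use of `fillna` with a scalar or a dictionary comes from the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical weather rows with gaps; names and values are assumed.
df = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, 28.0],
        "windspeed": [6.0, 9.0, np.nan],
        "event": ["Rain", np.nan, "Snow"],
    },
    index=pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
)

# A single scalar fills every NaN in the selected (numeric) columns.
numeric_zero = df[["temperature", "windspeed"]].fillna(0)

# A dictionary maps each column name to its own fill value.
per_column = df.fillna({"temperature": 0, "windspeed": 0, "event": "no event"})
print(per_column)
```

Both calls return new DataFrames and leave `df` unchanged. A zero temperature still distorts statistics such as the mean, which is the limitation the paragraph mentions.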
Forward and Backward Filling with Limitations
This section delves deeper into the forward fill method, explaining how it can be used to propagate values from the previous day. It also introduces the backward fill method, which does the opposite by copying values from the next day. The paragraph highlights the potential issues with these methods, such as incorrect data representation, and discusses the 'axis' parameter, which allows for horizontal filling. Additionally, it introduces the 'limit' parameter, which can restrict the number of times a value is propagated to handle missing data points.
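A small sketch of forward fill, backward fill, and the limit parameter. The series is hypothetical (year and most values are assumed; only the 32 on 7 January followed by two gaps mirrors the video). The video passes `method="ffill"` to `fillna`; recent pandas releases deprecate that argument in favor of the dedicated `ffill()` and `bfill()` methods, so the sketch uses those.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures with a two-day gap on 8 and 9 January.
days = pd.to_datetime(
    ["2017-01-06", "2017-01-07", "2017-01-08", "2017-01-09", "2017-01-10"]
)
temperature = pd.Series([31.0, 32.0, np.nan, np.nan, 34.0], index=days)

forward = temperature.ffill()               # previous valid value fills both gaps
backward = temperature.bfill()              # next valid value fills both gaps
forward_once = temperature.ffill(limit=1)   # fills 8 January only, 9 January stays NaN

print(pd.DataFrame({"ffill": forward, "bfill": backward, "ffill_limit_1": forward_once}))
```

On a DataFrame, the same methods accept `axis="columns"` to fill across a row instead of down a column.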
Interpolation for Estimating Missing Values
The fourth paragraph covers the interpolation method in pandas, which provides a more sophisticated way to estimate missing values. It describes linear interpolation and how it can be used to fill in missing temperature values by calculating intermediate values based on surrounding data points. The tutorial also mentions other interpolation methods such as quadratic, cubic, and piecewise polynomial, and introduces the 'time' method, which takes the date into account for a more accurate estimate.
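The difference between linear and time-based interpolation can be sketched with the three temperatures mentioned in the video (32 on 1 January, missing on 4 January, 28 on 5 January). The year is an assumption.

```python
import numpy as np
import pandas as pd

days = pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"])
temperature = pd.Series([32.0, np.nan, 28.0], index=days)

# Default method is linear: rows are treated as equally spaced,
# so the gap gets the midpoint of 32 and 28.
linear = temperature.interpolate()

# method="time" weights by the actual dates (requires a DatetimeIndex):
# 4 January is 3 of the 4 days from 1 to 5 January, so 32 + (28 - 32) * 3/4.
by_time = temperature.interpolate(method="time")

print(linear.iloc[1], by_time.iloc[1])
```

The linear estimate is 30, while the time-based estimate is 29, closer to the value on 5 January.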
Dropping Rows with Missing Data
This paragraph discusses the use of the `dropna` method to remove rows that contain missing values. By default it drops a row if any value is missing. The `how` parameter specifies whether to drop rows with any missing value or only rows where all values are missing. The `thresh` parameter keeps only rows that have at least the given number of non-missing values; the index is not counted. The paragraph concludes with a method for inserting missing dates into the DataFrame by reindexing with a complete date range.
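The dropping rules described above can be sketched on a small hypothetical DataFrame (column names, year, and values are assumed for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one with a single value, one empty, one with two values.
df = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, np.nan, 28.0],
        "windspeed": [6.0, 7.0, np.nan, np.nan],
        "event": ["Rain", np.nan, np.nan, "Snow"],
    },
    index=pd.to_datetime(["2017-01-05", "2017-01-06", "2017-01-09", "2017-01-10"]),
)

complete = df.dropna()               # default how="any": keep only fully populated rows
not_empty = df.dropna(how="all")     # drop a row only if every column is NaN
at_least_one = df.dropna(thresh=1)   # keep rows with at least 1 non-NaN value
at_least_two = df.dropna(thresh=2)   # keep rows with at least 2 non-NaN values

print(len(complete), len(not_empty), len(at_least_one), len(at_least_two))
```

The date index does not count toward `thresh`, so the row with only a wind speed survives `thresh=1` but not `thresh=2`.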
Keywords
Pandas
Missing Data
Fill NA
Interpolate
Drop NA
CSV File
DataFrame
Index
Reindex
Linear Interpolation
Time-based Interpolation
Highlights
Tutorial overview on handling missing data in pandas.
Introduction to a CSV file with missing values in NYC's weather data.
Explanation of three primary methods to deal with missing values: fillna, interpolate, and dropna.
Starting a new Jupyter Notebook for the tutorial.
Importing pandas and reading the CSV file into a DataFrame.
Converting the 'day' column to a date column using pandas' `parse_dates` argument.
Setting the 'day' column as the DataFrame index.
Using `fillna()` to replace NaN values with a specified value.
Demonstration of replacing NaN values with different values for different columns.
Using forward fill to carry forward the previous day's value for missing data.
Explaining backward fill to copy the next day's value for missing data.
Introduction of the `axis` parameter to control the direction of value filling.
Utilizing the `limit` parameter to restrict the number of times a value is carried forward.
Interpolation method to estimate missing values based on surrounding data points.
Different interpolation methods available in pandas such as linear, quadratic, and time.
Using `dropna()` to remove rows with missing values based on different conditions.
Parameter `how` to specify dropping rows with all NaN values or at least one NaN.
Parameter `thresh` to determine the minimum number of non-NaN values required to keep a row.
Method to insert missing dates into the DataFrame by reindexing.
Error handling during reindexing and the correct approach to resolve it.
Conclusion of the tutorial with a teaser for the next part focusing on additional techniques for handling missing data.
Transcripts
In this tutorial we are going
to look at how to handle missing data in
pandas. Often, when you are
downloading data from the internet or, let's
say, getting it from any other source, it
might have missing values, as shown in
this CSV file. This file contains New
York City's weather data, and you can see
that some of these cells do not have
any value in them. It is also missing the
data for 2nd and 3rd January. So when
you're processing this kind of
information in pandas, we will see how
you can deal with these missing values
using the fillna, interpolate, and dropna
methods. I have more tutorials on how to
handle missing data, but this is just a
start and we are only covering these
three methods. So, as usual, I'm going
to start my Jupyter Notebook. If you
don't know what Jupyter Notebook is, I
have a separate tutorial on it, but you
can also use any IDE of your choice, such
as PyCharm or Notepad++,
whatever you prefer.
I like Jupyter Notebook because it is
great with data visualization. So I'm
going to click on New and start a new
Python notebook, and the first thing we
do, as usual, is import pandas as pd, and
then I will read the CSV file that I
just showed you and print the data
frame. The star that you were seeing here
means it was processing it. So it read
this CSV file successfully into a data
frame. Now, for the purpose of this
tutorial, I want to make my day column a date
column, so let me show you what I mean by
that. When
you normally read a CSV like this,
what it's going to do is
read day as a string column. You can see
it is a string, so whatever you are
seeing here is nothing but a
string. It's not an Excel file, okay, it's
a CSV file. So I want to first convert
that column into a date column, and for
doing that you have to use the parse_dates
argument, and in that you can say,
parse the day column as a date type. Okay, and
when you do that, let's first print it.
You can see that it converted it. Now, by
looking at it you probably cannot figure
out the type, so what I usually do is
just check it, so you can see that now the type is
Timestamp. Okay, so we're good. All right,
so I got day as a datetime column. Now I
want to make this an index for my data
frame, and in order to do that you can
just say df.set_index, day as your
index, and inplace equal to True.
Remember, you have to do inplace equal
to True, otherwise it's not going to modify
the original data frame but instead
will return a new data frame. Okay, and
when you do that, you get day as your index.
Now, if you have NaN values and if you
are processing this information, then you
have to do special handling. You have to
check, like, if value equal to NaN, then do
the special thing. Often it makes
sense to replace these NaN values with
some meaningful value or a guess. So
in this case, let's say I want to replace
all NaN values with some other value.
So the first method that we are
going to cover
is fillna. So what you can do is
df.fillna, and in brackets
you can pass the value that you want to
replace NaN with. And I'm not going
to modify my original data frame but
instead get this back into a new data
frame. When I run it, you can see that
all the NaN values that it had,
it replaced them with zero. You can
see that everything that
was NaN is zero now. Okay, so this is good.
Now, sometimes zero is probably not
the best guess, so you want to come up
with a better guess. For example,
here in the case of event, what does zero
mean, right? So maybe you want to use
fillna but you don't want to fill the entire
data frame with this value; maybe you
want to specify different values for
different columns. pandas
supports that also. The way you do it
is, again, I am going to receive a new
data frame, and inside the fillna method
you can now pass a dictionary. Now,
what does this dictionary contain? The
dictionary contains the names of the columns.
In the temperature column, let's say
I want to replace all NaN values with
zero, and in my day, not day but wind
speed column, I want to replace them with
zero again, but for my event I want to say no
event.
And then print the new data frame. As you
can see here, the temperature and wind
speed are replaced with zero,
but for the event I now have no
event. So you can just use this
dictionary to fill specific values for a
specific column. But still I am not happy
with how I handled missing values here,
because, see, if you are calculating a
mean or something for this temperature,
then the mean is going to come out really horrible,
and if someone looks at the data he'll think,
okay, on 1st January the temperature was
32 and on the second January it
was zero Fahrenheit, right? Someone
might think the temperature went
down by so much, but in reality we
actually don't know what the
temperature was, and all we are trying to do
is come up with some estimate. So
then the other way of getting a better
estimate would be just to carry forward
the temperature on 1st January here.
So whatever the temperature was on the
previous day, you carry it forward, and you
do it in a similar way for the other two
data types. For that you can use,
again, your fillna method, but
here what you will do is use method
equal to forward fill. Forward fill you
can specify by typing ffill. ffill
means if I have a NaN value, then just
carry forward the previous day's value. Okay,
so
let's print that. Okay, cool.
Now you can see that it just carried
forward the value from the previous day.
So 4th January had a NaN value, but now
it carried forward 1st January's
value here, so this looks a little better
than just having a zero value. Same
thing on 9th January: I didn't have any
event, so you look at 9th January, now it is
Sunny because you carried forward the previous
day's value. You can also use backward
fill, meaning carry forward the next day's
value. It's not really carry forward, but
instead of copying the
previous day's value you're copying the next
day's value. So if you do that, what's
going to happen is now 4th January has a
value from 5th January, so now it copied the
value from the 5th to the 4th.
Okay, so you can use the bfill method also.
Now if you go to the pandas documentation,
you can just Google pandas fillna,
it's going to show the documentation for
fillna, and you can see that we used
bfill and ffill. You can also
use pad or the whole word backfill.
So you can use all of that. You also have
this other argument called
axis, so let's see what axis can do for
us. So here if I say
axis, okay, axis equal to columns. When
you do axis equal to columns, what it is
doing now is, let me open this CSV file
here. Previously, when we were
using backfill, it was copying values
vertically, like it would go vertically
and copy the value from here to here. But now
with axis equal to columns it's copying
values horizontally, so it's going row by
row and copying the value from the previous cell.
So look here: it was 9
and it copied that 9 into
temperature. You can see this 9 is
copied here. Then the Snow was copied
here, so you can see this was Snow and
this is also Snow now. So based
on what kind of data you are dealing
with, you can copy it either horizontally
or vertically. Now, if you check the
documentation of fillna, it has another
interesting property or argument called
limit, so let me show you what limit can
do for you. Here I am going to replace
this with forward fill and just
show you. When you have forward fill,
let's say in the case of 7th January, I had
32 and it will just copy 32 to both of
these missing data points. Now let's
say, for some reason, I want to carry
this value forward only once. So I
want to copy it only here but not here.
In that case you can specify limit, and
you can say my limit is 1 as far as
copying my valid value to missing values
is concerned. When you run this, you
can see that now 7th January's value was 32,
it copied that to the 8th, but the 9th still has
NaN because my limit is one; I can copy
it only once.
Same thing here: 6th January wind speed was
7 miles per hour and it copied it to the
seventh, so you can see that 7th January
now also has that value, but 8th and 9th
January have NaN.
If you change the limit to be 2, you will
notice that this 7 is copied here 2
times, right, 7 and 7, but my 9th January is
still NaN. So this is how you can use
your limit parameter.
Okay, now I'm still not happy with the
guess that I'm making, because if the
temperature on 1st January was 32 and on the
fifth it was 28, it is likely that the
temperature on the 4th was in between.
I mean, it's not always guaranteed, but
that's something you would consider a
better guess. So we have a method
called interpolate in pandas. Let me
just create a new cell, and by the way, I
am using the shortcuts; you can
access all the shortcuts here.
So when you say insert cell below, the
shortcut is B, so that's what I'm using.
Here I'm pressing B and it's creating
a new cell for me.
So here, df.interpolate.
When you do df.interpolate, it's
going to interpolate the values. If you
look at your new data frame here, you will
notice that now for 4th January it
came up with a better guess, which is a
linear interpolation. If you have
studied linear interpolation, you
basically will come up with this
value, 30. So it was 32 and 28, and you're
gradually transitioning and having
this intermediate data point. So
this is probably a better guess.
And it did the same thing for these
two cells also. You can see that it was 32 and 34,
and here it is 32.66 and 33.33, so
it's somehow coming up with these values.
It's using
interpolation, linear interpolation, and
coming up with these values.
So again, I'm going to go ahead and check
the documentation for interpolate. In the
search bar you can type in interpolate
and look at the DataFrame.interpolate
documentation, and you will notice that
for method, if you don't specify
anything, it is linear by default, but you
can use so many other methods. You can
use quadratic, cubic, and piecewise
polynomial; there are so many methods to
specify as far as your interpolation is
concerned. So I'm going to use time
now, so let's see what time can do for us.
Before we do that, you will see
that using linear interpolation it came
up with the middle value. Between 32 and 28
the middle value is 30, but look at the
date. The date is not in the middle;
it is nearer to fifth
January. I'm missing second and
third January, so 30 still doesn't look
like the best guess; it should be
relatively near to this value, 28. So when
you use method equal to time, you can see
that now it came up with the value 29,
because now it is also considering
this date in coming up with this
value. It is realizing that 4th January
is near to the fifth, hence the value should
not be exactly in the middle but it should be
nearer to this value. So this
feature I found to be pretty powerful
whenever you are making a guess or
estimate for missing
values. Okay, so far so good.
Sometimes, based on the situation, I just
want to, let's say, drop all the rows with
NaN values. In that case you can use this
method called dropna. So you can say
df.dropna, and I'm just printing the
new data frame. You can see that in my
Excel sheet, whichever row had any NaN
value, it dropped all of them, so now
I got only three rows, which have valid
content in all of the columns.
Sometimes you want to drop the row if it
has at least one NaN.
So here what it is doing is
exactly that: if you
have at least one NaN, it is dropping the row.
But let's say I want to drop only if it
has all NaN. For example, I want to
drop this row but I still want to
preserve these rows because they have at
least some data. For that you can
use the how parameter, and you can say how
equal to all. Now you don't see 9th
January here in this data frame because
it had all the values as NaN. It has
this date, but this date is an index, so it
is not considering
that in the process of dropping. And
these rows have some NaN
cells, but not everything is NaN, so it's
not dropping them. Now, what if I
want to go by non-NaN values? Let's say
I want to say that if I have at least
one non-NaN value, then keep that row and
drop the other rows. For that you can
use the thresh parameter. You can say
thresh equal to one. Thresh equal to
one means if I have at least one non-NaN
value, then keep the row. When you
run that, see what happens: again the
same result. 9th January got dropped
because it doesn't have any valid value;
everything was NaN.
It kept the 6th
January row because it has at least
one valid value. So if I change thresh
to be two, what does it mean? All right, so
let's run this.
Now when I set thresh equal to two,
it dropped this particular row. You can see
it dropped that particular row because
two means I need two valid values in
order to keep the row, but I don't have
two valid values; I have only one value.
The date is not counted because it is the
index. So if I have one value, I'm
going to drop it. So you can use
thresh to drive your dropping
process by the number of valid values that
you have. Okay, the last thing that we want to
cover is how you go about inserting
the missing dates. I don't have 2nd
and 3rd January here, and I want to, let's
say, insert those dates. For that you
will do something like this. Here you
will create a date range, and using the
date range, let's say I have a date range
from 1st January to 11th January. So from 1st
January to 11th January I created a date
range, so this is your date range, and you
pass that to DatetimeIndex and create
a datetime index, and then you do
reindexing on your data frame. So I'm
saying df.reindex using that index,
and then you print your data frame. Again,
you have to do inplace equal to True.
Okay, I'm getting some error here:
reindex got an unexpected keyword argument.
Okay, so this is unexpected,
so let's see what's going on here.
It looks like reindex is not accepting
inplace as a valid argument, so what I
have to do is df equal to df.reindex,
and when executed you'll see that
I got the 2nd and 3rd January rows. Now I
have NaN values, but again you can use
one of the fillna methods to fill
them with some estimated values. Okay, so
that's all we had for this tutorial. In
the next tutorial we will continue on
how to handle missing data using some
other techniques. Until then, thank
you very much for watching, and if you
liked this tutorial, please don't forget
to give it a thumbs up below. Okay, bye.