Dataframes Part 02 - 02/03
Summary
TL;DR: The video script offers an in-depth tutorial on data manipulation using Python's pandas library. It covers essential operations like subsetting data frames, analyzing data with 'describe', handling missing values with 'fillna' and 'dropna', and removing duplicates. The script also explains merging and joining data frames akin to SQL, and introduces 'groupby' for aggregate operations. It's a comprehensive guide for data analysis, emphasizing practical applications.
Takeaways
- 📊 Use the `describe` function to get a statistical summary of a DataFrame, including mean, standard deviation, min, max, and quartiles.
- 🔍 The `value_counts` function helps to determine the frequency of unique values in a column.
- 📈 To see proportions instead of counts, pass `normalize=True` to `value_counts`, which divides each count by the total number of values in the column.
- 🗂️ Access DataFrame columns using the dot notation (e.g., `DataFrame.column_name`) or by bracket notation (e.g., `DataFrame['column_name']`).
- 🔄 The `drop_duplicates` function is used to remove duplicate rows from a DataFrame.
- 🔗 `merge` and `join` functions are used to combine two DataFrames based on a common column.
- 🔄 `sort_values` sorts the DataFrame based on the values of a specified column.
- 🚫 `dropna` removes rows with missing values, which is useful for data cleaning.
- 🔄 `fillna` fills missing values with a specified value, which is another method for data cleaning.
- 🔑 `rename` allows you to change the names of the columns in a DataFrame.
- 👥 `groupby` performs operations on groups of data, similar to SQL group by, and can calculate mean, max, median, count, etc., for each group.
Q & A
What is the primary purpose of using the 'describe' function in a DataFrame?
-The 'describe' function in a DataFrame is used to get a statistical summary of the dataset. It provides information such as the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each column.
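For example, a minimal sketch with made-up data (the column name is illustrative, not from the video's data set):

```python
import pandas as pd

# A small numeric frame; describe() summarizes each numeric column
df = pd.DataFrame({"age": [25, 32, 47, 51, 38]})
summary = df.describe()  # rows: count, mean, std, min, 25%, 50%, 75%, max

print(summary.loc["mean", "age"])  # average age
```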
How can you access specific columns in a DataFrame?
-You can access specific columns in a DataFrame using bracket notation (e.g., `df['column_name']`), or using dot notation if the column name is a valid Python identifier (e.g., `df.column_name`).
What does the 'value_counts' function do in pandas?
-The 'value_counts' function in pandas is used to get the count of unique values in a column. It returns a Series containing counts of unique values sorted in descending order.
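A short sketch, using invented values similar to the video's example of a column containing two fours:

```python
import pandas as pd

s = pd.Series([4, 4, 7, 1])
counts = s.value_counts()                 # frequency of each unique value
shares = s.value_counts(normalize=True)   # proportions instead of raw counts
```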
How can you identify and handle missing values in a DataFrame?
-Missing values in a DataFrame can be identified using the `isna()` or `isnull()` functions. To handle missing values, you can use the `fillna()` function to fill them with a specified value or `dropna()` to remove rows or columns containing missing values.
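A minimal sketch of the three steps (the column names and the `-1` sentinel are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, np.nan, 3], "letter": ["a", "b", None]})

n_missing = df.isna().sum()   # count of missing values per column
filled = df.fillna(-1)        # replace every NaN with a sentinel value
cleaned = df.dropna()         # keep only rows with no missing values
```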
What is the difference between 'merge' and 'join' in pandas?
-Both 'merge' and 'join' are used to combine two DataFrames in pandas. The difference lies in the syntax and the default behavior. 'Merge' is more flexible and lets you name the key columns on each side and specify the join type (inner, left, right, outer), while 'join' is called on one DataFrame and by default matches against the index of the other DataFrame.
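A sketch of both approaches, with invented frames whose key columns have different names on each side (as in the video, where the columns were renamed before merging):

```python
import pandas as pd

left = pd.DataFrame({"col1": [1, 2, 4], "letter": ["a", "b", "c"]})
right = pd.DataFrame({"col3": [2, 4, 5], "letter3": ["x", "y", "z"]})

# merge: key columns named explicitly on each side, join type via `how`
merged = pd.merge(left, right, left_on="col1", right_on="col3", how="inner")

# join: matches left's "col1" values against the other frame's index
joined = left.join(right.set_index("col3"), on="col1")
```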
How can you drop duplicates from a DataFrame?
-Duplicates in a DataFrame can be dropped using the `drop_duplicates()` function. You can specify a subset of columns to check for duplicates if needed. By default, it keeps the first occurrence and drops the rest.
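A minimal sketch, mirroring the video's example of a column containing the value 4 twice:

```python
import pandas as pd

df = pd.DataFrame({"col1": [4, 4, 7], "letter": ["a", "b", "c"]})

full = df.drop_duplicates()                  # whole rows are unique: nothing dropped
by_col = df.drop_duplicates(subset="col1")   # keeps the first 4, drops the second
last = df.drop_duplicates(subset="col1", keep="last")  # keeps the last 4 instead
```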
What does the 'groupby' function do in pandas?
-The 'groupby' function in pandas is used to group rows that have the same value in specified columns and then apply a specified function to each group. It is useful for performing operations on grouped data, such as calculating aggregates.
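A sketch in the spirit of the video's cholesterol example; the data and column names here are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "high_chol": [True, True, False, False],
    "age": [60, 55, 40, 45],
})

# mean age within each group of the boolean flag
means = df.groupby("high_chol")["age"].mean()
```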
How can you rename columns in a DataFrame?
-Columns in a DataFrame can be renamed using the `rename` function. You provide a mapping of old column names to new column names. You can choose to modify the DataFrame in place or create a new DataFrame with the renamed columns.
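For instance (the new column name follows the video's example of giving `s1` a meaningful name):

```python
import pandas as pd

df = pd.DataFrame({"s1": [1, 2]})

# rename returns a new frame; the original is untouched
renamed = df.rename(columns={"s1": "total_cholesterol"})
```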
What is the significance of the 'inplace' parameter in pandas functions?
-The 'inplace' parameter in pandas functions determines whether the operation should modify the original DataFrame or return a new DataFrame with the changes. If set to True, the original DataFrame is modified; if False, a new DataFrame is returned.
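A minimal sketch of the difference, using `rename` as the example operation:

```python
import pandas as pd

df = pd.DataFrame({"s1": [1, 2]})

df.rename(columns={"s1": "tc"})                # returns a new frame; df unchanged
df.rename(columns={"s1": "tc"}, inplace=True)  # modifies df itself, returns None
```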
How can you sort a DataFrame based on specific column values?
-You can sort a DataFrame based on specific column values using the `sort_values()` function. You specify the column name and the sorting order (ascending or descending). This function can also be used to sort by multiple columns.
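A short sketch, including the `sort_index` trick from the video for restoring the original row order:

```python
import pandas as pd

df = pd.DataFrame({"age": [47, 25, 38]})

by_age = df.sort_values("age")    # ascending by default
restored = by_age.sort_index()    # back to the original 0, 1, 2 order
```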
What does the 'drop' function do in pandas?
-The 'drop' function in pandas is used to remove specified index or column labels from a DataFrame. It can be used to drop rows or columns based on labels or conditions.
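A minimal sketch with invented labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

no_b = df.drop(columns="b")   # drop a column by label
no_row0 = df.drop(index=0)    # drop a row by index label
```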
Outlines
📊 Data Frame Exploration and Description
The paragraph discusses methods for working with large data frames, especially when dealing with millions of entries. It introduces the 'describe' function to quickly summarize data, providing insights into statistical measures such as mean, standard deviation, minimum, median, Q1, Q3, and maximum. The script also covers the importance of understanding data distribution and how to access and interpret this information. Additionally, it touches on the concept of data standardization and its impact on mean values, as well as the use of 'value_counts' to understand the frequency of values within a column.
🔄 Data Frame Manipulation: Dropping Duplicates and Merging
This section explains how to handle duplicates in a data frame using the 'drop_duplicates' method. It illustrates the process of identifying and removing duplicate entries, either based on entire rows or specific columns. The paragraph also delves into the concepts of merging and joining data frames, akin to SQL operations, using 'pd.merge'. It discusses the importance of renaming columns to ensure compatibility during merges and the various join types available, such as left join, right join, inner join, and how they affect the resulting data frame.
📈 Sorting and Handling Missing Values
The script covers the process of sorting data frames based on specific columns, using 'sort_values' to arrange data in ascending or descending order. It also addresses the challenge of handling missing values, or 'NaN', within data frames. The paragraph explains how to identify missing values using 'isna' or 'isnull', and discusses strategies for dealing with them, such as filling missing values with a specified number or dropping rows/columns that contain missing values. The importance of data cleaning in preparing data for analysis is emphasized.
🔄 Data Frame Modifications: Renaming and Grouping
This part of the script focuses on modifying data frames by renaming columns to improve clarity and understanding of the data. It introduces the 'rename' method and demonstrates how to map old column names to new ones. Additionally, the paragraph explores the 'groupby' function, which allows for the aggregation of data based on certain criteria. It provides examples of how to group data and perform operations such as counting, summing, finding the mean, median, or maximum within each group. The concept of creating new columns based on conditions and then grouping by these new categories is also discussed.
📋 Accessing and Analyzing Data Frame Columns
The final paragraph discusses accessing specific columns within a data frame and analyzing them individually. It explains how to use column names to extract data and perform operations such as calculating the mean, maximum, or median for that column. The script also touches on the use of 'groupby' in conjunction with accessing columns to perform more complex analyses, such as comparing the average age of groups with high versus low cholesterol levels. The paragraph concludes by emphasizing the versatility of data frame operations for in-depth data analysis.
Keywords
💡DataFrame
💡Describe
💡Value Counts
💡Merge
💡Join
💡Drop Duplicates
💡Normalize
💡Group By
💡Missing Values
💡Rename
💡Data Cleaning
Highlights
Data frames with millions of rows can be challenging to work with, so sub-selecting a sample is often necessary.
The 'describe' function provides a quick statistical summary of a data frame's columns.
Mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum are included in the 'describe' function's output.
Value counts can be obtained using the 'value_counts' function to see the frequency of each unique value in a column.
Passing `normalize=True` to 'value_counts' shows the proportion of each value in a column instead of the raw counts.
Accessing specific columns in a data frame can be done using bracket notation or, for valid identifiers, dot (attribute) notation.
The 'drop_duplicates' function is used to remove duplicate rows from a data frame.
Merging and joining data frames can be done using the 'pd.merge' function in pandas.
Renaming columns in a data frame can be achieved using the 'rename' function with a mapping dictionary.
The 'groupby' function allows for complex data aggregation similar to SQL's GROUP BY clause.
Missing values can be checked using the 'isna' function, which returns a boolean for each value indicating if it's null.
Filling missing values can be done with the 'fillna' function, which replaces NaN values with a specified value.
Dropping rows with missing values can be performed using the 'dropna' function.
Sorting values in a data frame can be achieved with the 'sort_values' function, which orders the data based on column values.
The 'info' function provides a concise summary of the data frame's columns, including data types and null values.
Grouping by multiple columns can be done using the 'groupby' function with a list of column names.
Aggregating data within groups can be performed using various functions such as 'mean', 'median', 'max', and 'count'.
The 'describe' function can be used to get a quick summary of the data frame's columns, including mean, max, and count.
Transcripts
Sometimes you will work with data frames that have millions of rows, so maybe you just want to work on a sample of the data frame, and you will sub-select a sample. There are different operations we can do on a data set and I will go through them now; there are quite a lot of functions.

The first one is describe. I do it on df_diabetes because it is a bit nicer, there is more stuff to describe: you don't really know how that data frame looks, right? A small one you can see in one view, but the big one you can't, so you would like to get some information about it. And here you see I have all my columns, and for each column I get some description: the count, the mean, the standard deviation, the minimum, the 25th percentile, the median, the 75th percentile (Q3) and the max. The count is the same for every column because I have the same number of rows in each; then I have the mean of my age, of sex, etc. Take into account that this data set has been standardized, that is why the age values don't really make sense: all the means should be around zero, and that is more or less what we see here. And S1, if you look at the description of the data set, is the serum cholesterol blood level, and there are some other variables like it.

If I go to my df_restaurants and call describe, I see something a bit different: it only describes the columns that are numerical. For a column of strings, it can still tell me how many values I have.
There is also something very nice called value_counts. If I go to my df_restaurants and do value_counts on the restaurant column, it will tell me restaurant one has one value, restaurant two has two values, etc. Let's say I go to my df_dict: I know that in col1 I have the value four twice, so if I take col1 and do value_counts, I see there are two values equal to four, etc. Another quite nice thing is to pass normalize=True; then it shows proportions, so each count divided by the total. Here I see that 28% of the values in my col1 are four and the rest are around 14% each.

You also see that I can access my col1 with bracket notation, like df_dict['col1'], or with dot notation, like df_dict.col1, a bit like for a dictionary. Remember, with objects we can use the dot to access attributes; a column is an attribute of the data frame object, and that is why this dot notation works.
Then the second function is called drop_duplicates. Let's say I might have duplicates: I can do df_dict.drop_duplicates(), and then I ask, is the result the same shape as before? If, after dropping duplicates, the frame is the same as before, it means there were no duplicates. That checks whole rows, but it could be that you want to check for duplicates in one column only. What you can do is pass a subset, so it only looks in that column. In df_dict I know I have twice the same value in col1, the two fours. If I do drop_duplicates without arguments, I won't drop anything, because every (col1, letter) pair is unique; but if I only look at col1, I have two fours, so with subset on col1 I will drop one of those two rows. Which one do I drop? By default it keeps the first; with keep='last' it keeps the last, etc. So this is how drop_duplicates works.

And remember we have different kinds of functions: some functions modify the object itself and some return something new, the same as when we were removing items from lists, where sometimes a new list is returned and sometimes the list itself is modified. drop_duplicates returns a new data frame, which is what we see when we just execute it, so we need to save the result somewhere. Now I get df_new_dict, which doesn't have duplicates, and if I check its shape it has one row less: before I had seven rows, and after dropping duplicates based on col1 I have six, because there was exactly one duplicate.
Then there is something called merge, and join; merge and join are a bit similar. Maybe you remember from some SQL classes that we can merge and join tables, so how does that work here?

Let's say I take my df_new_dict and I just rename the columns: my columns are col1 and letter, and I rename them col3 and letter3, just to have new column names. So I have this new frame, and I still have my df_dict, and what I would like to do is merge them. I can do pd.merge(df_dict, df_new_dict, left_on='col1', right_on='col3'). So on the left I have df_dict, on the right I have df_new_dict, and I merge on this key: on the left data frame it has the name col1 and on the right data frame it has the name col3. If the column had the same name on both sides, for instance if I hadn't renamed it, I could just use on='col1'.

So what do I get from my merge? What is the shape? I get a data frame of shape seven. How does that work? I have one, two, three, four, four: on my four, this value will be matched several times. Then I can specify how: I could do a left join with how='left' and I get this; if I do a right join it looks like this, which here would be the same; and I could do an inner join with how='inner', which also works. So there is always a way of doing your inner join, left join, and so on. You could also use join, which doesn't work quite the same way: you call it on one data frame, df_dict.join(...), and it does a sort of join with the other frame's index. So this is also a possibility: either you do pd.merge and pass both frames, or you call .merge (or .join) on one frame and pass the other one.
Then something quite nice is to sort the values. Let's say I have my df_diabetes and I want to sort it by age: I do sort_values('age'), and the rows will be ordered from the smaller ages to the bigger ages. I can also pass a list of columns, that is possible. And if I want to go back to my initial data frame, I can sort my index with sort_index, and then I have zero, one, two, three, four, five again. So either I sort the values or I sort by the index numbers I have here. And to_frame we already saw: it is what enables us to go from a Series to a DataFrame, which is also quite practical.
And then it happens sometimes that we get NaN. Sometimes in the data frame you will have NaN, and you can check for it with isna followed by sum. Let's say I take my df_dict and put some NaN in it: I build a new dict, put np.nan in it... ah, I have to import numpy as np first. Okay, so now I have my new df_dict which has a NaN in it, an empty value. So I can check with isna: isna, for each value of my data frame, tells me whether it is null or not. If I use the tilde, ~df.isna(), I get the contrary: everything that was False is now True and everything that was True is now False.

And if I do .isna().sum(), I get the number of null values per column in my df_dict; here we see I have one letter that is null and zero null values in my col1 column. Then I can fill my values. This matters when we do data cleaning: I can do a fillna, let's say I want to fill the NaN with minus one. If I do fillna(-1), it goes into my data frame and fills my missing values with minus one. This is quite practical, because sometimes you have missing data: a missing input that has some meaning, and you either want to remove it or fill it. So this is a way of doing data cleaning: you do this fillna and you fill with some value that you think is right. fillna returns a new data frame, so you can save that new data frame, or you can use the argument inplace=True, and then it will not return anything, it will directly modify the object; df_dict itself is then modified, and if you don't put inplace=True, df_dict won't be modified.

Then there is another option called dropna. If you do dropna, it will drop all the rows that have null values, so this is the other way, if you like, of handling such data, and you can also pass a subset etc., to say: I only want to drop rows where this particular column is null. So you have dropna and fillna; these are two functions to clean your data: you check which values are null and then you decide whether you drop them or what you do with them.
So we've seen that, and then there is another very practical thing called rename, with which we can just rename our columns. For instance, in my df_diabetes I want to do a rename: I call rename and provide a map on columns, from the old name to the new name. Let's say the old name is s1, and s1 has no meaning for me, so I map s1 to, according to the documentation, total cholesterol. It returns me a new data frame and the original column is not changed: either I save the result as a new data frame, or I use inplace=True. Before doing that, if I call my df_diabetes, it is still the same, it still says s1; but if I do inplace=True, then my df_diabetes has changed, the column in my data frame has been renamed. That is what inplace=True means: I am changing the data frame itself.

So we have this rename, we have drop so we can drop some values, we can dropna, we can drop the duplicates, and we also have info: I can call .info() and I get information like the types. You have age, you have its type, you know it is a float, and you have the null and non-null counts; here you see it is a nice data frame, everything is float, every column is fully non-null, there are no null values, no problem with this. So that is it for the main description of the data set.
Then there is also something very practical called groupby. groupby is a bit the same as the GROUP BY you have in, let's say, SQL; it works a bit the same way. Let's say I have my df_dict; remember it looks like this, I have two times four. I can group by my col1: df_dict.groupby('col1'), and then I put an operation I want to perform, for example count: df_dict.groupby('col1').count(). It is a bit like my value_counts: I take col1 and for each of its values I do an operation. If the other columns were numeric I could also do mean and so on, but it doesn't work here because I don't have numeric data for that, so instead I will go to the diabetes data.

I take my df_diabetes and look at the total cholesterol; I check whether it is greater than zero, and I create a new column in my data set, say high_cholesterol: if I have more than zero of (standardized) cholesterol, it means I am above the mean, so I have high cholesterol. Then I can do df_diabetes.groupby('high_cholesterol').mean(), and I see that for the people with high cholesterol, the age on average is bigger than for the people with low cholesterol. For sex I just have something like one and zero, so it doesn't mean much there. So it works, and you see different things; I think the main finding here is that people who have high cholesterol are a bit older, the age is greater on average than for the others.

I could also look at the median, or the max; I can look at the mean in each column, or the max, or I could look at the count, how many rows there are: you see I have about 240 people with below-zero cholesterol and 200 people with more, meaning people, or rows, in my data frame. So this is how it works with groupby, and you can also group by several things: you could group by high cholesterol and, let's say, a second flag like high_s2, so you create a column high_s2 from s2 and group by both. It is a list we need to provide, so we use the brackets: groupby(['high_cholesterol', 'high_s2']). Here you see we have four different possibilities: not high cholesterol and not high s2, and so on, and it counts the number of people, or gets the mean of the age in each group. We see the False-and-False group, and the False-and-True group is a bit higher, but still less than this one; these two values are quite similar. You can also look at the median, the max, etc.

And you can also just access one column: let's say you are only interested in the age, you select the age after the groupby, and with to_frame it looks nice as a data frame; or you select age and something else by putting them in a list with brackets, and then you don't need to_frame. So this is how groupby is used; it is very practical, it enables you to do more calculation, it works a bit like in SQL, and it performs operations like mean, max, median, count on groups. Note that when you do df_diabetes.sum(), remember I did .sum(), you get the sum per column; you can also do .mean(), so you get the mean, or .max(), so the mean and the max for each column, which is a bit what you get in describe. So you can do these operations on the whole data frame, or you can just select a column and do df['age'].mean().