Dataframes Part 02 - 03/03
Summary
TLDR: This script discusses advanced techniques in pandas, focusing on the group by operation and its extension with custom functions. It explains how to apply functions like calculating the length or mean of grouped data. The tutorial also covers concatenating data frames, either by stacking or placing them side by side, and the importance of column alignment. Additionally, it delves into data frame filtering using masks and conditions, illustrating how to create sub-data frames based on specific criteria. The script concludes with a mention of an upcoming practical case study involving data exploration with pandas.
Takeaways
- 😀 The script discusses an extension of the 'group by' function in pandas, which allows for the application of custom functions.
- 🔑 It introduces the use of lambda functions to apply calculations like length or mean to grouped data.
- 📚 An example is given where 'group by' is used on a DataFrame, and then a lambda function is applied to calculate the average of a column.
- 🔄 The script covers concatenating DataFrames, either by stacking them on top of each other or placing them side by side.
- 🧩 When concatenating, it's important to ensure that the DataFrames have compatible structures, especially regarding column names.
- 🔍 The concept of filtering DataFrames using boolean masks is explained, which is similar to the 'WHERE' clause in SQL.
- 📊 Filtering can be done on multiple conditions, combining them with the element-wise operators '&' (and) and '|' (or).
- 🔑 The script explains how to invert a mask using the tilde (~), which selects the opposite of the mask's condition.
- 📈 It demonstrates the power of DataFrames in allowing the selection of rows based on conditions without specifying columns.
- 🏙️ An example is provided where filtering is used to select restaurants in specific cities, similar to using 'WHERE IN' in SQL.
Q & A
What is the extension of group by discussed in the script?
-The extension of group by discussed is the ability to apply custom functions, such as calculating the average or length of data within groups.
How does one define a custom function in the context of group by?
-A custom function can be defined using a lambda function, which is applied to the grouped data to perform specific operations like calculating the mean or length.
What is the purpose of using 'lambda x' in the script?
-In the script, 'lambda x' is used to define an anonymous function that can be applied to each group in a group by operation to perform calculations such as the mean or length of the group.
What does the script mean by 'X refers to the whole mini data frame'?
-The script is indicating that within the lambda function, 'X' represents the entire subset of the data frame that has been grouped by the specified criteria.
How can you concatenate data frames in pandas as discussed in the script?
-You can concatenate data frames in pandas using the 'concat' function, specifying the data frames as a list and setting the 'axis' parameter to either 0 (stack vertically) or 1 (stack horizontally).
Why is it necessary to rename columns before concatenating data frames?
-Renaming columns before concatenating is necessary to ensure that the data frames have matching column names if you want to stack them on top of each other. With mismatched column names, the rows are still stacked, but each data frame's values land in its own columns and the gaps are filled with NaN.
What is meant by 'filtering on a data frame' in the context of the script?
-Filtering on a data frame refers to the process of selecting rows based on certain conditions, such as values being greater than a specified number, using boolean indexing.
How is the filtering process in pandas similar to the WHERE condition in SQL?
-The filtering process in pandas is similar to the WHERE condition in SQL in that it allows for the selection of rows based on specific conditions, using boolean masks to filter the data frame.
What is a mask in the context of data frame operations?
-A mask in the context of data frame operations is a boolean array that is used to filter the data frame, selecting rows where the mask evaluates to True.
How can you invert a boolean mask in pandas?
-You can invert a boolean mask in pandas by using the '~' operator, which flips True values to False and vice versa, effectively selecting the opposite condition.
What is the practical case with pandas mentioned at the end of the script?
-The practical case with pandas mentioned is an exploration of a data frame, which likely involves applying the concepts discussed, such as group by with custom functions, concatenation, and filtering, to analyze and manipulate real-world data.
Outlines
🔍 Exploring GroupBy Extensions and Custom Functions
In this paragraph, the focus is on expanding the use of the GroupBy function in data analysis with custom functions. The narrator demonstrates how to define custom operations using lambda functions, such as calculating the length or mean of grouped data. The use of lambda functions to manipulate mini data frames within GroupBy operations is highlighted, along with practical examples like finding twice the mean of a column. The paragraph also emphasizes the flexibility of applying these custom functions to subsets of data frames and using GroupBy to handle various operations.
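As a minimal sketch of the lambda-based group sizes described above, assuming a small toy data frame (the names `df`, `col1`, and `letter` are illustrative, not taken from the lecture notebook):

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "col1": [0, 0, 1, 1, 1],
    "letter": ["a", "b", "c", "d", "e"],
})

# Inside apply, x is the mini data frame of one group,
# so len(x) is the number of rows in that group.
group_sizes = df.groupby("col1").apply(lambda x: len(x))

# The built-in size() gives the same numbers without a lambda.
builtin_sizes = df.groupby("col1").size()

print(group_sizes.to_dict())
print(builtin_sizes.to_dict())
```

The lambda form is only needed when no built-in aggregation does the job; for plain counts, `size()` or `count()` is the shorter route.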
🧩 Concatenating Data Frames and Renaming Columns
This paragraph explains how to concatenate data frames, either by stacking them on top of each other or placing them side by side. The author shows how to use the 'concat' function with different axis options to achieve the desired result. They also describe the importance of renaming columns to ensure proper concatenation, as mismatched column names can lead to undesired behavior. The narrator explains how renaming affects only specific data frames without altering the original, and the process of concatenating while keeping data organized.
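The rename-then-stack step can be sketched as follows. The two data frames and their column names are made up for illustration; only `rename` and `pd.concat` come from the lecture.

```python
import pandas as pd

# Hypothetical data frames whose columns mean the same thing
# but are named differently.
df_a = pd.DataFrame({"col1": [1, 2], "letter": ["a", "b"]})
df_b = pd.DataFrame({"column_1": [3, 4], "Letter": ["c", "d"]})

# rename returns a new data frame; df_b itself is left unchanged.
df_b_renamed = df_b.rename(columns={"column_1": "col1", "Letter": "letter"})

# axis=0 stacks the rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([df_a, df_b_renamed], axis=0, ignore_index=True)

print(stacked)
```

Using `ignore_index=True` is a choice made here to avoid duplicate row labels in the result; the lecture does not mention it.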
🛑 Filtering Data Frames with Masks and Conditions
In this paragraph, the author discusses the use of filtering and masking to select specific rows within a data frame based on conditions. The narrator demonstrates how to create masks by applying logical conditions, such as filtering rows where the BMI is greater than zero. This process is likened to the use of filters in Excel or the WHERE clause in SQL. They also discuss combining multiple conditions with logical operators (AND, OR) to refine filtering. Additionally, the use of negation with the tilde (~) operator to get the opposite of a condition is illustrated, and the power of applying these filters across entire data frames is emphasized.
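A small sketch of a boolean mask and of combined conditions, assuming a toy data frame with `bmi` and `s2` columns (the values are invented). Note that each condition needs its own parentheses when combined with `&` or `|`.

```python
import pandas as pd

# Hypothetical values standing in for the diabetes data.
df = pd.DataFrame({
    "bmi": [0.05, -0.02, 0.01, -0.04],
    "s2": [-0.01, 0.03, 0.02, -0.03],
})

# Comparing a column with a number gives a Series of True/False: the mask.
mask = df["bmi"] > 0

# Putting the mask inside the brackets keeps only the rows where it is True.
positive_bmi = df[mask]

# & means "and", | means "or"; both work element by element.
both = df[(df["bmi"] > 0) & (df["s2"] < 0)]
either = df[(df["bmi"] > 0) | (df["s2"] < 0)]

print(len(positive_bmi), len(both), len(either))
```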
Keywords
💡Group By
💡Custom Function
💡Lambda Function
💡Concatenate
💡Data Frame
💡Filtering
💡Masking
💡Pandas
💡Aggregate Functions
💡Renaming Columns
💡Exploration
Highlights
Introduction to the extension of group by in pandas, emphasizing its practical applications.
Custom function application in group by using Lambda functions for aggregation.
Demonstration of how to apply a custom function to calculate the average of a column within a group.
Explanation of how to use the group by method to group data by a specific column.
The concept of using 'x' in Lambda functions to reference the grouped data frame.
Example of applying a custom function to calculate twice the mean of a column in a group.
Renaming columns for concatenation to ensure data frames have compatible structures.
Using the concat function to combine data frames, either side by side or on top of each other.
The importance of ensuring data frames have the same shape when concatenating side by side.
Explanation of filtering data frames using boolean masks created from conditions.
Practical example of filtering a data frame based on BMI values greater than zero.
The use of the tilde (~) to invert boolean masks for filtering.
Combining multiple conditions using '&' and '|' for advanced data frame filtering.
The power of selecting rows based on conditions without specifying particular columns.
Comparison of data frame operations to SQL queries, highlighting the similarities in filtering and grouping.
A preview of the next lecture: a practical case study of data exploration with a pandas DataFrame.
Transcripts
So that is it for the basic group by, which is very practical. But there is an extension to group by that I think is important to look at: it enables us to apply custom functions. Let's say I go back to where I defined my df_dict at the start. I am going to put in more letters, and I am going to put in a zero. Then I need to re-run the cells that use df_dict, so that I have a new object with a bit more variety. So I have my df_dict, and it looks like this. Now I want to do df_dict.groupby, and I want to group it by column one. Okay, that works.
Then I want to apply a special function. How do I apply a special function? Here I put the length, but it could just as well be the average. I call apply and give it what is called a lambda function: lambda x, and then what I want to do with x. For instance, I can just do len of x. If you do this, you get the length. The length here is a bit like a count, so if you use count instead, you get much the same result: for zero you have two, et cetera. Here x is each group in turn. If you just return x, so lambda x: x, you see the different groups, because that function does not do anything. When we put len of x, you get the length of each group. You can also write tailored functions. Let's say I am working with my df_diabetes.
I take df_diabetes and group by high cholesterol, so I have two groups. Let's say I want to apply a custom function to my numbers: I do apply with lambda x. Say I want the mean of a column: I can do x.bp, and then .mean, so the mean of x.bp, and I get the same result as before. But let's say I want twice the mean. What do I do? I take my x.bp, I take the mean, and I multiply it by two. Here x refers to the whole mini data frame. When I group by high cholesterol, I have two categories, high and low, so I have two mini data frames. On each mini data frame I access a column, and on this column I call .mean. That is how it works, and that is how you can apply a function. If I select the bp column before apply, I do not need to write .bp inside the lambda, and it does the same thing: I am only applying the function to that column. So we have bp, then .apply, and you can see you can apply different things.
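The "twice the mean" example can be sketched like this. The data frame below is a made-up stand-in for the diabetes data, and the column names `high_chol` and `bp` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_diabetes with a two-category column.
df = pd.DataFrame({
    "high_chol": ["high", "high", "low", "low"],
    "bp": [1.0, 3.0, 10.0, 20.0],
})

# x is the mini data frame of one group: pick a column, then aggregate.
mean_bp = df.groupby("high_chol").apply(lambda x: x.bp.mean())
twice_mean = df.groupby("high_chol").apply(lambda x: 2 * x.bp.mean())

# Selecting the column first means the lambda receives a Series,
# so there is no need to write .bp inside it.
twice_mean_col = df.groupby("high_chol")["bp"].apply(lambda s: 2 * s.mean())

print(twice_mean.to_dict())
```

Both forms give one value per group; selecting the column first is usually the lighter option when only one column is involved.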
Now what we would like to do is concatenate. I think we have df_dict, and we also have df_new_dict. To concatenate data frames, they have to look a bit the same. There are different ways of concatenating data frames: we can put them next to each other, or we can put them one after the other. We use the concat function. I need to pass my two data frames as a list, so I have df_new_dict and df_dict, and then I specify axis equals zero. There are a few things you need to be careful about. Here, when I concatenate them, I get one set of columns and then another, because the columns do not have the same names. It tries to put them on top of each other, but because the column names do not match, the result looks like this. So I am going to rename the columns of df_new_dict. This is not going to change the columns in my original data frame, because rename gives back a renamed copy. I do the same as before: rename, with columns, and I map the names for column one, column two, and letter. No, it is the contrary, I am going to rename the other one instead. Okay, if I do this, I rename my columns, and now I just have column one, column two, and letter, and the two data frames are stacked on top of each other. If I want to put them next to each other, that is also a possibility, and I do not need to rename them in this case. I do axis equals one; I need to close the parenthesis. When I put axis equals one, I am putting them next to each other. You can see that because they do not have the same shape, some values are missing. So if you want to put them side by side, it is better that they have the same shape, so you know where everything goes. It is often better just to merge them, so that you control how they are combined. And then you also have the version on top of each other.
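To illustrate the two pitfalls just mentioned, here is a small sketch with invented data frames (`short` and `long` are hypothetical names). It shows where the missing values come from when shapes or column names differ.

```python
import pandas as pd

# Two hypothetical data frames with different lengths and column names.
short = pd.DataFrame({"col1": [1, 2]})
long = pd.DataFrame({"letter": ["a", "b", "c"]})

# axis=1 aligns on the row index: the third row has no value for col1.
side_by_side = pd.concat([short, long], axis=1)

# axis=0 aligns on column names: with no name in common, every row
# is padded with NaN in the column it does not have.
unaligned = pd.concat([short, long], axis=0)

print(side_by_side)
print(unaligned)
```

When the rows should be matched on a key rather than on their position, `merge` makes that matching explicit, which is the control the lecture refers to.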
The last topic on data frames is filtering, which is about masking. Filtering is very practical. We have seen before, when we created df_diabetes, that I could look at, let's say, the BMI, and I could check whether the BMI column of df_diabetes is greater than zero. If I do this, I see that it returns an array of True or False. This is not really Python-like, because on the left-hand side I have a Series, and I am checking if it is greater than zero. But we understand the operation I am doing. It is a bit like in Excel, where I would write a formula, equals, is this cell greater than zero, and I would get true or false, and then I would extend the formula down the column. I could do the same with greater than three and check which cells come out true and which come out false. So when you do this greater-than-zero operation, it checks for every cell whether it is greater than zero. This is how it works.
I take the BMI column of df_diabetes and I check, for every value, whether it is greater than zero. What I am creating here is a bit like a mask. Why a mask? Because maybe I want to select in my data frame only the rows that have a BMI greater than zero. Let's say I call it mask_one. So I created a mask: mask_one is this Series of True and False. Now I take df_diabetes and I apply my mask_one. If I look at the shape of the result, I see that I have much less data than in df_diabetes. So what happened? If I do df_diabetes.shape, I have 442 rows, and if I do df_diabetes with mask_one, .shape, I have 151. So some rows have been filtered out of df_diabetes. How does it work? Before, I was selecting a column when I put a column name inside the brackets; now I put a Series inside. What does it mean? If I copy-paste the definition of mask_one there, we see that it means: I want df_diabetes where the BMI of df_diabetes is greater than zero. If I look at the BMI in this filtered data frame and take the min, the minimum of my BMI is positive: all the values are greater than zero. But in my full df_diabetes, if I look at the BMI and take the min, I see that I have something negative. So negative values do exist there.
So this is how it works: we have a condition, written there, and it enables us to filter the data frame based on that condition. I have a condition here, and I put it inside the brackets of df_diabetes. You could also have several conditions. You could say I want my BMI to be greater than zero, and I want S2 to be lower than zero. If you do this, you combine them with the ampersand, which means and. Or you could do or, which is the single vertical bar. We could also take the contrary. Remember my condition, df_diabetes BMI greater than zero: I can put a tilde, the little wave, in front of it, and it just gives the contrary. Everything that was True is now False, and everything that was False is now True. If I use this with the little wave inside df_diabetes, I get the rest of the rows. If I take the shape of this one, the shape of the one without the little wave, and df_diabetes.shape, I see that the two add up to the total. Here is the one where the BMI is greater than zero, and here is the one where the BMI is not greater than zero, which means it is less than or equal to zero.
We can now create sub data frames. Notice that when I do this, I am not selecting a particular column: I am selecting the whole data frame where this condition is met, so all the rows where the condition is met. That is what makes data frames powerful. I do not have to say I want all these columns. I can also access only S1, or only S2; this is possible.
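A minimal sketch of the inverted mask and of picking a single column after filtering, using invented values (the column names `bmi` and `s2` follow the lecture; the numbers do not).

```python
import pandas as pd

# Hypothetical values; note the row where bmi is exactly zero.
df = pd.DataFrame({
    "bmi": [0.05, -0.02, 0.0, 0.01, -0.04],
    "s2": [0.1, 0.2, 0.3, 0.4, 0.5],
})

mask = df["bmi"] > 0

kept = df[mask]    # rows where bmi > 0
rest = df[~mask]   # the tilde flips the mask: rows where bmi <= 0

# The two sub data frames together contain every original row.
rows_add_up = kept.shape[0] + rest.shape[0] == df.shape[0]

# Filtering keeps all columns by default; .loc can pick one as well.
s2_only = df.loc[mask, "s2"]

print(kept.shape, rest.shape, rows_add_up)
```

The row with a BMI of exactly zero ends up in `rest`, which is why the inverted condition reads "less than or equal to zero" rather than "less than zero".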
So this is how it works, and this is one of the biggest strengths of data frames. When you do data analysis, you have your data frame and you say: I only want to work on the rows where some condition holds. Let's say we have our restaurants. It works a bit like the WHERE condition in SQL: just as there is a group by, this is a bit like the WHERE condition in SQL. As a reminder, df_restaurants was looking like this. I take df_restaurants, and I can use isin: inside the brackets I take the city column of df_restaurants, maybe it is easier to write .city, and I say isin Paris. Then it returns all the restaurants where the city is Paris. I have to pass this as a list. I can also say I want the ones in Paris and London, and then I get the ones in Paris and London.
The filtering is a bit like this WHERE condition. In a WHERE condition in SQL you say, I want this column to be greater than zero, and that is what we do here: I want this column to be greater than zero. That is why I am saying that a data frame is a great mix. From SQL, you have this WHERE condition, with the filtering and the mask. You have the concatenation, which corresponds to a UNION in SQL. You have the group by, which is a GROUP BY in SQL. And you have this dictionary-like structure where you have keys: each column is a bit like a key with a list of values. So this is all about data frames. Now we are going to go for a practical case with pandas, where we are going to do an exploration of a data frame. That's it for now, and see you in the next lecture.