Dataframes Part 02 - 03/03

Develhope
14 Oct 2022 14:50

Summary

TL;DR: This script covers further pandas techniques, starting with an extension of the group by operation that applies custom functions, such as the length or mean of each group. It then shows how to concatenate data frames, either stacked on top of each other or placed side by side, and why column names must match when stacking. Finally, it covers filtering with boolean masks and conditions to build sub-data frames that meet specific criteria. The script closes by announcing a practical case study on data exploration with pandas.

Takeaways

  • 😀 The script discusses an extension of the 'group by' function in pandas, which allows for the application of custom functions.
  • 🔑 It introduces the use of lambda functions to apply calculations like length or mean to grouped data.
  • 📚 An example is given where 'group by' is used on a DataFrame, and then a lambda function is applied to calculate the average of a column.
  • 🔄 The script covers concatenating DataFrames, either by stacking them on top of each other or placing them side by side.
  • 🧩 When concatenating, it's important to ensure that the DataFrames have compatible structures, especially regarding column names.
  • 🔍 The concept of filtering DataFrames using boolean masks is explained, which is similar to the 'WHERE' clause in SQL.
  • 📊 Filtering can be done on multiple conditions, combining them with the element-wise operators '&' (and) and '|' (or).
  • 🔑 The script explains how to invert a mask using the tilde operator (~), which selects the rows where the mask's condition is false.
  • 📈 It demonstrates the power of DataFrames in allowing the selection of rows based on conditions without specifying columns.
  • 🏙️ An example is provided where filtering is used to select restaurants in specific cities, similar to using 'WHERE IN' in SQL.
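
The last takeaway can be sketched as follows. This is a minimal illustration rather than the presenter's actual data: the frame `df_restaurants`, its `name` and `city` columns, and the rows are assumptions made up for the example. `Series.isin` takes a list of allowed values, much like `WHERE city IN (...)` in SQL.

```python
import pandas as pd

# Hypothetical restaurant data; names and cities are invented for illustration.
df_restaurants = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "city": ["Paris", "London", "Rome", "Paris"],
})

# isin expects a list-like of values, even when there is only one value.
in_paris = df_restaurants[df_restaurants["city"].isin(["Paris"])]

# Several allowed values behave like WHERE city IN ('Paris', 'London').
in_paris_or_london = df_restaurants[df_restaurants["city"].isin(["Paris", "London"])]

print(in_paris["name"].tolist())            # ['A', 'D']
print(in_paris_or_london["name"].tolist())  # ['A', 'B', 'D']
```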

Q & A

  • What is the extension of group by discussed in the script?

    -The extension of group by discussed is the ability to apply custom functions, such as calculating the average or length of data within groups.

  • How does one define a custom function in the context of group by?

    -A custom function can be defined using a lambda function, which is applied to the grouped data to perform specific operations like calculating the mean or length.

  • What is the purpose of using 'lambda x' in the script?

    -In the script, 'lambda x' is used to define an anonymous function that can be applied to each group in a group by operation to perform calculations such as the mean or length of the group.

  • What does the script mean by 'X refers to the whole mini data frame'?

    -The script is indicating that within the lambda function, 'X' represents the entire subset of the data frame that has been grouped by the specified criteria.

  • How can you concatenate data frames in pandas as discussed in the script?

    -You can concatenate data frames in pandas using the 'concat' function, passing the data frames as a list and setting the 'axis' parameter to either 0 (stack them on top of each other) or 1 (place them side by side).

  • Why is it necessary to rename columns before concatenating data frames?

    -Renaming columns before concatenating ensures that the data frames share the same column names when you stack them on top of each other. With mismatched names, pandas still stacks the rows but keeps every distinct column and fills the gaps with missing values (NaN), instead of lining the values up in shared columns.

  • What is meant by 'filtering on a data frame' in the context of the script?

    -Filtering on a data frame refers to the process of selecting rows based on certain conditions, such as values being greater than a specified number, using boolean indexing.

  • How is the filtering process in pandas similar to the WHERE condition in SQL?

    -The filtering process in pandas is similar to the WHERE condition in SQL in that it allows for the selection of rows based on specific conditions, using boolean masks to filter the data frame.

  • What is a mask in the context of data frame operations?

    -A mask in the context of data frame operations is a boolean array that is used to filter the data frame, selecting rows where the mask evaluates to True.

  • How can you invert a boolean mask in pandas?

    -You can invert a boolean mask in pandas by using the '~' operator, which flips True values to False and vice versa, effectively selecting the opposite condition.

  • What is the practical case with pandas mentioned at the end of the script?

    -The practical case with pandas mentioned is an exploration of a data frame, which likely involves applying the concepts discussed, such as group by with custom functions, concatenation, and filtering, to analyze and manipulate real-world data.

Outlines

00:00

🔍 Exploring GroupBy Extensions and Custom Functions

In this paragraph, the focus is on expanding the use of the GroupBy function in data analysis with custom functions. The narrator demonstrates how to define custom operations using lambda functions, such as calculating the length or mean of grouped data. The use of lambda functions to manipulate mini data frames within GroupBy operations is highlighted, along with practical examples like finding twice the mean of a column. The paragraph also emphasizes the flexibility of applying these custom functions to subsets of data frames and using GroupBy to handle various operations.
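
A minimal sketch of the length example described above, assuming a small made-up frame. The column names `col1` and `letter` and the values are illustrative only, not the presenter's data. The function passed to `apply` receives one group at a time.

```python
import pandas as pd

# Hypothetical toy frame; col1 and letter are illustrative names.
df_dict = pd.DataFrame({
    "col1": [0, 0, 1, 1, 1, 2],
    "letter": ["a", "b", "c", "d", "e", "f"],
})

# Select a column after grouping, so x is the Series of letters for one group.
group_sizes = df_dict.groupby("col1")["letter"].apply(lambda x: len(x))

# The built-in count gives the same numbers here because no value is missing.
group_counts = df_dict.groupby("col1")["letter"].count()

print(group_sizes.to_dict())  # {0: 2, 1: 3, 2: 1}
```

The two results differ only when a group contains missing values: `len` counts every row, while `count` skips NaN.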

05:01

🧩 Concatenating Data Frames and Renaming Columns

This paragraph explains how to concatenate data frames, either by stacking them on top of each other or placing them side by side. The author shows how to use the 'concat' function with the axis option set to 0 or 1 to achieve the desired result. They also describe the importance of renaming columns before stacking: when column names do not match, the rows are still stacked, but each frame's values end up in separate columns padded with missing values. The narrator notes that renaming returns a new data frame and leaves the original unchanged, which keeps the data organized while concatenating.
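
The two axis options can be sketched as below. The frames `top` and `bottom` and their contents are assumptions made up for the illustration; only `pd.concat` and its `axis` parameter come from the video.

```python
import pandas as pd

# Two hypothetical frames with the same column names but different row counts.
top = pd.DataFrame({"col1": [1, 2], "letter": ["a", "b"]})
bottom = pd.DataFrame({"col1": [3], "letter": ["c"]})

# axis=0 stacks the rows on top of each other; columns are matched by name.
stacked = pd.concat([top, bottom], axis=0)

# axis=1 places the frames side by side; rows are matched by index label,
# so the shorter frame is padded with NaN.
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked.shape)       # (3, 2)
print(side_by_side.shape)  # (2, 4)
```

Because side-by-side concatenation aligns on the index, frames of different lengths leave gaps, which is why the presenter suggests a merge when you need control over how rows are matched.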

10:03

🛑 Filtering Data Frames with Masks and Conditions

In this paragraph, the author discusses the use of filtering and masking to select specific rows within a data frame based on conditions. The narrator demonstrates how to create masks by applying logical conditions, such as filtering rows where the BMI is greater than zero. This process is likened to the use of filters in Excel or the WHERE clause in SQL. They also discuss combining multiple conditions with logical operators (AND, OR) to refine filtering. Additionally, the use of negation with the tilde (~) operator to get the opposite of a condition is illustrated, and the power of applying these filters across entire data frames is emphasized.
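
A minimal sketch of building a mask and combining conditions, assuming a small made-up frame. The column names `BMI` and `S2` follow the video, but the values are invented. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

# Hypothetical values standing in for the diabetes data used in the video.
df_diabetes = pd.DataFrame({
    "BMI": [0.06, -0.05, 0.04, -0.01],
    "S2": [-0.03, 0.02, 0.01, -0.04],
})

# A mask is a boolean Series with one True/False per row.
mask1 = df_diabetes["BMI"] > 0

# Indexing with the mask keeps whole rows, all columns included.
positive_bmi = df_diabetes[mask1]

# Combine conditions element-wise: & means "and", | means "or".
both = df_diabetes[(df_diabetes["BMI"] > 0) & (df_diabetes["S2"] < 0)]
either = df_diabetes[(df_diabetes["BMI"] > 0) | (df_diabetes["S2"] < 0)]

print(mask1.tolist())                                # [True, False, True, False]
print(len(positive_bmi), len(both), len(either))     # 2 1 3
```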

Keywords

💡Group By

Group By is a fundamental concept in data manipulation, often used in SQL and pandas for organizing data into groups based on certain criteria. In the context of the video, Group By is used to segment data into groups based on a common key, such as 'column one' in the script. This allows for aggregate functions to be applied to each group, such as calculating the mean or sum. The video script mentions using Group By to group data and then applying a custom function to each group.

💡Custom Function

A custom function is a user-defined function that performs a specific operation. In the video, the presenter discusses applying custom functions to groups created by the Group By operation. For example, a Lambda function is used to calculate the average of a certain column within each group. This demonstrates how custom functions can be tailored to perform specific calculations or transformations on grouped data.

💡Lambda Function

Lambda functions are small, anonymous functions in Python that are defined on the fly and are typically used for short, simple operations. In the script, the presenter uses a Lambda function to calculate the length of data within each group or to perform calculations like 'twice the mean' of a column. Lambda functions provide a concise way to apply operations to data frames.
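
The 'twice the mean' example might look like the sketch below. The frame, the grouping column name `high_chol`, and the values are assumptions for illustration; the video only names the `BP` column and a high-cholesterol grouping. Selecting `[["BP"]]` keeps each group as a mini data frame, while `["BP"]` hands the lambda a single column.

```python
import pandas as pd

# Hypothetical stand-in for the diabetes frame; values are made up.
df_diabetes = pd.DataFrame({
    "high_chol": ["high", "high", "low", "low"],
    "BP": [100.0, 120.0, 80.0, 90.0],
})

# Here x is a mini data frame (one per group), so the column is reached as x.BP.
twice_mean_frame = df_diabetes.groupby("high_chol")[["BP"]].apply(
    lambda x: 2 * x.BP.mean()
)

# Here x is already the BP column of one group, so no column access is needed.
twice_mean_series = df_diabetes.groupby("high_chol")["BP"].apply(
    lambda x: 2 * x.mean()
)

print(twice_mean_series.to_dict())  # {'high': 220.0, 'low': 170.0}
```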

💡Concatenate

Concatenation in the context of data frames refers to the process of combining multiple data frames into a single data frame. The video script mentions using the concat function to join two data frames, either by stacking them on top of each other or placing them side by side. This is useful for merging data from different sources or organizing data in a specific structure.

💡Data Frame

A data frame is a 2-dimensional labeled data structure with columns of potentially different types. In the video, data frames are the primary objects being manipulated. They are used to store and organize data, and various operations such as filtering, grouping, and concatenating are performed on them.

💡Filtering

Filtering is the process of selecting a subset of data that meets certain conditions. The video script describes how to use conditions to filter rows in a data frame based on column values, such as selecting rows where the BMI is greater than zero. This is akin to the WHERE clause in SQL and is essential for data analysis and cleaning.

💡Masking

Masking in pandas refers to the use of boolean arrays to filter data. In the script, the presenter creates a mask to filter out rows where the BMI is not greater than zero. Masking allows for conditional data selection, which is a powerful feature for data analysis and manipulation.
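
Inverting a mask with `~` can be sketched as follows, assuming a made-up `BMI` column. The mask and its inverse split the frame into two parts whose row counts add up to the total, which is the check the presenter performs with `.shape`.

```python
import pandas as pd

# Hypothetical BMI values, including an exact zero.
df_diabetes = pd.DataFrame({"BMI": [0.06, -0.05, 0.04, -0.01, 0.0]})

mask1 = df_diabetes["BMI"] > 0

kept = df_diabetes[mask1]    # rows where BMI > 0
rest = df_diabetes[~mask1]   # rows where the condition is False (here BMI <= 0)

print(kept.shape[0], rest.shape[0], df_diabetes.shape[0])  # 2 3 5
```

Note that a missing value (NaN) fails the `> 0` test, so such rows would land in the inverted selection as well.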

💡Pandas

Pandas is a powerful Python library used for data manipulation and analysis. Throughout the video script, pandas is used to demonstrate various data operations such as grouping, filtering, and concatenating data frames. It is a core tool for data scientists and analysts working with Python.

💡Aggregate Functions

Aggregate functions perform a calculation on a group of values and return a single value. In the script, aggregate functions like mean or count are applied to groups created by the Group By operation. These functions are essential for summarizing and analyzing data.
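
As a hedged illustration of built-in aggregates, the sketch below computes the mean and count per group in one call. The `agg` method is not shown in the video; it is swapped in here as a compact way to run several aggregates at once. The frame and the `high_chol` column name are assumptions.

```python
import pandas as pd

# Hypothetical data; only the BP column name comes from the video.
df_diabetes = pd.DataFrame({
    "high_chol": ["high", "high", "low", "low"],
    "BP": [100.0, 120.0, 80.0, 90.0],
})

# One row per group, one column per aggregate function.
summary = df_diabetes.groupby("high_chol")["BP"].agg(["mean", "count"])

print(summary.loc["high", "mean"])   # 110.0
print(summary.loc["high", "count"])  # 2
```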

💡Renaming Columns

Renaming columns is a common task in data manipulation where column names are changed to better reflect their content or to standardize them. In the video, the presenter renames columns before concatenating data frames to ensure that the resulting data frame has consistent column names.
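
A minimal sketch of renaming before stacking, under assumed names: `df_dict`, `df_new_dict`, and the column names `col1`, `col2`, `letter`, and `Letter` are guesses based on the transcript, not confirmed by the source. `rename` returns a new frame, so the original keeps its column names.

```python
import pandas as pd

# Hypothetical frames whose column names do not match.
df_dict = pd.DataFrame({"col1": [0, 1], "letter": ["a", "b"]})
df_new_dict = pd.DataFrame({"col2": [2, 3], "Letter": ["c", "d"]})

# Without renaming, stacking keeps all four columns and pads with NaN.
mismatched = pd.concat([df_new_dict, df_dict], axis=0)

# rename returns a copy with the new names; df_new_dict itself is unchanged.
renamed = df_new_dict.rename(columns={"col2": "col1", "Letter": "letter"})
aligned = pd.concat([renamed, df_dict], axis=0)

print(mismatched.shape)           # (4, 4)
print(aligned.shape)              # (4, 2)
print(list(df_new_dict.columns))  # ['col2', 'Letter']
```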

💡Exploration

Exploration in the context of the video refers to the process of analyzing and understanding data through various operations like filtering, grouping, and visualizing. The script mentions that after discussing the concepts, the presenter will move on to a practical case where they will explore a data frame using pandas.

Highlights

Introduction to the extension of group by in pandas, emphasizing its practical applications.

Custom function application in group by using Lambda functions for aggregation.

Demonstration of how to apply a custom function to calculate the average of a column within a group.

Explanation of how to use the group by method to group data by a specific column.

The concept of using 'x' in Lambda functions to reference the grouped data frame.

Example of applying a custom function to calculate twice the mean of a column in a group.

Renaming columns for concatenation to ensure data frames have compatible structures.

Using the concat function to combine data frames, either side by side or on top of each other.

The importance of ensuring data frames have the same shape when concatenating side by side.

Explanation of filtering data frames using boolean masks created from conditions.

Practical example of filtering a data frame based on BMI values greater than zero.

The use of the tilde operator (~) to invert boolean masks for filtering.

Combining multiple conditions using '&' and '|' for advanced data frame filtering.

The power of selecting rows based on conditions without specifying particular columns.

Comparison of data frame operations to SQL queries, highlighting the similarities in filtering and grouping.

Preview of the next lecture: a practical case study of data exploration with a pandas DataFrame.

Transcripts

play00:05

so this is it for the group by and is

play00:08

very practical but this is an extension

play00:11

to group by that I think it's

play00:14

um important to look at so this

play00:17

extension of group by

play00:19

um is basically enabling us uh to

play00:24

um to um oh sorry

play00:29

um to to have custom function so let's

play00:32

say in my dict so I'm gonna go where I

play00:36

Define my dict the DF dict

play00:38

DF dict at the start so I go there and I'm

play00:42

going to put more letters

play00:46

da

play00:49

da da dum

play00:53

and I'm gonna put a zero

play01:00

okay I need to re-execute the DF dict

play01:03

anywhere up so you know I have a new D

play01:06

object right with a bit more variety so

play01:08

I have my DF dict and it looks like

play01:10

this so now what I want to do I want to

play01:12

do DF dict

play01:16

um dot groupby so I want to group it

play01:19

by column one let's say so I want to

play01:21

group it by column one okay that works

play01:24

then I want to apply a special function

play01:26

so how do I apply a special function I

play01:29

want to apply functions that will give

play01:31

me here I put the length but let's say I

play01:33

wanna get the average of the net so I

play01:36

will do I apply a function X this is

play01:39

called a Lambda function so I will do my

play01:40

Lambda I apply a function Lambda

play01:45

x x and I will do for instance uh like X

play01:52

or I can just do like

play01:56

len

play01:58

of x

play02:00

so if you do this uh you will get the

play02:03

length so here uh the length is a bit

play02:06

like a count you know so if you do the

play02:07

same here as you count it will work the

play02:10

same so if you do a count uh you will

play02:13

get a bit the same so you will get your

play02:15

zero for zero you have two Etc so here

play02:18

you see that X is an array

play02:20

so if you just return your group by your

play02:24

column 1 and you do see oh you have these

play02:27

different arrays and because you just do

play02:29

lambda x: x it doesn't do anything

play02:31

uh yeah so you could do like X or if you

play02:35

if you do like

play02:37

um I don't know yeah when we put len of

play02:38

X for instance you will have the length

play02:40

of the different stuff and you can do

play02:42

tailored function like let's say I'm

play02:44

working with my DF diabetes

play02:47

um DF diabetes and I group by I

play02:51

was having like this like high high

play02:54

cholesterol right so I have like two

play02:55

stuff and uh let's say I want to do some

play02:59

custom function on my number I will do

play03:01

apply uh Lambda to X

play03:05

um so I want to look let's say I want to

play03:08

do the mean of two colon you know I

play03:10

could do x dot BP

play03:13

um

play03:14

x dot BP uh so if I do this I can do

play03:19

that mean

play03:20

so I can do the mean of x.BP and then

play03:22

I got the same but let's say I want to

play03:24

do a mean I want to do twice the mean so

play03:27

here what do I do I have my X

play03:29

I say oh I want twice the mean of my BP

play03:32

so I will get my x dot BP and I do the

play03:35

mean and I do it twice uh so here's the

play03:38

X refers to the whole mini data frame so

play03:42

basically when I do a group by high

play03:44

cholesterol I have two category high in

play03:46

fold so I have like two mini data frame

play03:48

and then on this mini data frame I want

play03:51

to access what I want to access a column

play03:53

and on this column I'm like okay dot

play03:55

mean and then I access the mean that's

play03:57

how it works

play03:59

uh for for this stuff you know how you

play04:01

can apply function uh if now I do dot BP

play04:04

dot here I don't need to put a it's a BP

play04:08

for instance it would do the same so

play04:10

here just like oh I'm only going to

play04:11

apply this function to my column uh yeah

play04:14

so we will have this like BP dot apply

play04:16

and then you can see you can apply

play04:17

different stuff

play04:19

uh so now what we would like to do is we

play04:21

like to concatenate so I think we have

play04:23

this like dict and we also have this

play04:26

like DF new dict

play04:28

I think we have like new dict

play04:32

so we have DF new dict right so to

play04:34

concatenate data frames they have to

play04:36

look a bit the same there is different

play04:38

way of uh concatenating data frame there

play04:41

is a way to do it uh where we put them

play04:43

next to each other or we put them one

play04:46

after each other so

play04:49

concat function so concat function I

play04:52

need to specify my two data frames as

play04:54

a list so I have DF new dict and I have the

play04:57

DF dict

play04:58

so okay I have these two and then I'm

play05:01

like I want to specify then axis equal

play05:03

zero so there is a bunch of stuff uh you

play05:06

need to be careful uh it's like here I

play05:09

concat them so I have column 1 2

play05:11

letter column one letter because the

play05:13

columns don't have the same name it

play05:16

doesn't concatenate them well so it tends

play05:18

to put them on top of each other but

play05:21

because the column doesn't exist it works

play05:24

like this so I'm going to rename my

play05:26

column of DF new dict I'm going to

play05:28

rename so it's not going to change my

play05:30

column in my DF right because I'm going

play05:33

to rename I'm gonna do the same as

play05:35

before rename and I will do like columns

play05:39

uh columns and I will have so uh col

play05:44

one for call two

play05:48

four call two and then I have letter

play05:53

[Music]

play05:55

a yep for later on later

play06:06

yeah uh no no no no no it's a contrary

play06:09

I'm gonna rename the other one instead

play06:14

okay so here if I do this I rename my

play06:18

column so in the end I just have column

play06:20

one letter uh so I have this two and I

play06:23

just stack them in top of each other uh

play06:25

and now if I want to put them next to

play06:28

each other this is also a possibility

play06:29

right

play06:31

um V and I don't need to rename them

play06:35

in this case and I do axis equal one I

play06:38

need to close the parentheses

play06:41

axis equal one

play06:44

uh so here when I put axis equal one

play06:46

I'm putting them next to each other

play06:48

right and you do see because they don't

play06:49

have the same shape some stuff are NaN

play06:51

so it's better if you want to put them

play06:53

side by side that's the same shape so

play06:55

you know where it goes always better

play06:56

just to merge them so you control by how

play06:58

you merge and basically and then you

play07:01

have this like uh on top of each other

play07:04

uh the last stuff uh on the data frame

play07:08

uh is about the filtering on it so it's

play07:11

about the masking uh so the filtering is

play07:14

very very practical so we've seen before

play07:17

you know like when we created this like

play07:20

um DF diabetes I could look at um let's

play07:23

say at the BMI

play07:26

and I could do DF BMI is greater than

play07:29

zero so if I do this I see that I return

play07:33

an array of true or false

play07:35

so this is not really python-like right

play07:37

because here on the left hand side I'm

play07:39

having what my BMI a Series so this is

play07:42

what I got and here I'm checking if it's

play07:44

greater than zero so we understand the

play07:47

operation I'm doing it's a bit like when

play07:48

I'm going my Excel and I here I will do

play07:51

like is greater than zero and then uh I

play07:54

would do equal

play07:56

is this greater than zero and I will

play07:59

have true or false you know so this is a

play08:01

bit the same or greater than three

play08:03

letter readers aren't through and then I

play08:06

would check oh this is not riches and

play08:07

phrase that says and this is right uh

play08:10

yeah so this is how it works uh so when

play08:12

you do this operation here greater than

play08:13

zero is a bit operation I do here it's

play08:16

like I'm gonna check for every cell if

play08:18

it is greater than zero and I extend the

play08:21

way I'm doing it so this is how it works

play08:24

I'm having my DF diabetes BMI

play08:27

um and I'm checking if they are all

play08:29

greater than zero and here what I'm

play08:31

creating here is a bit like a mask

play08:34

uh why a mask is because like oh

play08:37

maybe I want to select in my data frame

play08:40

also that the rows that are having an

play08:43

index greater

play08:46

BMI greater so here let's say I call it

play08:50

mask one

play08:51

so here I call it mask one and I created

play08:53

a mask right so mask one is this stuff

play08:56

it's true and false and now I'm like

play08:58

okay DF diabetes

play09:02

DF diabetes and I want to apply

play09:05

my mask one

play09:06

so here if I'm creating my mask one I

play09:10

will see I'm like

play09:12

oh what is the shape of it I do see that

play09:14

I have way much less data than in DF

play09:16

diabetes

play09:18

so what did happen you know so I do DF

play09:21

diabetes.shape I do have 442 and if I

play09:25

have DF diabetes mask one dot shape I

play09:27

have 151 so here you have filtered some

play09:30

rows from my DF diabetes startup yeah

play09:33

figure somewhere okay so I have Into

play09:36

Summer and uh how how does it work so

play09:40

here you know before I was selecting a

play09:42

column when I was uh applying something

play09:44

and then yeah I put a value of corner I

play09:47

put a series inside I'm gonna be like

play09:48

okay what does it mean so when I apply

play09:51

this like mask one

play09:53

um I can copy paste it there so we do

play09:55

see it as it means I want DF diabetes

play09:59

where DF diabetes of BMI is greater than

play10:03

zero so here if I look at the BMI

play10:07

and uh I get the min so the minimum of

play10:11

my BMI is positive all the values are

play10:14

greater than zero in this data frame

play10:16

here uh but in my DF diabetes data frame if

play10:20

I look at the BMI

play10:21

then I look at the min I will see I

play10:24

have something negative right so this

play10:26

does exist

play10:28

um so this is how it works and what you

play10:30

should be careful of to um we have this

play10:34

Condition it's written there

play10:36

um and then it's enable us to filter the

play10:40

data frame based on condition so it

play10:41

works I have a condition here and I do

play10:44

DF diabetes brackets like this uh so you

play10:48

could also

play10:49

um have several conditions you know you

play10:52

could have I want my BMI uh to be

play10:56

um greater than zero and I want maybe

play10:58

the S2 to be uh lower than zero

play11:04

so if you do this you will combine with

play11:06

this and so this means and or you could

play11:09

do or so or will be this uh top bar I

play11:14

think or is this bar so or is a single

play11:16

bar uh this is how it works so here you

play11:19

have also one Superior first zero and we

play11:22

could also use the contrary so remember

play11:24

when I do this like DF diabetes BMI

play11:28

greater than zero

play11:29

so here I could do a tilde uh with the

play11:33

little wave and it just gets the contrary

play11:35

right so if I do so DF diabetes BMI

play11:38

that is greater than zero I could do

play11:42

up

play11:43

so here everything that was true is now

play11:46

false and everything that was false is

play11:47

now true so this is how it works with my

play11:50

DF diabet so I have it here uh and I can

play11:53

use you know with stocks this little

play11:55

wave

play11:56

Little Wave uh here I will get the rest

play11:59

so if I do the shape of this one

play12:03

dot shape

play12:04

and I get the shape of the one without

play12:07

the little shape The Little Wave up and

play12:11

I do DF diabetes dot shape

play12:15

so I do see that if I get this one and

play12:18

this one we have the total one so here

play12:21

is the one with the BMI greater than

play12:23

zero and here's the one where the BMI

play12:26

is not greater than zero it means

play12:28

that is inferior or equal to zero uh so

play12:32

we can now create sub data frame and we

play12:34

do see that when I do this I'm not

play12:36

selecting a particular column I'm

play12:38

selecting all my data frame where this

play12:40

condition is met so I'm selecting all

play12:42

the rows where this condition is met

play12:44

that what makes the power of this data

play12:46

frame I don't have to say oh I want all

play12:48

this column you know no I can also only

play12:51

access S21 or I can only access S2 you

play12:54

know this is possible Right

play12:57

um so this is how it works and this is

play12:59

uh one of the biggest strength you know

play13:02

because if you do data analysis you have

play13:04

your data frame and you're like oh I

play13:06

only want to work where uh let's say so

play13:09

we have all restaurants you know so we

play13:10

have a restaurant and it works a bit

play13:12

like the filter the where condition in

play13:14

SQL so since there is a group by and

play13:17

now this is a bit like the where

play13:18

condition in SQL so as a reminder for

play13:21

instance DF restaurants

play13:24

was looking like this so I want DF

play13:26

restaurant and I could use isin and I

play13:29

want DF restaurant in uh DF restaurants

play13:32

dot restaurant or the city maybe it's

play13:35

easier to write dot city and I want to say

play13:39

isin Paris

play13:41

Paris

play13:42

and then it will return all the

play13:45

restaurant with the city is in Paris I I

play13:49

had to purchase this um last year yeah

play13:52

so yeah I got this and I can say I want

play13:54

the one in Paris and London and then I

play13:56

will get the one in Paris and London so

play13:58

the filtering is a bit like this where

play14:01

condition so you know in the where

play14:03

condition industry values I want this

play14:04

column to be greater than zero that's

play14:06

what we do here I want this column to be

play14:08

greater than zero that's why I'm saying

play14:10

that data frame is a great mix between

play14:13

uh SQL where you can access you have

play14:16

like you know this where condition a bit

play14:18

with this filtering in this mask

play14:20

um you got um this concatenation which

play14:23

corresponds to a union in SQL you

play14:25

have this group by that is a group

play14:26

by in SQL and you have this bit like

play14:28

dictionary structure where you have like

play14:30

keys and stuff so it looks a bit like

play14:32

this color like a key with like a list

play14:34

Etc uh so this is all about data frame

play14:36

and now we're gonna go for a practical

play14:39

case with pandas where we're going to do

play14:41

an exploration in the data frame that's

play14:43

it for now and see you in the next

play14:45

lecture
