Dataframes Part 02 - 02/03

Develhope
14 Oct 2022 · 21:22

Summary

TL;DR: The video script offers an in-depth tutorial on data manipulation using Python's pandas library. It covers essential operations like subsetting data frames, analyzing data with 'describe', handling missing values with 'fillna' and 'dropna', and removing duplicates. The script also explains merging and joining data frames akin to SQL, and introduces 'groupby' for aggregate operations. It's a comprehensive guide for data analysis, emphasizing practical applications.

Takeaways

  • 📊 Use the `describe` function to get a statistical summary of a DataFrame, including mean, standard deviation, min, max, and quartiles.
  • 🔍 The `value_counts` function helps to determine the frequency of unique values in a column.
  • 📈 Passing `normalize=True` to `value_counts` converts the counts into proportions of the column total.
  • 🗂️ Access DataFrame columns using the dot notation (e.g., `DataFrame.column_name`) or by bracket notation (e.g., `DataFrame['column_name']`).
  • 🔄 The `drop_duplicates` function is used to remove duplicate rows from a DataFrame.
  • 🔗 `merge` and `join` functions are used to combine two DataFrames based on a common column.
  • 🔄 `sort_values` sorts the DataFrame based on the values of a specified column.
  • 🚫 `dropna` removes rows with missing values, which is useful for data cleaning.
  • 🔄 `fillna` fills missing values with a specified value, which is another method for data cleaning.
  • 🔑 `rename` allows you to change the names of the columns in a DataFrame.
  • 👥 `groupby` performs operations on groups of data, similar to SQL group by, and can calculate mean, max, median, count, etc., for each group.

Q & A

  • What is the primary purpose of using the 'describe' function in a DataFrame?

    -The 'describe' function in a DataFrame is used to get a statistical summary of the dataset. It provides information such as the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each column.
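    As a quick illustration (the DataFrame here is invented, not the video's diabetes data):

    ```python
    import pandas as pd

    # Hypothetical data, for illustration only
    df = pd.DataFrame({"age": [25, 32, 47, 51], "score": [88.0, 92.5, 79.0, 85.5]})

    # describe() returns count, mean, std, min, the quartiles, and max
    # for each numeric column, indexed by statistic name
    summary = df.describe()
    print(summary)
    ```

    The result is itself a DataFrame, so individual statistics can be looked up with `summary.loc["mean", "age"]`.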

  • How can you access specific columns in a DataFrame?

    -You can access specific columns in a DataFrame using bracket notation (e.g., `df['column_name']`) or, if the column name is a valid Python identifier, using dot (attribute) notation (e.g., `df.column_name`).
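    A small sketch of both access styles, using made-up column names:

    ```python
    import pandas as pd

    df = pd.DataFrame({"age": [25, 32, 47], "blood pressure": [120, 130, 110]})

    # Bracket notation works for any column name
    ages = df["age"]
    # Dot notation works only when the name is a valid Python identifier
    same_ages = df.age
    # A name with a space (like "blood pressure") requires brackets
    bp = df["blood pressure"]
    ```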

  • What does the 'value_counts' function do in pandas?

    -The 'value_counts' function in pandas is used to get the count of unique values in a column. It returns a Series containing counts of unique values sorted in descending order.
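    A minimal sketch, with invented values (mirroring the repeated 4 mentioned in the video):

    ```python
    import pandas as pd

    col = pd.Series([4, 4, 2, 7, 4, 2, 9])

    counts = col.value_counts()                # absolute frequencies, descending
    props = col.value_counts(normalize=True)   # proportions that sum to 1
    ```

    Here `counts[4]` is 3, and `props[4]` is 3/7, roughly the "28%" figure the video refers to.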

  • How can you identify and handle missing values in a DataFrame?

    -Missing values in a DataFrame can be identified using the `isna()` or `isnull()` functions. To handle missing values, you can use the `fillna()` function to fill them with a specified value or `dropna()` to remove rows or columns containing missing values.
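    The three steps (detect, fill, drop) on a small hypothetical frame:

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

    n_missing = df.isna().sum()   # missing count per column
    filled = df.fillna(-1)        # replace NaN with a sentinel value
    dropped = df.dropna()         # keep only fully populated rows
    ```

    Whether filling with a sentinel like -1 is appropriate depends on what the missing value means in your data.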

  • What is the difference between 'merge' and 'join' in pandas?

    -Both 'merge' and 'join' are used to combine two DataFrames in pandas. The difference lies in the syntax and the default behavior. 'Merge' is more flexible and allows specifying the type of join (inner, left, right, outer), while 'join' uses the index of one DataFrame to merge with the columns of another by default.
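    A sketch of both, with invented frames and column names (`col1`/`col_new` stand in for the renamed key columns discussed in the video):

    ```python
    import pandas as pd

    left = pd.DataFrame({"col1": [1, 2, 4], "letter": ["a", "b", "c"]})
    right = pd.DataFrame({"col_new": [2, 4, 9], "letter3": ["x", "y", "z"]})

    # merge joins on columns; how= selects inner/left/right/outer
    inner = pd.merge(left, right, left_on="col1", right_on="col_new", how="inner")
    left_join = pd.merge(left, right, left_on="col1", right_on="col_new", how="left")

    # join, by default, aligns on the index of the other frame
    joined = left.join(right)
    ```

    The inner join keeps only the matching keys (2 and 4), while the left join keeps all three rows of `left`.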

  • How can you drop duplicates from a DataFrame?

    -Duplicates in a DataFrame can be dropped using the `drop_duplicates()` function. You can specify a subset of columns to check for duplicates if needed. By default, it keeps the first occurrence and drops the rest.
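    A sketch with a made-up frame containing the duplicated 4 from the video's example:

    ```python
    import pandas as pd

    df = pd.DataFrame({"col1": [1, 4, 4, 7], "letter": ["a", "b", "c", "d"]})

    full = df.drop_duplicates()                   # whole rows are unique: nothing dropped
    by_col = df.drop_duplicates(subset=["col1"])  # drops the second 4, keeps the first
    keep_last = df.drop_duplicates(subset=["col1"], keep="last")  # keeps the last instead
    ```

    Comparing shapes before and after (as the video suggests) is a quick way to check whether any duplicates existed.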

  • What does the 'groupby' function do in pandas?

    -The 'groupby' function in pandas is used to group rows that have the same value in specified columns and then apply a specified function to each group. It is useful for performing operations on grouped data, such as calculating aggregates.
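    A minimal sketch in the spirit of the video's high/low cholesterol comparison (the data and group labels here are invented):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "chol_group": ["high", "high", "low", "low"],
        "age": [60, 54, 41, 35],
    })

    # Like SQL GROUP BY + AVG: one mean age per group
    mean_age = df.groupby("chol_group")["age"].mean()
    # Other aggregates work the same way
    group_size = df.groupby("chol_group")["age"].count()
    ```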

  • How can you rename columns in a DataFrame?

    -Columns in a DataFrame can be renamed using the `rename` function. You provide a mapping of old column names to new column names. You can choose to modify the DataFrame in place or create a new DataFrame with the renamed columns.
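    A sketch using the video's `s1` → total cholesterol renaming (column names invented to match that example):

    ```python
    import pandas as pd

    df = pd.DataFrame({"s1": [1, 2], "s2": [3, 4]})

    # Map old column names to new ones; a new DataFrame is returned
    renamed = df.rename(columns={"s1": "total_cholesterol"})
    ```

    Without `inplace=True`, the original `df` still has its `s1` column.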

  • What is the significance of the 'inplace' parameter in pandas functions?

    -The 'inplace' parameter in pandas functions determines whether the operation should modify the original DataFrame or return a new DataFrame with the changes. If set to True, the original DataFrame is modified; if False, a new DataFrame is returned.
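    The two behaviors side by side, on a throwaway frame:

    ```python
    import pandas as pd

    df = pd.DataFrame({"a": [2, 1]})

    # inplace=False (the default): df is untouched, a new frame is returned
    sorted_copy = df.sort_values("a")

    # inplace=True: df itself is modified and None is returned
    result = df.sort_values("a", inplace=True)
    ```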

  • How can you sort a DataFrame based on specific column values?

    -You can sort a DataFrame based on specific column values using the `sort_values()` function. You specify the column name and the sorting order (ascending or descending). This function can also be used to sort by multiple columns.
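    A quick sketch (ages invented) showing ascending, descending, and multi-column sorts:

    ```python
    import pandas as pd

    df = pd.DataFrame({"age": [47, 25, 32], "score": [1, 2, 3]})

    asc = df.sort_values("age")                      # ascending by default
    desc = df.sort_values("age", ascending=False)
    multi = df.sort_values(["age", "score"])         # sort by several columns
    ```

    To restore the original row order afterwards, `sort_index()` can be used, as the video mentions.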

  • What does the 'drop' function do in pandas?

    -The 'drop' function in pandas is used to remove specified index or column labels from a DataFrame. It can be used to drop rows or columns based on labels or conditions.
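    A minimal sketch of dropping by label, on an invented frame:

    ```python
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    no_col = df.drop(columns=["b"])  # drop a column by label
    no_row = df.drop(index=[0])      # drop a row by index label
    ```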

Outlines

00:00

📊 Data Frame Exploration and Description

The paragraph discusses methods for working with large data frames, especially when dealing with millions of entries. It introduces the 'describe' function to quickly summarize data, providing insights into statistical measures such as mean, standard deviation, minimum, median, Q1, Q3, and maximum. The script also covers the importance of understanding data distribution and how to access and interpret this information. Additionally, it touches on the concept of data standardization and its impact on mean values, as well as the use of 'value_counts' to understand the frequency of values within a column.

05:03

🔄 Data Frame Manipulation: Dropping Duplicates and Merging

This section explains how to handle duplicates in a data frame using the 'drop_duplicates' method. It illustrates the process of identifying and removing duplicate entries, either based on entire rows or specific columns. The paragraph also delves into the concepts of merging and joining data frames, akin to SQL operations, using 'pd.merge'. It discusses the importance of renaming columns to ensure compatibility during merges and the various join types available, such as left join, right join, inner join, and how they affect the resulting data frame.

10:03

📈 Sorting and Handling Missing Values

The script covers the process of sorting data frames based on specific columns, using 'sort_values' to arrange data in ascending or descending order. It also addresses the challenge of handling missing values, or 'NaN', within data frames. The paragraph explains how to identify missing values using 'isna' or 'isnull', and discusses strategies for dealing with them, such as filling missing values with a specified number or dropping rows/columns that contain missing values. The importance of data cleaning in preparing data for analysis is emphasized.

15:03

🔄 Data Frame Modifications: Renaming and Grouping

This part of the script focuses on modifying data frames by renaming columns to improve clarity and understanding of the data. It introduces the 'rename' method and demonstrates how to map old column names to new ones. Additionally, the paragraph explores the 'groupby' function, which allows for the aggregation of data based on certain criteria. It provides examples of how to group data and perform operations such as counting, summing, finding the mean, median, or maximum within each group. The concept of creating new columns based on conditions and then grouping by these new categories is also discussed.

20:05

📋 Accessing and Analyzing Data Frame Columns

The final paragraph discusses accessing specific columns within a data frame and analyzing them individually. It explains how to use column names to extract data and perform operations such as calculating the mean, maximum, or median for that column. The script also touches on the use of 'groupby' in conjunction with accessing columns to perform more complex analyses, such as comparing the average age of groups with high versus low cholesterol levels. The paragraph concludes by emphasizing the versatility of data frame operations for in-depth data analysis.

Keywords

💡DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns potentially of different types. In the context of the video, DataFrames are used to manipulate and analyze large datasets. The script mentions sub-selecting a DataFrame and performing operations like describe, value counts, and merging, indicating that DataFrames are central to the video's theme of data analysis.

💡Describe

The 'describe' function in pandas provides a concise summary of the main statistics of a DataFrame's columns, including mean, standard deviation, min, max, and quartiles. It's used in the script to quickly understand the distribution of data within a DataFrame, which is crucial for initial data exploration.

💡Value Counts

Value counts is a method used to count the unique occurrences of each value in a column of a DataFrame. In the script, it's used to find out how many times each unique value appears in a dataset, which is helpful for understanding the frequency distribution of data.

💡Merge

Merging in pandas refers to the operation of combining two DataFrames together. The script describes how to merge DataFrames on a specific column, which is akin to a SQL join. This is an essential operation when dealing with relational data and is used to integrate information from different sources.

💡Join

Join is similar to merge and is used to combine rows from different DataFrames. The script explains how to perform a join operation based on a key column, which is a common task in data manipulation when trying to combine related data from different tables.

💡Drop Duplicates

The 'drop duplicates' function is used to remove duplicate rows from a DataFrame, ensuring each row is unique. In the script, it's used to clean the dataset by eliminating redundant data, which is a critical step in data preprocessing.

💡Normalize

Normalization is the process of scaling the values in a dataset to a common scale. The script mentions normalizing values to a range, which is a technique used to standardize the data and make it comparable, especially when dealing with different units or scales.

💡Group By

Group by in pandas is used to group the rows of a DataFrame that have the same value in a specified column. The script uses group by to perform aggregate functions like count, mean, max on grouped data, which is essential for analyzing patterns or trends within subsets of data.

💡Missing Values

Missing values refer to the absence of data in a DataFrame. The script discusses methods to handle missing values, such as filling them with a specific value or dropping rows/columns with missing values. This is an important aspect of data cleaning and preparation.

💡Rename

The 'rename' function in pandas is used to change the labels of the DataFrame's columns or index. In the script, rename is used to give more meaningful names to columns, which improves the readability and manageability of the data.

💡Data Cleaning

Data cleaning involves the process of removing incorrect, incomplete, or irrelevant data from a dataset. The script touches on various data cleaning techniques, such as handling missing values and removing duplicates, which are crucial steps in preparing data for analysis.

Highlights

Data frames with millions of rows can be challenging to work with, so sub-selecting a sample is often necessary.

The 'describe' function provides a quick statistical summary of a data frame's columns.

Mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum are included in the 'describe' function's output.

Value counts can be obtained using the 'value_counts' function to see the frequency of each unique value in a column.

Normalization can be done by passing `normalize=True` to the 'value_counts' function, which shows the proportion of each value in a column.

Accessing specific columns in a data frame can be done using bracket notation with the column name or, when the name is a valid identifier, using dot (attribute) notation.

The 'drop_duplicates' function is used to remove duplicate rows from a data frame.

Merging and joining data frames can be done using the 'pd.merge' function in pandas.

Renaming columns in a data frame can be achieved using the 'rename' function with a mapping dictionary.

The 'groupby' function allows for complex data aggregation similar to SQL's GROUP BY clause.

Missing values can be checked using the 'isna' function, which returns a boolean for each value indicating if it's null.

Filling missing values can be done with the 'fillna' function, which replaces NaN values with a specified value.

Dropping rows with missing values can be performed using the 'dropna' function.

Sorting values in a data frame can be achieved with the 'sort_values' function, which orders the data based on column values.

The 'info' function provides a concise summary of the data frame's columns, including data types and null values.

Grouping by multiple columns can be done using the 'groupby' function with a list of column names.

Aggregating data within groups can be performed using various functions such as 'mean', 'median', 'max', and 'count'.

The 'describe' function can be used to get a quick summary of the data frame's columns, including mean, max, and count.

Transcripts

play00:05

uh sometimes you will work with data

play00:07

frames that have like millions of rare

play00:09

could be so maybe you just want to work

play00:11

on a sample you know the data frame so

play00:13

you will sub select a simple

play00:16

um so there are different operation we

play00:18

can do one data set and I will go

play00:21

through them now so

play00:23

um it is there is quite a lot of

play00:26

function as I do so the first one is

play00:28

describe so if you do describe basically

play00:31

so I do DFW backs because it's a bit

play00:33

nicer there is more stuff to describe

play00:35

you know you know you don't really know

play00:37

how is the data frame looking right like

play00:40

this one you can see it in One Vision

play00:42

but the other one uh you don't already

play00:45

know right

play00:46

um so because you don't really know

play00:48

you're like oh I would like to get some

play00:50

information about it so how we get

play00:52

information about it

play00:54

um uh it is doing like this right so

play00:57

here do you see I have all my colleagues

play00:59

and for all my calendar you have some

play01:00

description so I have accounts mean a

play01:03

standard deviation the minimum 25 which

play01:06

is a person said 25 it's a median the Q3

play01:10

my quality 75 and my Max

play01:15

um so here I will have

play01:18

um my

play01:19

um my age

play01:21

uh here I see like the count is the same

play01:23

because I have the same number of row

play01:25

for every color then I have the mean of

play01:27

my age I have the sex Etc it's free and

play01:31

then uh I got it so I put into account

play01:34

that uh here

play01:37

um

play01:37

[Music]

play01:44

[Music]

play01:45

have been standardized right so that's

play01:48

why you see that the H doesn't really

play01:50

make sense you know so all the mean

play01:52

should be around zero uh yeah so that's

play01:56

uh what we more or less uh see uh here

play02:02

um

play02:02

yeah uh and here we see so S1 is

play02:06

basically if you look at the description

play02:07

it's like some cholesterol they say as

play02:10

the cholesterol blood level and some

play02:12

other uh things

play02:15

um yeah so this says uh how is it and uh

play02:21

and you can see minimum Etc maximum and

play02:23

if I go to my DF restaurants let's say

play02:26

and I do not describe uh I will see that

play02:29

I have something a bit different uh so

play02:31

it just describes the columns that are

play02:33

numerical right here a colon I got like

play02:35

string can describe and tell me how many

play02:38

I have

play02:39

um yeah so there is uh something very

play02:42

nice as well it's called value cons so

play02:44

if I go to my DF restaurant and I do

play02:47

um restaurants

play02:50

dot uh value accounts

play02:53

uh here it will tell me restaurant one I

play02:56

have one Value Restaurant to two value

play02:57

etc etc uh so let's say I'm going to my

play03:01

uh DF dict and I think I'm a colon one I

play03:05

have four four so yeah we're gonna go to

play03:07

my DF detect so I'm going to my DF dect

play03:09

I'm going to my call one and I do value

play03:12

cons so if I do value counts

play03:15

uh I can see that you have two value for

play03:18

four etc etc another uh quite nice stuff

play03:21

is to do normalize equal three and then

play03:24

it will show you in software U2 it would

play03:26

be two divided by the total right so

play03:28

here I see that uh 28 of my value in my

play03:31

colon 1 uh R40 and the rest is like 14

play03:35

uh so you do see that I can access my

play03:38

colon one uh with something like colon

play03:40

one like this or I can also access it uh

play03:44

with something like this dot colon one A

play03:47

Beat Lab for the dictionary right so

play03:49

remember when we're having like

play03:50

dictionary we could do like dictionary

play03:53

dot something uh to access the keys so

play03:56

it's a bit the same with the time frame

play03:58

so this is an attribute of an object

play04:00

that's why we can use this like dots uh

play04:03

annotation basically

play04:05

um then the second stuff is called drop

play04:08

duplicates

play04:09

so let's say I will have duplicates you

play04:11

know I could do DF uh dfdict

play04:15

uh dot drop duplicates I would do

play04:18

copying

play04:19

dot drop duplicates up and if I do drop

play04:22

duplicates it will and then I'm like is

play04:24

it the same shape as before you know so

play04:26

you want to check if there is duplicate

play04:28

so we do ease this

play04:31

the same as this so you know if when you

play04:34

drop the duplicate it is the same as

play04:35

before when you didn't drop the

play04:37

duplicates it means there is no

play04:38

duplicates so it is like you check all

play04:40

the duplicates could be that you want to

play04:43

check the duplicates in one column right

play04:45

so what you can do here is you can do a

play04:47

subset so you will only look in the

play04:50

column so we know that in the colon

play04:51

colon one there is like

play04:54

um a subset right uh so if I do DF dict

play04:57

I know that I have twice the same value

play04:59

in the column one year four and four so

play05:02

if I do drop duplicate generally I won't

play05:05

drop any duplicate because all all

play05:07

couple colon and letter are unique but

play05:10

then if I just look at my column one

play05:12

oh we see that uh I have two four so

play05:16

here if I put a subset I want to drop

play05:18

the duplicate only based on colon 1 I

play05:21

will drop one of these two value how do

play05:24

I drop usually I just skip the first one

play05:25

so I keep I will cut keep the first I

play05:28

will get the last Etc uh so this is how

play05:31

it works to like Drop duplicates and

play05:34

when I do drop duplicate so we remember

play05:36

we have different type of function some

play05:38

function modifies the entry answer

play05:40

function returns something you know it's

play05:42

the same as when we're dropping stuff at

play05:44

least sometimes we return a new list

play05:45

sometime it modifies the list itself so

play05:49

um here

play05:50

um it will return a new dictionaries

play05:52

that's what we see when we just execute

play05:54

this so we need to save this value of

play05:56

this dick somewhere so now I will get my

play05:59

DF newject which doesn't have duplicate

play06:02

so the shape of this one if I do the

play06:04

shape will be one less row because I

play06:08

here before I was having seven rows and

play06:10

if I drop to duplicate based on colon 1

play06:12

I will only have six row I have one row

play06:14

less because I have only one duplicate

play06:16

here uh so here's how this like a drop

play06:19

duplicate is working

play06:20

then there is something called merge and

play06:23

John so merge and join are a bit similar

play06:25

so maybe you remember from some SQL

play06:28

classes that we can merge and we can

play06:30

join some tables

play06:32

so how does it work to merge and join

play06:35

some table so let's say I have my DF new

play06:38

addict

play06:39

they have new decked and I just want to

play06:41

rename the columns right so I my colons

play06:44

are like call one and letter so I can

play06:46

access colon with like this dot colon

play06:48

um I want to call them all the way so I

play06:50

want to call the same column and I want

play06:52

the colon later three so just to have

play06:55

like new name of color so I have a new

play06:57

deck that is like this

play06:59

and I have my dick right so I have my DF

play07:02

dicks

play07:03

and what I would like to do I would like

play07:05

to merge them right

play07:07

how do I want to merge them so I could

play07:09

do pd.merge

play07:13

pdpd.merge so I will merge uh DF dict

play07:16

and DF new text so I will merge dfdict

play07:20

dfject and DF new duct so this is uh how

play07:25

I could do I could merge my two data

play07:26

frame so I would merge DFT and DF new

play07:29

text uh how how would I do this I would

play07:33

like to specify a join you know so I

play07:35

would Choice name on column one uh on

play07:38

left on so I will say uh inject you know

play07:42

I will join in column one left on

play07:45

left on colon one

play07:48

and right on

play07:53

right on

play07:54

colon

play07:56

uh so here I basically uh want to join

play08:00

to be met uh on colon one and collect uh

play08:04

colon one is like this

play08:06

so uh here I want to do a merge so on

play08:09

the left I have the F dict and on the

play08:12

right I have DF new deep and then I want

play08:15

to merge in on this column here

play08:17

on the left data frame it has the name

play08:19

column one and on the right data frame

play08:21

it has the name call entry if it has the

play08:24

same if I have the same column I could

play08:27

just also do like call one if they were

play08:30

having the same name right if I wouldn't

play08:31

have changed your name for instance so

play08:33

here I have

play08:34

so why do I get in my merge

play08:37

uh what is your shape so I have another

play08:38

frame of shape seven

play08:41

so how does it work uh I have here my

play08:44

one two three four four on my four I

play08:47

will merge three times this value

play08:49

because if this value and then I can

play08:52

specify hope you know so I could do a

play08:54

left John

play08:56

left John and if I do left join I have

play08:59

possessed

play09:01

if I do a right join

play09:05

then it looks like this I mean it would

play09:07

be the same right but oh I could do near

play09:09

John you

play09:11

if I do need to join uh it also works

play09:14

like this right so there is like always

play09:16

a different way of doing your inner join

play09:18

left of the join Etc so this is how it

play09:20

works and you could also do like

play09:23

um a join so the joint doesn't work like

play09:26

that so just do dfdig dot uh join

play09:31

Circle so do merch and then you do a

play09:34

sort of join with your stuff so this is

play09:36

also a possibility either you do

play09:38

pd.merge and you put your project or you

play09:41

dot merge and you put the other one

play09:44

um yeah so then something uh quite nice

play09:46

uh it's also to solve the values right

play09:49

so here let's say I have my DF diabet

play09:52

and I want to sort it per age so I will

play09:54

do sort values and when I do sold values

play09:57

I could put H and then uh the stuff will

play10:00

be solved by like the bigger edgier and

play10:02

the higher edge here uh yeah so this is

play10:05

uh how it works so we have our age I can

play10:08

also I have like a narrow of values this

play10:11

is possible and if I want to go back to

play10:13

my initial data frame I can solve my

play10:15

index and then I have zero one two three

play10:18

four five so either I sold the values or

play10:20

I saw the indexed with the number I have

play10:22

here

play10:23

um yeah so two frame we already see so

play10:25

it's what enabled us to grow from let's

play10:28

say a series uh to a data frame so this

play10:31

is also uh quite a practical thing

play10:35

um and then it happens sometime that we

play10:37

get novel

play10:38

uh so sometime in the data frame uh you

play10:41

will have enough so you could check

play10:43

ethernet and you do either and you do

play10:45

like some

play10:47

and you see so uh let's say in my DF

play10:51

restaurants I will add DF restaurants I

play10:55

will uh yeah I have DF restaurants uh

play10:59

and I will put some N A in it on my gift

play11:02

or I could just put in my DF dictum so I

play11:05

will get to new digs and uh the update

play11:09

where is my dick

play11:12

dictionary

play11:14

dictionary

play11:16

yeah here I can do a NP Dot and up and

play11:21

um

play11:23

ah yeah I have to import MP

play11:32

okay uh so here uh I need to add a more

play11:36

value yeah so here I have uh my new DF

play11:40

text which has a non-value in it right

play11:43

so here in my non-value I will have a

play11:46

non-value an empty value so I can check

play11:48

you know izena so izena uh for each

play11:52

values of my data frame will tell me if

play11:54

it's null or not I need to find on my uh

play11:59

other laptop how I do the wave yeah

play12:02

there

play12:03

so if I do this I would get the contrary

play12:05

so everything's at West false is no true

play12:08

and everything that was false true is no

play12:10

Force right so here's the contrary of it

play12:12

is I'm using The Little Wave

play12:14

um so this is it for the DFD now and now

play12:17

if I do sum I will get the number of non

play12:21

non-value in uh my DFD so ease and not

play12:25

DOT sum and then we see that I have one

play12:28

later that is null and zero value in my

play12:31

uh colon Colonel and then I can feel my

play12:35

value you know so sometimes when it is

play12:37

important when we do data cleaning uh is

play12:41

because we wouldn't do like a feeling a

play12:43

and I would do let's say I want to fill

play12:45

in there with minus one uh so if I do

play12:47

fill in there with minus one it's going

play12:49

to go into my data frame and it's going

play12:51

to fill my venue with minus one even

play12:53

though so this is quite practical

play12:55

because sometimes you have missing data

play12:57

so let's say

play12:58

um

play12:59

you have you know like a missing input

play13:02

and it has some meaning so you just want

play13:05

to remove it or you want to feel it so

play13:07

this is a way of doing data cleaning

play13:09

right you would just do this feeling a

play13:10

and you will feel with some values that

play13:12

you think are right uh so you can do

play13:14

filene with this stuff and this is

play13:17

returning a new data frame so you can

play13:19

save this new data frame or you can use

play13:22

the argument in place equal true and

play13:24

then it will not return something and it

play13:26

will directly modify you the object so

play13:28

then the gfd will be modified and if you

play13:31

don't put the inplex equal true so

play13:33

dfdict won't be modified

play13:35

then there is another stuff possible

play13:37

it's called dropping it so if you do

play13:40

drop in a it's a bit like uh it will

play13:43

drop all the values that I know uh so

play13:46

this uh R is a two-way if you want of

play13:49

dropping data and you could also sell

play13:50

like subset etc

play13:52

etc so you could also D I would just

play13:54

want to drop when this column is not

play13:56

this is also possible uh so yeah you

play13:59

have this like Drop n app that is

play14:00

possible and it's like filling it so

play14:03

these are two functions to like clean

play14:04

your data if you want you will check

play14:06

which value are null and then you will

play14:08

check if you need to drop them what you

play14:09

do with it

play14:10

so Swift scene and then there is another

play14:13

thing very practical which is called

play14:15

rename where we can just rename your

play14:17

colon right

play14:18

um so let's say instance in my GF

play14:21

diabetes I have my DF diabet and I want

play14:24

to do Renee so I have my renames and I

play14:26

want to provide a map so with my columns

play14:28

then it provides the all name with the

play14:30

new name so let's say I was having all

play14:32

name is one and I mean like is one uh

play14:35

for me have no meaning so I'll be okay

play14:37

S1 and I will do I know according to the

play14:40

democritation is a total cholesterol

play14:45

uh so I will do like this and it will

play14:47

return me a new data frame and this

play14:49

column will not be changed either I'm

play14:53

creating a new data set either I'm doing

play14:55

in place equal true so here no if I'm

play14:58

before I'm doing it if I'm calling my DF

play15:00

diabet my DF diabet is still equal to

play15:03

the same thing right it's still S1 but

play15:05

if I do in place equal true

play15:08

then my DFW bets I have changed uh the

play15:12

value in my data frame that's why this

play15:14

in place equal true mean it's like I'm

play15:16

changing the value in my data frame so

play15:19

we have the Strand name we have this job

play15:21

so we can drop some value we can drop in

play15:23

a etc we can drop the duplicates and we

play15:25

also have info so I can do all this like

play15:28

dot info and then you have in full of

play15:31

like

play15:32

um the type so you have the age you have

play15:34

the type you know it's a float you have

play15:36

to null and non-null value so here you

play15:38

see it's like a nice data frame

play15:40

everything is float uh we have every

play15:43

every time none their stuff so every

play15:45

there is no null value no problem with

play15:48

this uh so this

play15:50

um is it for the main description on

play15:53

that asset

play15:54

um then there is also something very

play15:57

practical in that asset it's called a

play15:59

group buy so Group by is a bit uh the

play16:02

same as a group buy or you will have

play16:05

um in let's say

play16:07

um

play16:10

in SQL so it works a bit the same right

play16:12

so uh Group by will work uh like uh

play16:16

let's say here I have my like DF dict oh

play16:20

I have my Geotech uh DF deck so remember

play16:23

it's looking like this I have two times

play16:25

four I will let's say Group by

play16:29

so I grew up by Michael on one

play16:33

I could buy my colon one group by Anita

play16:35

p so this is how it works and then I put

play16:38

an operation I want to do I want to do

play16:40

group by count

play16:42

uh so I'm like

play16:44

um DF did

play16:46

um

play16:47

is equal to this uh and uh

play16:51

And I get the count per value. It's a bit like my value_counts, right: you have my column one, and for each of its values I do an operation. If the column held numbers I could do mean, min, etc., but that doesn't work here because I don't have numeric values. So what I will do is go to the diabetes data.

for diabetes

play17:15

uh we got my daughter from David and

play17:17

then we look as a total cholesterol

play17:19

right so I will look at total

play17:20

cholesterol uh and then uh I will check

play17:23

like if it's like uh greater than a zero

play17:28

so let's say if it's greater than zero I

play17:30

have four Central right so I will check

play17:33

and I will create a new column so I can

play17:35

create a new colon in my data set so it

play17:37

will be like a high cholesterol let's

play17:39

say so let's say that if I have more

play17:41

than zero of cholesterol it means I'm

play17:45

more than the mean so

play17:47

um I have a high cholesterol

play17:49
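Creating that boolean column can be sketched like this (hypothetical centered values, where anything above 0 is above the mean):

```python
import pandas as pd

# Hypothetical centered values: anything above 0 is above the mean.
df_diabet = pd.DataFrame({"age": [0.05, -0.02, 0.01, -0.04],
                          "total_cholesterol": [0.3, -0.1, 0.2, -0.3]})

# New boolean column flagging above-average cholesterol.
df_diabet["high_cholesterol"] = df_diabet["total_cholesterol"] > 0
print(df_diabet["high_cholesterol"].tolist())   # [True, False, True, False]
```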

And then I can do df_diabet.groupby — I want to group by high_cholesterol — and then I do .mean(). And I will see that, for the ones that have high cholesterol, the age on average is bigger than for the ones who have low cholesterol. Sex — oh, it's encoded like one and zero for the two sexes, something like that. So it works and you do see different things. I think the main point here is that people who have high cholesterol are a bit older, so the age is greater on average than for the others.
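That groupby-mean step, sketched on a hypothetical frame where high-cholesterol rows are indeed older:

```python
import pandas as pd

# Hypothetical data: high-cholesterol rows have larger (older) age values.
df_diabet = pd.DataFrame({"age": [0.06, -0.02, 0.04, -0.06],
                          "high_cholesterol": [True, False, True, False]})

# Mean of every remaining numeric column within each group.
group_means = df_diabet.groupby("high_cholesterol").mean()
print(group_means)
```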

I could also look at the median, and so on: I can look at what the mean is in each column, or the max. Or I could look at the counts — how many people are in each group. So you see I have like 240 people that have less than zero cholesterol and 200 people that have more — I mean rows in my DataFrame, right. So this is how it works with groupby, and you could also group by other things.
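Counting rows per group can be done with size() — a sketch with a small hypothetical split standing in for the 240/200 rows in the lesson:

```python
import pandas as pd

# Hypothetical split roughly mirroring the 240 / 200 rows in the lesson.
df_diabet = pd.DataFrame({"high_cholesterol": [False] * 6 + [True] * 5})

# Number of rows (people) in each group.
sizes = df_diabet.groupby("high_cholesterol").size()
print(sizes)
```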

You could group by high cholesterol and, let's say, by S2 being too high as well. So I create a high_s2 column the same way — S2 greater than zero — and then I can group by both high_cholesterol and high_s2. This time it's a list we need to provide, so we use the brackets: groupby(['high_cholesterol', 'high_s2']).

So here you do see we have four different possibilities — for example not having high cholesterol and not having high S2 — and it takes the mean of the age for each combination. We see False/False is around zero, and False/True is a bit more, but still less than the others; these two values are quite similar, I mean False/True is higher than False/False. And you could also look at the median, the max, etc.
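Grouping by a list of keys can be sketched like this — high_s2 is an assumed column name, and the values are hypothetical:

```python
import pandas as pd

# Hypothetical frame with two boolean flags; high_s2 is an assumed name.
df_diabet = pd.DataFrame({"high_cholesterol": [True, True, False, False],
                          "high_s2":          [True, False, True, False],
                          "age": [0.08, 0.02, 0.01, -0.04]})

# A list of keys gives one group per combination (up to four here).
combo = df_diabet.groupby(["high_cholesterol", "high_s2"]).mean()
print(combo)
```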

So this is also something you could do. And you can also just access one column: let's say you're just interested in the age, you select it after the groupby — ah, I need a parenthesis here — so I can just access the age. This comes back as a Series, so I can add .to_frame() and it will look nicer. Or I select age and something else with brackets, put it here, and then I will not need to_frame.
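The two column-selection styles can be sketched like this (hypothetical values): one column gives a Series that to_frame() converts back, while a list of columns gives a DataFrame directly:

```python
import pandas as pd

df_diabet = pd.DataFrame({"high_cholesterol": [True, False, True, False],
                          "age": [0.06, -0.02, 0.04, -0.06],
                          "bmi": [0.1, 0.0, 0.2, -0.1]})

# Selecting one column after groupby returns a Series ...
age_mean = df_diabet.groupby("high_cholesterol")["age"].mean()
# ... which to_frame() turns back into a DataFrame for nicer display.
age_frame = age_mean.to_frame()

# Selecting a list of columns returns a DataFrame directly, no to_frame needed.
both = df_diabet.groupby("high_cholesterol")[["age", "bmi"]].mean()
print(age_frame)
print(both)
```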

So this is how you use groupby. It is very practical, it enables you to do more calculations, it works a bit like in SQL, and it performs operations like mean, max, median, count on groups. Note that on df_diabet itself — you remember I did .sum(), so you would have the sum here per column — you could also do .mean(), right. So you take df_diabet and you get the mean, you can get the max — the mean and the max for each column, which is a bit like the stuff you get in describe. So you can do these operations on your whole DataFrame, or you can also just select a column and do df_diabet['age'].mean().
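Those whole-frame aggregations can be sketched like this (hypothetical values):

```python
import pandas as pd

df_diabet = pd.DataFrame({"age": [0.1, 0.2, 0.3],
                          "total_cholesterol": [-0.1, 0.0, 0.4]})

print(df_diabet.sum())           # sum per column
print(df_diabet.mean())          # mean per column, like a slice of describe()
print(df_diabet.max())           # max per column
print(df_diabet["age"].mean())   # or just one column
```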
