Dataframes Part 02 - 01/03

Develhope
14 Oct 2022 13:57

Summary

TLDR: The script is a tutorial on handling DataFrames in Python using pandas. It covers loading data, selecting specific columns, and accessing rows. The instructor demonstrates creating DataFrames from existing data and from dictionaries, and explains the difference between a Series and a DataFrame. They also show how to modify the index, use methods like `head` and `tail` to view rows, and use `shape` to determine the dimensions of a DataFrame.

Takeaways

  • 📊 **Data Frame Loading**: The script starts by loading a dataset and keeping only its data part (via `.data`) in a DataFrame called 'diabetes'.
  • 🔑 **DataFrame Identification**: A DataFrame is recognizable by its index and its tabular layout, similar to an Excel sheet with rows and columns.
  • 🔒 **DataFrame Structure**: The 'diabetes' DataFrame contains columns such as age, sex, BMI, and BP, plus several other numeric columns that the speaker refers to as 'probabilities'.
  • 📚 **Building DataFrames**: DataFrames can be built from datasets shipped with existing libraries, from APIs, from dictionaries, or from CSV files.
  • 🛠️ **DataFrame from Dictionary**: A DataFrame can be built from a dictionary where keys are column names and values are lists of data.
  • 📈 **Series vs DataFrame**: Accessing a single column from a DataFrame results in a Series, while accessing multiple columns retains the DataFrame structure.
  • 🔑 **Accessing Columns**: Columns in a DataFrame are accessed with square brackets and the column name, similar to accessing keys in a dictionary.
  • 🔍 **Accessing Rows**: Rows in a DataFrame can be accessed with the `.loc` or `.iloc` indexers, with `.loc` using index labels and `.iloc` using integer positions.
  • 📊 **Displaying Data**: The `.head()` and `.tail()` methods display the first or last few rows of a DataFrame, which is useful for quick data inspection.
  • 📏 **DataFrame Shape**: The `.shape` attribute provides the dimensions of the DataFrame as the number of rows and the number of columns.

Q & A

  • What does 'dot data' refer to in the context of loading a dataset?

    -In the context of loading a dataset, 'dot data' refers to accessing the 'data' attribute of an object, which typically contains the actual data within a dataset, excluding additional information such as descriptions.

  • How is a DataFrame represented visually in Python's pandas library?

    -A DataFrame in pandas is displayed with an index and columns, similar to an Excel sheet. In the notebook shown in the video, rows appear in alternating gray and white bands for readability, and only the first five and last five rows are shown when the DataFrame is too large to display in full.

  • What is the significance of the index in a pandas DataFrame?

    -The index in a pandas DataFrame is significant as it labels the rows and allows for efficient data retrieval. By default, it starts at 0 and increments by 1, but it can be customized to start at different values or use different labels.

  • How can you create a DataFrame from a dictionary in pandas?

    -You can create a DataFrame from a dictionary by using the `pd.DataFrame()` function, where the dictionary's keys become the column names and the values become the data in the columns.

  • What must be true for all arrays when creating a DataFrame from a dictionary?

    -When creating a DataFrame from a dictionary, all arrays (lists of values for each column) must have the same length, otherwise pandas will raise an error because it requires uniformity in the size of the data.

  • What is the difference between a Series and a DataFrame in pandas?

    -A Series is a one-dimensional labeled array that behaves like a column in a DataFrame. A DataFrame is a two-dimensional labeled data structure with columns that can be of different types. Selecting a single column from a DataFrame results in a Series.

  • How do you access a single column from a DataFrame?

    -To access a single column from a DataFrame, you use the DataFrame name followed by the column name in square brackets, similar to accessing a key in a dictionary.

  • What is the 'iloc' function used for in pandas DataFrames?

    -The 'iloc' indexer in pandas is used for integer-location based indexing, that is, selection by position. It allows you to access rows by their integer position, which is useful when you don't know the label of the row but know where it sits.

  • How can you view the first few rows of a DataFrame using a method?

    -You can view the first few rows of a DataFrame using the 'head()' method. By default, it shows the first five rows, but you can specify a different number to see more or fewer rows.

  • What does the 'shape' attribute of a DataFrame return and what does it represent?

    -The 'shape' attribute of a DataFrame returns a tuple where the first element is the number of rows and the second element is the number of columns, representing the dimensions of the DataFrame.

Outlines

00:00

📊 Data Frame Initialization and Exploration

The speaker begins by loading a dataset into a DataFrame, leaving out extra material such as the dataset description. They extract the core data and store it in a DataFrame named 'diabetes'. The speaker then explains how DataFrames are displayed, comparing them to Excel sheets and pointing out the index and the alternating row shading. They note that when a DataFrame is too large, only the first and last few rows are shown. The lecture also refers to a second DataFrame, 'DF restaurants', introduced in an earlier session. The speaker then builds a DataFrame from a dictionary, working through an example and explaining that every list in the dictionary must have the same length. They also mention the `pd` alias for the pandas library and the possibility of creating DataFrames from CSV files or APIs.
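As a minimal sketch of the dictionary route described above: the names `dict_1` and `df_dict` and the sample values are assumptions based on the spoken example, not the instructor's exact code.

```python
import pandas as pd

# Hypothetical data mirroring the spoken example: two columns, seven values each.
dict_1 = {
    "col_1": [1, 2, 3, 4, 5, 6, 7],
    "letter": ["a", "a", "b", "c", "c", "d", "e"],
}

# Keys become column names, lists become column values.
df_dict = pd.DataFrame(dict_1)
print(df_dict.shape)

# Lists of different lengths are rejected with a ValueError.
bad = {"col_1": [1, 2, 3], "letter": ["a", "b"]}
try:
    pd.DataFrame(bad)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Catching the exception here is only for demonstration; in a notebook the error would simply be shown in the cell output.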

05:00

🔑 Accessing DataFrame Elements

In this section, the speaker discusses how to access elements within a DataFrame, drawing parallels with dictionary access. They explain the difference between selecting a single column (which returns a Series) and selecting multiple columns (which still returns a DataFrame). The speaker stresses the importance of knowing whether the object at hand is a Series or a DataFrame, especially when performing operations between them. They also cover accessing specific rows with the `.loc` indexer and how to modify the index of a DataFrame. The speaker gives examples of accessing rows by index label and by position, highlighting the difference between `.loc` and `.iloc`.
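The single-bracket versus double-bracket distinction can be sketched as follows. The values are made up for illustration; only the column names echo the video's diabetes table.

```python
import pandas as pd

# Hypothetical values; only the column names come from the video.
df = pd.DataFrame({"age": [59, 48, 72], "sex": [2, 1, 2], "bmi": [32.1, 21.6, 30.5]})

age = df["age"]                # single brackets -> Series
age_sex = df[["age", "sex"]]   # a list of names inside the brackets -> DataFrame
age_frame = age.to_frame()     # Series -> one-column DataFrame

print(type(age).__name__, type(age_sex).__name__, type(age_frame).__name__)
```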

10:01

🔎 Advanced DataFrame Navigation

The speaker continues by introducing advanced methods for navigating DataFrames, such as using '.iloc' for accessing rows by their position and '.head()' for viewing the first few rows of a DataFrame. They also mention the '.tail()' method for accessing the last few rows. The section covers the '.shape' attribute, which provides the dimensions of the DataFrame, and the speaker provides a practical example of creating a function to print the shape of a DataFrame. The speaker concludes by emphasizing the importance of knowing the size and structure of a DataFrame for efficient data manipulation and analysis.
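A possible version of the shape-printing helper mentioned above is sketched below. The function names and the exact wording of the message are assumptions; the video only describes an f-string built from `df.shape[0]` and `df.shape[1]`.

```python
import pandas as pd


def shape_message(df: pd.DataFrame) -> str:
    """Build a one-line description of a DataFrame's dimensions."""
    n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
    return f"The data frame has {n_rows} rows and {n_cols} columns"


def print_shape(df: pd.DataFrame) -> None:
    """Print the description built by shape_message."""
    print(shape_message(df))


# Hypothetical seven-row, two-column frame, like the dictionary example in the video.
example = pd.DataFrame({"col_1": range(7), "letter": list("aabccde")})
print_shape(example)
```

Splitting the message from the printing is a design choice made here so the text can be reused or checked; a single function that prints directly would match the video more literally.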

Keywords

💡DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns potentially of different types. In the context of the video, it is a fundamental concept used for data manipulation and analysis. The script describes how to create a DataFrame from various sources like a dictionary or a CSV file, and how to manipulate it using methods like `.head()` or `.tail()`.

💡Pandas

Pandas is an open-source Python library used for data manipulation and analysis. It provides data structures and functions needed to manipulate structured data, making it a key tool for data scientists. The script mentions using Pandas to create DataFrames and Series, which are central to the video's demonstration.

💡Series

A Series in Pandas is a one-dimensional labeled array capable of holding any data type. In the script, it is mentioned as the result of selecting a single column from a DataFrame, which is a crucial distinction because Series and DataFrames have different methods and properties.

💡Index

An Index in Pandas is used to label the rows in a DataFrame. The script explains how to access specific rows using the index labels, and how the index can be manipulated or changed to suit different data structures or preferences.
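To illustrate how changing the index affects label-based access, here is a small sketch. The frame and its values are assumptions modeled on the video's seven-row example.

```python
import pandas as pd

# Hypothetical seven-row frame modeled on the video's dictionary example.
df_dict = pd.DataFrame({
    "col_1": [1, 2, 3, 4, 5, 6, 7],
    "letter": ["a", "a", "b", "c", "c", "d", "e"],
})
default_index = list(df_dict.index)  # starts at 0 and increments by 1

# Replace the default labels with 2..8.
df_dict.index = [2, 3, 4, 5, 6, 7, 8]
first_row = df_dict.loc[2]  # the first row is now named 2

# Label 0 no longer exists, so .loc[0] raises a KeyError.
try:
    df_dict.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

print(default_index, first_row["letter"], missing_label)
```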

💡iloc

`iloc` is a pandas indexer for integer-location based selection, that is, selection by position. The script uses `iloc` to access rows by their integer position, which is useful when the index labels are unknown or after the rows have been shuffled.
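The contrast with label-based `loc` can be sketched like this. The frame is hypothetical, and reversing the rows stands in for the shuffling mentioned in the video so that the result stays deterministic.

```python
import pandas as pd

# Hypothetical frame whose index labels start at 2 rather than 0.
df = pd.DataFrame({"letter": ["a", "b", "c", "d"]}, index=[2, 3, 4, 5])

# Positions 0 and 1 correspond to labels 2 and 3 in the original order.
same_rows = df.iloc[[0, 1]].equals(df.loc[[2, 3]])

# After reordering, a label still finds the same row but a position does not.
reordered = df.iloc[::-1]
by_label = reordered.loc[2, "letter"]       # still the row named 2
by_position = reordered.iloc[0]["letter"]   # whichever row is now first

print(same_rows, by_label, by_position)
```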

💡head

The `head` method in Pandas returns the first n rows for the object based on position. It's used in the script to quickly view the first few entries of a DataFrame, which is helpful for getting an initial sense of the data's structure and contents.

💡tail

Similar to `head`, the `tail` method returns the last n rows for the object. The script uses `tail` to demonstrate how to view the last entries of a DataFrame, which is useful for checking the end of a dataset or for datasets that have been manipulated.
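A short sketch of both methods on a made-up 100-row frame (the data is an assumption chosen so the row counts are easy to check):

```python
import pandas as pd

# Hypothetical frame with 100 rows numbered 0..99.
df = pd.DataFrame({"value": range(100)})

first_five = df.head()    # default n is 5
first_ten = df.head(10)   # explicit n
last_ten = df.tail(10)    # last 10 rows, original order preserved

print(len(first_five), len(first_ten), last_ten["value"].tolist())
```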

💡shape

The `shape` attribute in Pandas returns a tuple representing the dimensionality of the DataFrame, specifically the number of rows and columns. The script uses `shape` to describe the size of the DataFrames, which is important for understanding the scale of the data being worked with.

💡Dictionary

A dictionary in Python is a collection of key-value pairs. In the script, dictionaries are used to create DataFrames, demonstrating how data can be structured and then converted into a format suitable for analysis with Pandas.

💡Data Manipulation

Data Manipulation refers to the process of transforming, modifying, or reorganizing data. The script covers basic aspects of data manipulation in Pandas, including selecting columns and rows, changing a DataFrame's index, and inspecting its size.

💡API

An API (Application Programming Interface) allows different software applications to communicate with each other. The script mentions loading data from an API, which is a common method for acquiring data for analysis in data science projects.

Highlights

Loading a dataset and extracting data using '.data'

Creating a DataFrame named 'diabetes'

Explanation of DataFrame structure and display

Understanding the DataFrame index and the alternating row shading in the display

Building a DataFrame from a dictionary

Importing the Pandas library under the alias `pd`

Error raised when building a DataFrame from a dictionary whose lists have unequal lengths

Accessing DataFrame columns using column names

Difference between Series and DataFrame

Transforming a Series into a DataFrame using 'to_frame()'

Accessing DataFrame rows using '.loc'

Using '.iloc' to access DataFrame rows by position

Difference between '.loc' and '.iloc' for accessing rows

Using 'head' to display the first few rows of a DataFrame

Using 'tail' to display the last few rows of a DataFrame

Determining DataFrame shape with '.shape'

Creating a function to print DataFrame shape

Transcripts

00:05

When we load this dataset and look through it, we see that there is some data here, but there is also other information, like the description. We just want the data, so what we do is `.data`, and then we get the data frame. So our data frame diabetes is going to be equal to this, and then I have my data frame diabetes.

00:30

So this is how it works. I'm going to make a bit more room so you can see more. Here we have the data frame `df_diabetes`. You know it is a data frame because it has this index that we see here, and it has these gray and white lines so you can read it more easily. It is a nice display that looks a bit like an Excel table, an Excel sheet for instance.
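The `.data` step described above can be sketched as follows. The video does not show which loader is used, so this example builds a hypothetical stand-in object with a `data` table and a `DESCR` text; a loader such as scikit-learn's `load_diabetes(as_frame=True)` returns an object of this general form, but that is an assumption about the instructor's setup.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-in for a loaded dataset bundle: a table plus extra metadata.
bundle = SimpleNamespace(
    data=pd.DataFrame({"age": [0.04, -0.002, 0.09], "bmi": [0.06, -0.05, 0.04]}),
    DESCR="Free-text description of the dataset.",
)

# Keep only the tabular part and ignore the description.
df_diabetes = bundle.data
print(type(df_diabetes).__name__, df_diabetes.shape)
```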

00:54

Here you see that the index goes from 0 to 441, so there are 442 lines. You don't see the middle: here is the first part of your columns and here are the last lines of your columns. Because it is too big, Python decided to display just the first five lines and the last five lines together, so you can get an idea of what is inside.

01:16

So this is how we build this diabetes data. In the first lecture I said we would use this little `df_restaurants` example, but we will also use this `df_diabetes`. You see what is in here: you get the age, the sex, the BMI, the BP, and then the different probabilities. So this is how it works.

01:41

There is another possibility to build a data frame: it can be made from a dictionary. Let's say I'm building a dictionary. It's going to have a key "col one", and in my column one I will put one, two, three, four, five, six, seven, eight, nine, ten, zero. This is going to be my `dict_1`, and then I want to build a data frame from this dict, so it's just going to be one column.

02:17

How would I do that? I go to my library pandas, so `pd`, because that is the alias I gave it when I imported it. Then I use the pandas `DataFrame`, `from_dict`, and my dict is `dict_1`. So here I have `pd`, I want to create a data frame, and then I pass my `dict_1`. What is the output? The output is my column one and the values.

02:47

Then, if I add another column, I need to put its name as a string. It's going to be called "letter", and in my letter column I will have something like a, a, b, c, c, d, e. How many is that? One, two, three, four, five, six, seven. So I need to end the first list at seven as well.

03:26

So here I can build another dictionary. If the lists don't have the same size, then I have issues, because it will tell me that this one and this one don't have the same size. That is what the message says: all arrays must be of the same length. So when you build from a dictionary, you put the name of the first column, you put the content as a list, and all these arrays have to have the same size; otherwise it is not possible to build your [inaudible]. You can also see that `from_dict` has an orientation option, with the keys used as columns and so on; these are different parameters.

03:59

So this is another way. We're going to call it `df_dict`, and this is how it works. We have our `df_dict`, and now I can call it again. So you can check `df_dict`, `df_diabetes`, and my `df_restaurants`. I have all of them, I'm happy.
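The `from_dict` call and its orientation option mentioned above can be sketched as follows. The sample dictionaries and the row labels `row_a` and `row_b` are assumptions for illustration.

```python
import pandas as pd

# Default orientation: each key becomes a column.
dict_1 = {"col_1": [1, 2, 3], "letter": ["a", "b", "c"]}
by_columns = pd.DataFrame.from_dict(dict_1)  # same as orient="columns"

# orient="index": each key becomes a row label, so column names are given separately.
rows = {"row_a": [1, "a"], "row_b": [2, "b"]}
by_rows = pd.DataFrame.from_dict(rows, orient="index", columns=["col_1", "letter"])

print(by_columns.shape, by_rows.shape)
```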

04:23

So here we built a dataset by loading something that already exists in some library. I can get it from an API, or I can build it from a dictionary, or I can build it from a CSV. You can explore the options and read the documentation to see how you can build data frames in different ways.
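For the CSV route, which the video only mentions in passing, a minimal sketch could look like this. The CSV text, column names, and values are invented for illustration, and an in-memory buffer is used instead of a file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path.
csv_text = "name,rating\nTrattoria,4.5\nBistro,3.8\n"

# read_csv accepts a path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)
```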

04:39

So you see the structure: it's these columns and these rows. When I'm calling my `df_dict`, I'm accessing all of my data frame, but maybe I want to access only a small part of it. I would like to access maybe just the first column, so how do I do that? To access the first column I will just do `df_dict` of "col one", a bit like for a dictionary. You remember that for a dictionary, if I want the content of column one, I do `dict_1` of "col one".

05:17

So you see that when I do `dict_1` of "col one", the content is the list. When I do `df_dict` of "col one", because `df_dict` is a data frame, I have the same content, but what we get here is called a series. So this is a data frame, and if you select only a single column from a data frame, then it is called a series. This is how you access a column.

05:40

Now, if you want to access several columns, for example with `df_diabetes` where we take age and sex, you need to use double brackets. Because the output is not a single column but two columns, it is a data frame.

06:13

As a note, you can easily transform a series into a data frame: there is a method called `to_frame`, and then it is a data frame. So you see the difference between a series, which is written like this without the gray and white lines, and a data frame.

06:32

It is important to know what type you are manipulating. If you want to do an operation between data frames, or aggregate two columns, it has to be two objects of the same type: two series or two data frames. If you want to mix this one with this one and do an operation between them, they have to be data frames, basically.

06:52

So this is a series, this is a data frame, and you can use this `to_frame` function to bring a single column that is a series to a data frame. This is how it works to access a column. It's a bit like in Excel, where you can just select one column, and we do it a bit like with a dictionary, so the syntax is more or less the same as for the dictionary type.

07:16

What is new is this: I have my `df_diabetes` and I want to access maybe a single row. With a list, I would like to do `df_diabetes` of zero, but that would only work if there was a column named zero. So how do I do this? I do `.loc`. When I do `df_diabetes.loc` of zero, I go to the index that is there and I select the row that has the name zero. It could be that my index is different, right?

07:55

To access my index I do `df_diabetes.index`, and I see it is a RangeIndex with a start, a stop, and so on. I can also go to my `df_dict` and do `.index`, and I see that it starts at zero and stops at 7. Maybe I just want to modify it. It started at zero, and let's say I want it to start somewhere else, with seven values: one, two, three, four, five, six, seven.

08:28

So now I changed it. If I look at `df_dict.index`, the index is like this. And if I display `df_dict`, I see that before I had zero, one, two, three, four, five, six. By default it is auto-incremented and starts at zero if not specified otherwise. Here I set an index: I said I want two, three, four, five, six, seven, eight as an index. So now my index is two, three, four, five, six, seven, eight, followed by column one and letter.

09:02

If I do `df_dict.loc` of zero, this would have worked before, but now it tells me zero is not in the index: KeyError zero, meaning zero is not an index label. So I look again and I see that this first row has a name, and the name of my first row is two. So I need to do `loc` of 2, and if I do `loc` of two you will see that I'm accessing this first line. If I want to access this line here, I do `loc` of 5.

09:30

Perfect. So we use single brackets to access a column, or double brackets if you want several columns. If I want to access a line or several lines, I can do `loc` of five, and I can also put several numbers in there.

09:45

But suppose I want to access the line that comes first, no matter what, and I don't know its name. Then there is something a bit like `loc`, and it's called `iloc`. For `iloc` in pandas you just put the position. If you do `.iloc` of zero, this works and you get the first row. If you want the second row, you do `.iloc` of one.

10:14

So you can really play with `iloc` and put different positions. If you want the first two rows, you put zero and one, and then you get the first two. This is the same as using `loc` with the names: the names of positions 0 and 1 are two and three, so I do `loc` of two and three, and this gets me the same rows.

10:36

So either I use `iloc`, which is the index of the location if you want, or I use `loc` and look for the name. With `loc`, this will always return the same thing. But let's say I'm shuffling my data frame and moving rows around: then what `iloc` returns could be different, and I don't know exactly what I get as a result. So this is the difference between `loc` and `iloc`, and this is how I can access rows and how I can access columns.

11:04

Let's say I have this big data frame, my `df_diabetes`, and I can print everything. There is also a function called `display`, and we can use `display` as well to show everything.

11:20

But there is something very practical, and it is called `head`. `head` is a method that applies to my data frame diabetes. How does it work? It is basically the head of my data frame: it shows me the first five rows. If I put five it's five, and by default it's five. I can put more: if I want to see the top 10 rows I do `.head` of 10, and then I see the top 10 rows. If I put 20, I get the top 20 rows, but that is not very practical because here it can only show 10 rows at most.
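The 10-row cap mentioned above is presumably a display setting in the instructor's environment rather than a limit of `head` itself; that reading is an assumption. The sketch below uses pandas' `display.max_rows` option on a made-up frame to show how such a cap truncates the printed output while the returned DataFrame still has 20 rows.

```python
import pandas as pd

# Hypothetical frame with 100 rows.
df = pd.DataFrame({"value": range(100)})
top_20 = df.head(20)  # always returns 20 rows

# With a cap of 10 rows, the printed form is truncated.
with pd.option_context("display.max_rows", 10):
    truncated_text = repr(top_20)

# With no cap, all 20 rows are printed.
with pd.option_context("display.max_rows", None):
    full_text = repr(top_20)

print(truncated_text)
```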

11:54

Then there is a thing called `tail`, and `tail`, as you see, shows me the last rows. If I do `tail` of 10, I get my last 10 rows. So this is how this works.

12:07

Then we have a thing called the shape. The shape is how big the data frame is. If I do `df_diabetes.shape`, it gives me two numbers. The first number is the number of rows, so it's 442 rows, and I have 10 columns. If I go to my `df_dict` and do `.shape`, I get 7 and 2.

12:31

So now I could write a function, `def print_shape`. I take a `df` as input and then I print an f-string, as we all remember how to write f-strings. I write "the data frame has", and here I put `df.shape` of zero, which is the number of rows, then "rows and", then `df.shape` of one, which is the number of columns.

13:20

Now I do `print_shape` of my `df_diabetes`, and I adjust the text a bit so it reads more nicely: the data frame has 442 rows and 10 columns. If I call `print_shape` on something else, on my `df_dict` from above, I see I have seven rows and two columns. So this is how to print the shape of the data frame and know how big it is.


Related Tags
Data Analysis, Python Tutorial, Pandas Library, DataFrame Manipulation, Data Science, Coding Skills, Data Structures, Machine Learning, Data Visualization, Programming