EDA - part 1
Summary
TL;DR: In this Python class lecture, the focus is on a practical application of pandas for data manipulation and visualization. The lecturer guides through a real-life case study using a house prices dataset. Key topics include data cleaning, exploratory data analysis, and creating various graphs using the matplotlib and seaborn libraries. The session covers handling missing values, analyzing the impact of different features on sale prices, and introduces basic plotting techniques.
Takeaways
- 📊 This lecture focuses on practical case studies using pandas for data manipulation and visualization with a real estate dataset.
- 🏠 The dataset explores various factors affecting house prices, emphasizing the importance of data cleaning and exploratory data analysis (EDA).
- 📈 Visualization is a key component, teaching how to create different types of graphs to represent data insights.
- 📂 The lecture demonstrates how to import data from a CSV file, emphasizing the use of relative paths for file locations.
- 🔍 Data exploration techniques such as `head()`, `tail()`, and `shape` are covered to understand the dataset's structure and contents.
- 🧹 The importance of data cleaning is highlighted, including checking for and handling duplicates using `drop_duplicates()`.
- 📊 `describe()` function is used to get an overview of the dataset's statistics, helping to understand data distribution and identify outliers.
- 🕵️‍♂️ The script discusses checking for missing values using `isnull()` and `sum()`, which is crucial for accurate data analysis.
- 📉 A demonstration on plotting missing values using matplotlib to create bar graphs, showing which features have missing data.
- 🏡 The lecture explores how to fill missing values, using techniques like filling missing alley types with 'No Alley' as an example.
- 📊 Grouping data by categories (like Alley, Fence, or Bedroom) and calculating statistics (like mean or median) to find patterns or relationships with the sale price.
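The first-look calls mentioned in these takeaways can be sketched as follows. The frame below is a tiny synthetic stand-in for the Kaggle house-prices data (column names echo the real dataset; the values are invented):

```python
import pandas as pd

# A tiny synthetic stand-in for the house-prices dataset.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [None, "Grvl", None, "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

print(df.head(2))   # first rows, for a quick look at the columns
print(df.shape)     # (rows, columns)
print(df.dtypes)    # per-column data types
```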
Q & A
What is the main focus of the lecture series on Python classes?
-The main focus of the lecture series is to explore practical cases with Python, specifically using the pandas library for data manipulation and visualization.
What dataset is used in the lecture for practical case studies?
-The dataset used in the lecture is about house prices, which depends on a variety of factors, and is intended for data cleaning and exploratory data analysis.
How can one obtain the house price dataset mentioned in the lecture?
-The house price dataset can be obtained from sources like Kaggle or Google Dataset Search. The lecturer downloaded a CSV file from the web and saved it locally for the lecture.
What is exploratory data analysis and why is it important?
-Exploratory data analysis (EDA) is the process of using statistics and visualizations to discover patterns in data. It's important for understanding the characteristics of a dataset and informing further analysis or modeling.
How does one check for duplicates in a pandas DataFrame?
-One common check is to call `drop_duplicates()` and compare the result's shape with the original DataFrame's shape: if the shapes match, there were no duplicate rows. (Duplicates can also be flagged directly with `duplicated()`.)
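A minimal sketch of this check, on a made-up three-row frame (the column names are illustrative, not the real dataset):

```python
import pandas as pd

# Two of the three rows are identical, simulating an accidental double-save.
df = pd.DataFrame({"Id": [1, 2, 2], "SalePrice": [100000, 200000, 200000]})

n_dupes = df.duplicated().sum()   # count of repeated rows
deduped = df.drop_duplicates()    # new frame with repeats removed

print(n_dupes, df.shape, deduped.shape)
```

Note that `drop_duplicates()` returns a new DataFrame; the original is untouched unless you reassign it.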
What is the significance of checking for null values in a dataset?
-Checking for null values is significant because it helps identify missing data which can affect the accuracy of statistical analysis. It's a part of data cleaning to ensure the quality of the dataset.
How can one visualize the count of missing values in different columns of a DataFrame?
-One can visualize the count of missing values using a bar plot with matplotlib. The columns can be on the x-axis and the count of missing values on the y-axis.
What does the term 'normalize' mean in the context of value counts?
-In the context of value counts, 'normalize' means to scale the counts to represent proportions rather than absolute numbers, providing a percentage distribution of the unique values.
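A small sketch of the difference, using a synthetic Alley-like column (the 93/4/3 split is made up to echo the proportions mentioned in the lecture):

```python
import pandas as pd

alley = pd.Series(["No Alley"] * 93 + ["Grvl"] * 4 + ["Pave"] * 3)

counts = alley.value_counts()                # absolute counts
shares = alley.value_counts(normalize=True)  # proportions summing to 1

print(counts["No Alley"])        # 93
print(round(shares["Pave"], 2))  # 0.03
```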
How can one analyze the impact of different parameters on the sale price in the dataset?
-One can analyze the impact of different parameters on the sale price by using group by operations to calculate statistics like mean or median within categories defined by those parameters.
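For instance, a hedged sketch of such a group-by, with invented prices (the real dataset's values will differ):

```python
import pandas as pd

df = pd.DataFrame({
    "Alley": ["Pave", "Pave", "Grvl", "Grvl"],
    "SalePrice": [170000, 166000, 120000, 124000],
})

# Average and median sale price within each alley category.
mean_by_alley = df.groupby("Alley")["SalePrice"].mean()
median_by_alley = df.groupby("Alley")["SalePrice"].median()

print(mean_by_alley)
```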
What libraries are mentioned for data visualization in Python?
-The libraries mentioned for data visualization in Python are matplotlib, seaborn, and plotly.
What is the role of seaborn in data visualization compared to matplotlib?
-Seaborn is built on top of matplotlib and is designed to provide more attractive and informative statistical graphics. It is easier to use for creating complex visualizations and is well adapted for working with pandas DataFrames.
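To make the comparison concrete, here is a minimal seaborn sketch (assuming seaborn is installed; the data and output file name are invented). A single `barplot` call accepts the DataFrame directly and handles the grouping and axis labeling that would take several lines of raw matplotlib:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Alley": ["Grvl", "Grvl", "Pave", "Pave"],
    "SalePrice": [120000, 124000, 166000, 170000],
})

# seaborn groups by the x column and draws mean bars for us.
ax = sns.barplot(data=df, x="Alley", y="SalePrice")
ax.set_title("Mean sale price by alley type")
plt.savefig("alley_barplot.png")
```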
Outlines
📊 Introduction to Practical Python Pandas Case Study
This paragraph introduces the final chapter of a lecture series on Python classes, focusing on a practical case study using the pandas library. The lecturer recaps the previous lecture on pandas, emphasizing the structure and manipulation of data frames. The current lecture aims to apply these concepts to real-life data, specifically house prices, and introduces the concept of data visualization. The data set is sourced from the internet, likely from platforms like Kaggle or Google Datasets. The lecturer outlines the plan for the lecture, which includes data cleaning, exploratory data analysis (EDA), and various types of graph creation. The process starts with importing necessary libraries and loading data from a CSV file located in a 'Data houses' folder.
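A runnable sketch of the loading step described above. The folder and file names mirror the lecture's layout; the two-row CSV is fabricated here so the example is self-contained:

```python
import os
import pandas as pd

# Recreate the lecture's layout: a "data_houses" folder next to the
# notebook, containing train.csv (contents invented for illustration).
os.makedirs("data_houses", exist_ok=True)
pd.DataFrame({"Id": [1, 2], "SalePrice": [208500, 181500]}).to_csv(
    "data_houses/train.csv", index=False
)

# Relative path: resolved from the directory the notebook/script runs in.
df = pd.read_csv("data_houses/train.csv")
print(df.shape)  # (2, 2)
```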
🔍 Exploring and Cleaning the Data Set
The lecturer delves into exploring the data set by using methods like `head()` to view the first few entries and `tail()` for the last entries. The data's structure is examined using `shape`, revealing the number of columns and rows. Attention is given to potential data cleaning tasks, such as checking for duplicate entries using `drop_duplicates()`. The lecturer also discusses the importance of understanding data characteristics using `describe()` to get statistical insights like mean, median, and distribution of values. The data types of columns are checked with `dtypes`, and the presence of null values is assessed with `isnull()`. The paragraph concludes with a discussion on handling missing values, suggesting the creation of a new DataFrame to focus on columns with missing data.
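The missing-value bookkeeping described above can be sketched like this (a tiny invented frame; the column names are borrowed from the real dataset for flavor):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan],
    "Electrical": ["SBrkr", np.nan, "SBrkr"],
    "SalePrice": [100000, 200000, 300000],
})

missing = df.isnull().sum()     # missing count per column
missing = missing[missing > 0]  # keep only columns that have gaps
missing = missing.sort_values() # smallest to largest
print(missing)
```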
📊 Visualizing Missing Data with Matplotlib
This section discusses the visualization of missing data using Matplotlib. The lecturer decides to create a bar plot to represent the number of missing values in each column. The process involves using `plt.bar()` to plot the missing values, with adjustments to the figure size for better readability. The lecturer also explains how to rotate the x-axis labels for clarity and adds a title to the plot. Customization options such as changing bar colors and adding labels are also covered. The visualization aims to show the count of missing values across different columns in the DataFrame.
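A self-contained sketch of that plot (the counts are invented; the Agg backend is used so no display is needed):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Counts of missing values per column (made-up numbers for the sketch).
missing = pd.Series({"Electrical": 1, "Alley": 1369, "PoolQC": 1453})

plt.figure(figsize=(15, 5))                    # wider figure
plt.bar(missing.index, missing.values, color="purple")
plt.xticks(rotation=90)                        # readable column names
plt.title("Count of missing values in our DataFrame")
plt.xlabel("Columns of the DataFrame")
plt.ylabel("Count of missing values")
plt.savefig("missing_values.png")
```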
📈 Handling Missing Values and Initial Data Exploration
The lecturer continues with strategies for handling missing values, often found in columns like 'PoolQC', 'MiscFeature', and others. They demonstrate how to fill missing values with a specific value using `fillna()` and discuss the implications of such actions on data analysis. The exploration of the data set progresses with looking at unique values and their counts using `value_counts()`, with a recommendation to use `normalize=True` for large data sets. The lecturer also touches on grouping data by certain features to analyze average sale prices and how different characteristics like 'Alley', 'LotShape', and 'LandContour' might affect these prices.
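The fill-and-count idea can be sketched on a toy Alley column (values invented). Note that `fillna` returns a new Series; the original is unchanged unless you reassign it or pass `inplace=True`:

```python
import numpy as np
import pandas as pd

alley = pd.Series([np.nan, "Grvl", np.nan, "Pave", np.nan], name="Alley")

filled = alley.fillna("No Alley")               # new Series, NaNs replaced
shares = filled.value_counts(normalize=True)    # proportions per category

print(filled.tolist())
print(shares["No Alley"])  # 0.6
```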
🏡 Analyzing Sale Prices and House Characteristics
The focus of this paragraph is on analyzing how house characteristics like the type of alley, fence, and number of bedrooms might influence sale prices. The lecturer uses group by operations to calculate the average sale price within different categories and discusses the insights gained from these calculations. For example, houses with a 'Paved' alley seem to have higher average sale prices compared to those with 'Gravel' alleys. The exploration also includes looking at the distribution of sale prices relative to the number of bedrooms, suggesting that the number of bedrooms may not have a straightforward correlation with sale price.
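The bedroom analysis can be sketched like this (prices invented to echo the lecture's point that more bedrooms do not imply a higher price):

```python
import pandas as pd

df = pd.DataFrame({
    "Bedrooms": [3, 2, 3, 8, 4],
    "SalePrice": [200000, 150000, 210000, 200000, 205000],
})

# Average price per bedroom count, ordered by the count itself
# (groupby sorts by key by default; sort_index makes the intent explicit).
price_by_bedrooms = df.groupby("Bedrooms")["SalePrice"].mean().sort_index()
print(price_by_bedrooms)
```

In this toy data the eight-bedroom house is no more expensive than the four-bedroom one, mirroring the lecture's observation.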
📈 Further Data Exploration and Upcoming Graphing Techniques
In the concluding paragraph, the lecturer summarizes the data exploration done so far, emphasizing the importance of asking the right questions and looking for meaningful patterns in the data. They also provide a sneak peek into the next part of the lecture, which will cover various graphing techniques in Python using libraries such as Matplotlib and Seaborn. The goal is to move beyond simple plots to create more complex and informative visualizations that help in understanding the data better.
Keywords
💡pandas
💡DataFrame
💡visualization
💡exploratory data analysis (EDA)
💡CSV
💡data cleaning
💡matplotlib
💡missing values
💡normalize
💡group by
💡Seaborn
Highlights
Introduction to the last chapter of the Python classes, focusing on a practical case with pandas.
Overview of the lecture's goal: to analyze a house-prices dataset with visualizations.
Explanation of where to find data sets for analysis, such as Kaggle and Google Datasets.
Demonstration of loading a CSV file into a pandas DataFrame.
Use of `head()` to view the first few lines of the data set.
Checking the shape of the data frame to understand its dimensions.
Importance of data cleaning and checking for duplicates.
Using `describe()` to get an overview of the data's statistics.
Checking data types and handling missing values in the data set.
Explanation of how to visualize missing values using a bar plot.
Customizing plot aesthetics like title, labels, and colors.
Strategies for dealing with missing values, such as filling them with a specific value.
Analyzing the distribution of sale prices and its relation to different features.
Using `groupby()` to explore how sale price varies by different categorical variables.
Exploring the impact of features like 'Alley' on the sale price.
Comparing the average sale price across different categories like 'BsmtQual'.
Discussing the importance of asking the right questions during data analysis.
Introduction to the second part of the lecture focusing on graphing in Python.
Overview of different types of plots that will be covered in the lecture.
Transcripts
Hello everyone. In this part of the lecture, which is the last chapter of the Python classes, we're going to go through a practical case with pandas. In the last lecture we saw what it looks like to work with pandas: the structure of a DataFrame and how we can manipulate those objects. In this part we're going to manipulate them on real-life data, so we can rehearse the concepts we've already seen, but we're going to add another layer, and that layer is visualization — we're going to learn how to make graphs. We're going to work along with a real dataset: the dataset we're going to explore is about house prices. House prices depend on a variety of things, and we're going to run this dataset through some data cleaning, see how that works, and do some data exploration — data exploration has a little nickname, EDA, for exploratory data analysis. We're also going to see how to make different kinds of graphs, simple ones, and so on.
So how does this work — what data should we use? As explained in another lecture, maybe if you did the SQL classes, we're going to use data that is part of a house-prices dataset. To find data you can go on Kaggle, you can go on Google Dataset Search, etc. — there are lots of datasets available. What I did first is save this CSV somewhere: I downloaded a CSV file from the web and saved it, and then I'm able to see what's happening. In this lesson we're basically going to go through a whole notebook: I'm going to explain how we make graphs and how this exploratory data analysis works. That's the main part of this lecture: a practical case of what we learned before and how it looks now.
I created a bit of a skeleton of a notebook — there is other stuff that could be done, but I just have a bit of a plan here. Our goal is to analyze our data. First, when we start a notebook, we need to import some libraries, so I import some: the ones we already know, plus a visualization one, but I'll come back to these libraries when we use them. We can execute that cell, and then I can restart my kernel so everything is fresh. Then I build `df`: to load my data I use the same as before — `pd`, which refers to pandas — and then `pd.read_csv`. I saved my data in a folder, so I pass `"data houses/train.csv"`: this is a relative path to my CSV file — the file is called `train.csv` and it sits in the folder `data houses`. It's a relative path from the location of this notebook.
Let me open a markdown cell to explain how this works. Say I have my Documents folder, and in my Documents folder I have this notebook, `eda_prices.ipynb`. Then I have a folder next to it called `data houses`, and in this folder I have something called `train.csv`. So `eda_prices.ipynb` and the folder `data houses` live in the same place. If, for instance, `train.csv` were right next to the notebook instead, then I would remove the `data houses/` prefix, because it's a relative path and they would be at the same level. That's not the case here: next to my notebook I have the folder `data houses`, and inside it I have `train.csv`, so I can import this `train.csv` and I get my `df`.
Then I want to check what my data looks like, and there are different things I can do. Remember, we can just do `df.head()` and we get the first five lines, so we can check what the data looks like: you have an index, an Id, and then LotFrontage, LotArea; there are some categorical columns — I think this one is the type of street, whether the lot shape is regular, the land contour; there's a pool, utilities, a fence — so you have lots of features for the houses, and then you get the sale condition, normal or abnormal, and you get the sale price. In this analysis, what we're interested in is the sale price, and how the sale price varies depending on some parameters. We do see a lot of columns and we don't know what all of them mean, but there is also a description file provided with the data. I'm not showing it here because I don't want to just read through it; I'll talk about the columns when it makes sense, as we look at them.

So this is how we check what the data looks like with `head()` — remember, by default it's the first five lines; if I want the top 10, I put 10 there and I can see more. There is also a function called `tail()`, which shows the tail, so the last rows. Another thing I can do to see what's in my data is look at the shape: `df.shape` shows that I have 81 columns — well, there is an Id column and a SalePrice column, so maybe 79 features that I will look at — and 1460 rows. That answers the question of how much data we have.
And then: are there duplicates? How do I check? These checks are part of data cleaning: if I had a data-cleaning section, checking for duplicates is what I would do first. Why? Because if I then do some counts or some statistics and there are duplicates, the results are not going to be accurate. So I can do `df.drop_duplicates()` and then `.shape`: I get a new DataFrame, and if it has the same shape as before, I know there were no duplicates — dropping the duplicates left the shape unchanged, meaning there are no duplicate rows. It does happen sometimes that we have duplicates: for instance, you start with some data, there is a process filling your data, it fails somewhere, you restart it, and you end up with duplicated rows. Or you save it twice, or you concatenate or merge different things — so yes, there are reasons you could have duplicates, and you just want to check.

Then we might want to look at the characteristics of the data, so we can run `describe()`. We see the count of 1460, we have a mean for the Id; we can see that the lot area on average is around 10,000 — square meters, maybe — and it's close to the median. From the quartiles, 75% of the lots are below about 11,000, so most of them are still close to the average of roughly 10,000, but you get very large disparities if you look at the extremes of the lot area. Then you get the year built: on average it's around 1971, and the most recent houses were constructed up to 2010. And if you look at the sale price, because that's the column we're interested in, we see that the average is about 180k, and the most expensive ones are less than a million — around 700k. So that's the spread for these house prices, and it gives us a bit of an overview.
Then, as I said, something else we can check is the data types, with `df.dtypes`. Across the different columns we can see that we have some integers and some objects, but most of them are integers. When a column is an object, it usually means it's a string, or there's a mix of types. For instance, the street type — if I remember, it's lots of strings, lots of text — so in that case it will be an object. A column only gets an integer dtype if everything in it is an integer.
Then we can also check the null values: are there some values missing? You do see here that in Fence, MiscFeature, and so on, we have a lot of missing values — Alley, None, etc. — and it represents a lot. So I'm going to count the null values: I have my `df` and I do `isna()`. Remember, `isna()` checks, for every element of the table, every row and column, whether it's NaN, and returns True or False; and then if I do a `sum()`, it sums per column, so I will see, for each column, how many nulls there are. This is important to do first, because it's important to know which values are missing, and then we can decide what to do with them — why are some values missing?

For this, let's go through the columns: if we do `df.columns` we see all the columns, and we're like, well, maybe we can check which ones have missing values. So first I want to isolate the ones that are missing: I'm going to create `missing = df.isna().sum()` — the same thing I did before — and this is my `missing`. It looks a bit like a DataFrame, an index with one column (it's actually a Series). Then I filter `missing` to keep only the entries where I do have missing values — if a column has zero, I'm not interested in it. So now I have my `missing`.
And that's quite interesting: I see there is a lot in Fence, MiscFeature, pool, fireplace, that kind of thing, and I'd like to sort my values, so I do `sort_values()`. There is one column with only a single missing value, which is Electrical, and then there are the ones with a lot. What does it mean when a column has a lot of missing values? It may mean the attribute is simply not specified: if PoolQC is missing, maybe it just means the house doesn't have a pool. So I look at the values for the pool: I do `df.PoolQC.unique()`, and I get the unique values — either NaN, or quality ratings like excellent, good, and so on — so I have different values describing the pool. For the alley, `df.Alley.unique()` gives gravel, paved, and NaN, meaning that if I have nothing, I just have no alley: if there is no alley, you can't specify whether it's paved or not, basically.
So that's it — now let's make our first plot. We know we have missing values and we would like to represent them as bars: the column names along the x-axis and the counts as the heights of the bars. For this we will use a library called matplotlib. Matplotlib is the library we imported at the top: we did `import matplotlib.pyplot as plt`, so now we're going to use this `plt` to plot the graph. I have my sorted missing values, and I do `plt.bar(...)` because I want a bar plot. If I take `.index` on my Series, I get the column names, and if I take `.values`, I get the counts. When I call `plt.bar` I have to give it x and height — the abscissa and the ordinate — so the index goes in as my x values, and for the heights I pass my values.
This is how it works, and, as you see, I get something like a bar chart, but the labels are all stuck together. What can I do? Two possibilities. I can tell my figure to be bigger: `plt.figure(figsize=(15, 5))` means I'm creating a figure with particular properties — I can say which size I want, and (15, 5) means 15 wide and 5 high. With that I can read the labels a bit better, but it's still difficult. So, in plt — plt being the library we use to draw graphs — there is something called `xticks`. `xticks` refers to the ticks on the x-axis (in a graph this is the x-axis and that is the y-axis). What I want is a rotation: here the labels are flat, and if I rotate them by an angle of 90 degrees, they look like this. To avoid all the printed return values, we can just add `plt.show()`, and there's our figure. Here we see that Electrical has a missing value but not many, and the ones with the most are Fence, Alley, pool, MiscFeature, and so on.
And then I'm like, well, maybe I want a title. To add a title I do `plt.title()` and pass a string saying what my graph shows — my graph is showing the missing values in the different columns, so we'll say "Count of missing values in our data frame". This `plt.title` works a bit like a method where you can specify other things: you could pass something like `fontsize`, or set the font family — maybe Arial, I don't know offhand whether that name exists there — but you can control the font and the font size, and so on. You can also change colors. Say you want your bars to have another color: you go to the line that produces the bars — remember, we wrote it together — and you add `color='red'`, and then they're red; if you want them green, same idea. If you give a color name that is not known, you'll get an error. The colors can also be passed as a list — say purple, purple, purple — and you can also provide RGB values, HTML hex codes, and so on. You can really provide a lot in Python, so your graphs are very customizable.
You can add labels as well. So I build my plot, I need my tick labels turned — that's the `xticks` part — I can add a title, and I can also add axis labels: `plt.xlabel`, where the x label describes what is down there, namely the columns of my DataFrame, so I put "Columns of the data frame", and we see it appear. Then there's `plt.ylabel` — I mean, it's already clear from the title that it's a count of missing values, but to show you how it works, for this `plt.ylabel` we put "Count of missing values". So now we have a complete graph with a title, we're able to manage the x tick labels, and we have "Columns of the data frame" and "Count of missing values" on the axes, etc.
So this is how it works. Now, what to do with the missing values? Most of the time here they correspond to absent attributes: no pool, no fence, no garage, no basement. So we could, for instance, fill in a value meaning "nothing". Remember we have `fillna`: what I could do is `df.fillna(...)` with whatever I want to fill in, and you can also work on a subset — you don't have to fill the whole DataFrame when you do a fillna. So, for example, I can do `df.Alley.fillna('No Alley')`, which fills the NaNs with "No Alley", and then if I do `.value_counts()` we see that for most of the rows in the DataFrame there is no alley. You can normalize as well, `normalize=True`, and you see that in 93% of the cases there is no alley, in about 3% of the cases there is gravel, and in only about 2.8% of the cases the alley is paved. Note that this is not actually filling my column, because I haven't said `inplace=True`, etc. If you pass `inplace=True` it modifies the object; if you don't, you create a new object, and then you have to assign it back if you want to keep it, or just keep working with the new Series. Those are all the possibilities.
So that's that — here we see the different columns. Something else we're interested in is how much the sale price varies. We can look at the unique values in a column — that's what we did with `.unique()`: you see the different possibilities. When you do `value_counts()` you get more than the possibilities: you get the repartition. For big datasets I recommend using `normalize=True`, because it gives you a better picture. You already know your dataset is 1460 rows and there are no duplicates; then if you do `.value_counts()` you have access to all the different unique values this column takes, and with normalize you get the percentages, so you know how it is distributed.
Something else we can look at when doing data exploration: I'm interested in this Alley column that I filled with "No Alley", and I'm asking, does the average sale price differ within those groups? So I take my `df` and I do a group by. Why group by Alley? Because my Alley column has only a few values, so I'm grouping by it — but this is something we can do with other columns too. I'm interested in my sale price, and I want to see whether it varies depending on this characteristic: `df.groupby('Alley')['SalePrice'].mean()` — I need to close the parentheses: group by, brackets, SalePrice, mean. And here I see that where there is no alley the average is around 183k, and for the others it's around 122k and 160k. So it looks like when there is a gravel alley the house is cheaper than with a paved one, for instance. But remember, those two groups are only about two and three percent of the data, so it's not a very solid comparison between "No Alley" and the other two. Still, comparing those small groups, we're like: okay, maybe paved is better than gravel — I guess it's more expensive, so it goes with the house being nicer; if you have a good-looking house, maybe you're not just going to put gravel in front of it. So this is something very useful: a group by to check the difference in price.
We can check whether the ones with a fence are more expensive. To do that I can do the same thing with the fence: `df.groupby('Fence')['SalePrice'].mean()`. I group by the unique values of Fence, and I see there is a bit of variation: these ones are close, and this one — GdPrv, I don't know exactly what it stands for — has somewhat more expensive houses.
The one that is a bit more interesting to look at is the bedroom column — bedrooms above grade. If I do a value count on it, we see values like three, two, one — so it's the number of bedrooms. So I group by it and take the sale price, and then I want to sort the index, `sort_index()`, so the rows go from zero bedrooms upward. And I do see some dissimilarities: a house with eight bedrooms is as expensive as one with four, so there isn't really a clean correlation — not something we can explain with just the bedroom count.
So for this part we've mainly done exploration with group by and mean. What interests us in this study is the sale price, and we want to see how different parameters influence it. Doing this kind of group by and then asking whether the average differs within categories is a good way to start a data analysis: you have different parameters and you ask, is the behavior the same in the different groups? And you don't have to look only at the mean: maybe it's more interesting for you to look at the median price, or at the minimum — how is the cheapest house doing in each category — and you can group by the bedrooms or the alley, and so on. There are lots of different things you can do, and at this point it's really you asking yourself a question: what am I interested to see here, what question do I want to answer? I'm like, okay, does this parameter have an effect on the sale price on average? So I look at the mean, which is a good indication of how it's distributed, and I can look at the standard deviation: does it vary a lot within the categories? So this is something very typical and nice to do.
Then in the second part of this lecture we're going to see how to make graphs in Python: some basic plots, then plots with time series, some histograms, and then continue this data exploration. I'm going to explain how you can build a graph with plt — we already saw a bit of how that works here — but we're going to see more ways of doing it, not only bar plots but all kinds of plots, and then we will use another library called Seaborn. When using Seaborn you're going to see how rich it is. I think it's one of the good libraries to start with: you start with pyplot, and built on top of pyplot you have Seaborn, which lets you make very complex graphs and is very well adapted to making graphs from DataFrames.