Histograms and Density Plots for Numeric Variables | Statistics Tutorial | MarinStatsLectures
Summary
TLDR: The video explains the importance of histograms and kernel density plots in visualizing the distribution of numerical data. Using the example of age distribution, it walks through creating bins, calculating frequencies, and representing them visually through histograms. The speaker highlights how changing bin sizes affects the distribution and introduces kernel density plots as a smoother alternative. It also touches on interpreting key distribution features like center, spread, and symmetry, setting the stage for deeper discussions on probability distributions in future videos.
Takeaways
- 📊 Histograms and kernel density plots are used to visualize the distribution of a sample, especially for continuous or numerical variables.
- 🧮 The first step in creating a histogram is to make a frequency table by dividing the data into bins (e.g., 0-10, 10-20, etc.) and counting how many observations fall into each bin.
- 🔢 Bins can vary in size, and different software might choose different bin ranges (e.g., 0-10 vs. 0-15). It’s important to stay consistent in bin selection.
- 📉 Histograms visually represent how data is distributed across bins, with the x-axis showing the variable and the y-axis showing frequency, percentage, or proportion.
- 📏 Bars in a histogram should be of equal width, with no gaps between them when displaying continuous data like age.
- 🔄 Changing bin sizes can alter the shape of the histogram, so the visual appearance of the distribution may change slightly based on how the bins are set.
- 📐 Histograms help identify the center, spread, and shape of a distribution, showing whether it's symmetric or skewed.
- 📦 Another way to summarize the distribution of a numeric variable is through a box plot, which offers a different visual approach.
- 🔀 Kernel density plots smooth out histograms by reducing the impact of bin selection, creating a more continuous estimate of the probability distribution.
- 🎯 Both histograms and kernel density plots are essential for summarizing sample data distributions, and they provide insights into the underlying probability distribution.
Q & A
What are histograms and why are they useful?
-Histograms are graphical representations used to visualize the distribution of numerical, quantitative, or continuous variables by dividing the data into bins. They help us understand the frequency of values within each bin and provide insight into the shape, center, and spread of the data distribution.
How do you create a frequency table for a dataset?
-To create a frequency table, you divide the range of the dataset into bins or categories, such as 0-10, 10-20, etc., and then count how many data points fall into each bin. You can also convert these frequencies into proportions or percentages for a clearer understanding of the distribution.
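As a rough illustration of the steps in this answer, here is a minimal Python sketch. The ages are invented for illustration (they are not the video's sample), `frequency_table` is a hypothetical helper, and sending border values to the bin above is just one consistent choice.

```python
# Hypothetical ages, invented for illustration (not the video's sample).
ages = [3, 7, 12, 18, 20, 21, 25, 29, 33, 34, 38, 41, 47, 52, 58, 60, 64, 71]

def frequency_table(values, width=10):
    """Count values per bin [lower, lower + width); border values go to the bin above."""
    counts = {}
    for v in values:
        lower = int(v // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    n = len(values)
    first, last = min(counts), max(counts)
    # Include empty bins between the first and last occupied bin.
    return [(lo, lo + width, counts.get(lo, 0), counts.get(lo, 0) / n)
            for lo in range(first, last + width, width)]

table = frequency_table(ages)
for lo, hi, freq, prop in table:
    print(f"{lo:>3}-{hi:<3} freq={freq:<3} proportion={prop:.2f} percent={prop:.0%}")
```

The proportions always sum to 1 (and the percentages to 100%), which is a quick check that no observation was dropped or counted twice.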
What is the significance of choosing different bin sizes in histograms?
-Choosing different bin sizes can change the shape of the histogram and affect how the distribution appears. For example, using larger bins can smooth out fluctuations, while smaller bins might show more detail but may also appear more erratic.
What are the key differences between a histogram and a bar chart?
-A histogram is used for continuous data and the bars touch each other, indicating that the data points are related and fall on a continuous scale. In contrast, a bar chart is used for categorical data, and the bars do not touch, indicating that each category is distinct and separate.
What do histograms tell us about the center and spread of a distribution?
-Histograms provide a visual representation of the center of the distribution (which can later be summarized numerically using measures like the mean or median) and show how spread out or clustered the data is. They also indicate the variability and symmetry or skewness of the data.
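The numerical summaries mentioned here can be sketched with Python's standard `statistics` module. The ages are hypothetical, and using the range and the standard deviation as measures of spread is an assumption of this sketch; the video only names means and medians at this point.

```python
import statistics

# Hypothetical ages, invented for illustration (not the video's sample).
ages = [3, 7, 12, 18, 20, 21, 25, 29, 33, 34, 38, 41, 47, 52, 58, 60, 64, 71]

# Two common summaries of the center.
center_mean = statistics.mean(ages)
center_median = statistics.median(ages)

# Two common summaries of the spread (assumed here, not named in the video).
spread_range = max(ages) - min(ages)
spread_sd = statistics.stdev(ages)  # sample standard deviation

print(f"mean={center_mean:.1f} median={center_median} range={spread_range} sd={spread_sd:.1f}")
```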
How does changing bin sizes impact the shape of a histogram?
-Changing the bin sizes can alter the appearance of the histogram. Larger bins may result in a smoother, more generalized shape, while smaller bins can reveal more detail but may introduce more variability and noise in the plot.
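To see the effect described above, the same data can be binned at several widths and drawn as a crude text histogram. This is only a sketch: the ages are hypothetical and `bin_counts` is an illustrative helper, not something from the video.

```python
# Hypothetical ages, invented for illustration (not the video's sample).
ages = [3, 7, 12, 18, 20, 21, 25, 29, 33, 34, 38, 41, 47, 52, 58, 60, 64, 71]

def bin_counts(values, width):
    """Map each bin's lower edge to its count, using bins [lower, lower + width)."""
    counts = {}
    for v in values:
        lower = int(v // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# The same sample looks bumpier with narrow bins and smoother with wide ones.
for width in (5, 10, 20):
    print(f"bin width {width}")
    for lower, count in bin_counts(ages, width).items():
        print(f"  {lower:>3}-{lower + width:<3} {'#' * count}")
```

Every observation is still counted exactly once at each width; only the grouping, and therefore the apparent shape, changes.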
What is the kernel density plot and how does it differ from a histogram?
-A kernel density plot is a smooth version of a histogram that estimates the probability distribution of a continuous variable. Instead of relying on fixed bins, it smooths the data to reduce the impact of arbitrary bin selection and provides a continuous estimate of the distribution.
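The video deliberately skips the mathematical details, so the following is only a sketch of one common variant: a kernel density estimate that places a Gaussian bump on every observation and averages them. The Gaussian kernel, the bandwidth of 5 years, and the ages themselves are all assumptions made for illustration.

```python
import math

# Hypothetical ages, invented for illustration (not the video's sample).
ages = [3, 7, 12, 18, 20, 21, 25, 29, 33, 34, 38, 41, 47, 52, 58, 60, 64, 71]

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x using a Gaussian kernel."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

density = gaussian_kde(ages, bandwidth=5.0)  # bandwidth chosen arbitrarily

# Crude text plot: one row per 5 years, bar length proportional to the density.
for x in range(0, 81, 5):
    print(f"{x:>3} {'#' * round(density(x) * 1000)}")

# Riemann sum (step 1) over a wide grid; a density should integrate to about 1.
area = sum(density(x) for x in range(-40, 121))
print(f"approximate area under the curve: {area:.4f}")
```

There are no bins here, so the bin-choice problem goes away, but the bandwidth plays a similar role: a larger value gives a smoother curve and a smaller value a bumpier one.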
Why is it important to be consistent with handling border values in histograms?
-Consistency with border values (e.g., deciding whether an age of 20 belongs in the 10-20 or 20-30 bin) is important to ensure accurate and reliable representation of the data. Software typically has a default setting for handling border values, but you can adjust it for your specific needs.
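The two conventions can be made explicit in a short sketch. Both helper functions are hypothetical and assume non-negative ages and equal-width bins; which convention a given piece of software uses by default should be checked in its documentation.

```python
import math

def bin_left_closed(age, width=10):
    """Bins [lower, upper): a border value such as 20 goes to the bin above (20-30)."""
    lower = int(age // width) * width
    return lower, lower + width

def bin_right_closed(age, width=10):
    """Bins (lower, upper]: a border value such as 20 goes to the bin below (10-20).

    An age of exactly 0 is placed in the lowest bin so that it is not lost.
    """
    upper = max(math.ceil(age / width), 1) * width
    return upper - width, upper

for age in (0, 19, 20, 21, 30):
    print(age, bin_left_closed(age), bin_right_closed(age))
```

Only values that sit exactly on a border are affected; every other age lands in the same bin under both rules.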
What insights can we gain from a histogram's shape?
-The shape of a histogram can reveal whether the data is symmetric, skewed (to the left or right), or has multiple peaks. This helps in identifying patterns such as normal distribution or the presence of outliers and anomalies in the data.
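Skewness can also be given a number. The sketch below uses the moment-based sample skewness (third central moment divided by the second central moment to the power 1.5), which is one common definition and an assumption here, since the video defers precise definitions to later lessons. The data sets are invented.

```python
def sample_skewness(values):
    """Moment-based skewness: about 0 if symmetric, > 0 right-skewed, < 0 left-skewed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Invented examples, not data from the video.
symmetric = [10, 20, 30, 40, 50]
right_skewed = [10, 10, 20, 20, 30, 90]      # long tail toward large values
left_skewed = [-v for v in right_skewed]     # mirror image: long tail toward small values

for name, data in [("symmetric", symmetric), ("right", right_skewed), ("left", left_skewed)]:
    print(f"{name:<10} skewness = {sample_skewness(data):+.2f}")
```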
What is a box plot and how is it related to histograms?
-A box plot is another way to visualize the distribution of a numeric variable. It summarizes data using the median, quartiles, and potential outliers, and it gives a more compact representation compared to histograms, which show the full distribution with bars. Both histograms and box plots are useful for summarizing numerical data.
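The numbers a box plot draws can be computed with Python's standard `statistics` module, as in this sketch. The ages are hypothetical, and flagging points beyond 1.5 times the interquartile range as potential outliers is a common convention assumed here, not something stated in the video.

```python
import statistics

# Hypothetical ages, invented for illustration (not the video's sample).
ages = [3, 7, 12, 18, 20, 21, 25, 29, 33, 34, 38, 41, 47, 52, 58, 60, 64, 71]

# Quartiles split the sorted data into four equal-sized parts.
q1, median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1  # width of the box

# Assumed convention: points beyond 1.5 * IQR from the box are potential outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in ages if a < low_fence or a > high_fence]

print(f"min={min(ages)} Q1={q1} median={median} Q3={q3} max={max(ages)}")
print(f"IQR={iqr} potential outliers={outliers}")
```

Different quartile methods give slightly different Q1 and Q3 values on small samples, so software output may not match this sketch exactly.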
Outlines
📊 Understanding Histograms and Frequency Tables
The first paragraph introduces histograms and kernel density plots as tools for visualizing the distribution of numerical variables like age. The speaker walks through an example of collecting a sample of 50 individuals and categorizing their ages into bins (e.g., 0–10, 10–20). It explains how to create a frequency table manually, calculate the frequencies for each bin, and convert them into percentages or proportions. The importance of bin choices and handling border cases (like an age of 20) is discussed. Finally, it emphasizes that a histogram visually represents the distribution of the variable, highlighting the equal width of bars and the continuity of age.
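The arithmetic in the video's worked example can be reproduced directly. This sketch uses only the four frequencies that are read aloud in the transcript (5, 4, 8 and 10 out of a sample of 50); the remaining bins are filled in on screen without being spoken, so they are left out rather than guessed.

```python
total = 50  # sample size stated in the video

# Frequencies spoken in the transcript; the other bins are not read aloud.
stated = {"0-10": 5, "10-20": 4, "20-30": 8, "30-40": 10}

# Proportion = frequency / total; percentage = proportion * 100.
proportions = {label: freq / total for label, freq in stated.items()}

for label, freq in stated.items():
    p = proportions[label]
    print(f"{label:<6} {'#' * freq:<10} freq={freq:<2} proportion={p:.2f} ({p:.0%})")

# Observations that must fall in the bins not read aloud.
remaining = total - sum(stated.values())
print(f"observations in the remaining bins: {remaining}")
```

The `#` bars are a sideways text version of the histogram the speaker draws for these four bins.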
📉 The Role of Bin Sizes and Kernel Density Plots
The second paragraph explains how changing bin sizes in a histogram can alter the shape of the distribution and what this means visually. It covers key ideas like the center, spread, and shape of distributions, mentioning how histograms reveal skewness and symmetry. The concept of box plots as a different visualization method is introduced, followed by an explanation of kernel density plots, which smooth the bumps of a histogram to create a more fluid representation of the distribution. The paragraph closes by linking histograms and kernel density plots to probability distributions, which will be expanded in future videos.
Keywords
💡Histogram
💡Kernel Density Plot
💡Frequency Table
💡Bins
💡Distribution
💡Proportion
💡Continuous Variable
💡Center of Distribution
💡Skewness
💡Box Plot
Highlights
Introduction to histograms and kernel density plots for visualizing data distributions.
Example of recording ages for a sample size of 50 and the importance of visualizing the distribution for continuous variables.
Discussion on manually calculating frequency tables and converting frequencies into proportions or percentages.
The choice of bins (or categories) is flexible, e.g., 0-10, 10-20 or 0-15, 15-30, and that choice affects the resulting distribution.
The importance of consistency in assigning values on bin borders, e.g., should an age of 20 go in the 10-20 or 20-30 bin?
Histograms allow for visual representation of data distributions, with continuous variables shown as bars that touch.
The number of bins in a histogram is typically between 5 and 10 for an effective visualization.
Changing the bin sizes in a histogram can alter the shape of the distribution, making it important to choose bin sizes carefully.
Histograms reveal the center and spread of a data set and help identify if the distribution is symmetric or skewed.
Box plots offer another way to summarize a numeric variable, providing a different visualization compared to histograms.
Kernel density plots provide a smoothed-out version of histograms to avoid the fluctuations caused by different bin choices.
Kernel density plots are particularly useful for approximating the probability distribution of a numeric variable.
Histograms and kernel density plots are essential for summarizing the distribution of numeric or continuous variables in data.
These visual tools help estimate the probability distribution of a sample, laying the foundation for understanding probability.
The video sets the stage for further exploration of probability distributions and statistical concepts in upcoming lessons.
Transcripts
so let's talk a little bit about
histograms as well as kernel density
plots so what are they why are they
useful we'll start with a simple example
supposing we've collected a sample of
size 50 and recorded the ages for a
bunch of individuals so histograms as
well as density plots can be used to
help visualize the distribution of our
sample and for a numerical or
quantitative or continuous variable so
we're going to go through and do some of
this stuff by hand
in reality we won't ever do them by hand
we'll have a piece of software or a
computer do them for us but we're going
to go through and do it by hand for the
sake of discussion and exploring
what exactly these plots are the first
thing we need to do is start by talking
about a frequency table what we can do
is we can when we're going to want to
look at the distribution for our
variable age and we can go through and
create a bunch of bins or buckets or
whatever we want to call them so I'll
look at 0 to 10 10 to 20 20 to 30 30 to
40 and so on then what we're doing next
is count how many people fall into each
of these so I guess one thing worth
mentioning before we get into that I've
chosen these bins or categories to be 0
to 10 10 to 20 and so on you or a
computer might choose slightly different
bins so they might choose 0 to 15 15 to
30 30 to 45 and so on okay so this is
just a choice I've made here for the
sake of discussion at the moment so the
next thing we can do is calculate the
frequency or how many people fall into
each of these different bins so let's
just suppose that we had 5 people
falling in the 0 to 10 four falling into
10 to 20 and I'll just fill in the rest
we can fill this to be the total all
right and in total we've got 50
individuals so the next thing we can do
is convert these frequencies into either
proportions or percentages and so I'm
going to go through and write the
percentage but it doesn't really matter
so this frequency of 5 ends up
being 10% and if you want to record it as a
proportion it'd be 0.10 this four is 8% you
get a proportion of 0.08 16% for a
total of 100% as we said this keyword
distribution this shows the distribution
for the variable age right so again
how are people distributed amongst the
different bins or categories that we've
created now a few things to mention what
do we do with observations that fall on
the border and what I mean by that is an
age of 20 does it go in this bin here or
this bin here doesn't really matter as
long as you're consistent so if you
decide that an age of 20 is going to go
to the 20 to 30 bin okay or the one
above then any ones that fall on the
border should always go to the category
above again a software when it does this
it will have a default value you can
change that if you want when doing this with
a piece of software it will choose the
bins for you as well as the number you
can have it change and do more or less
bins you can modify those if you want
usually it's good to create somewhere
between five and ten bins
it's worth noting that if we change the
bins this frequency distribution is
going to change slightly so let's talk a
bit about how we can make a picture or a
plot of this right again
it's a bit well not complicated but it's
a bit messy to look at this a visual is
going to help us out a lot the visual we
can create is called the histogram
essentially what we do is create a plot
of this table so along the x-axis goes
the variable age 0 to 10 20 and then on
the y-axis we can either put the
frequency or we can put the proportion
or the percentage here I'm going to
choose to put the frequency but again it
doesn't matter the plot will look the
same it's really about which one you
prefer to put down here is 0 up to 10 1
2 3 4 5 6 7 8 9 now we can see in the 0 to
10 bin we have 5 people
so I'll shade that in here in the 10 to
20 we have four 20 to 30 we've got 8 30
to 40 we've got 10 now again another
keyword this here is a nice visual right
again it shows us the distribution
how are people distributed for the
variable age now when doing this plot
them each of the bar should be equal
width okay they are for the most part I
studied statistics not art so maybe it's
a little bit off here you will notice
that each of the bars are touching right
and again that's because age is this
continuous measurement there's no space
between the categories or groupings
where when making a bar chart we left
space between to indicate they're
separate distinct categories as
mentioned before if we were to change
the bins right if we were to make them
bigger say 0 to 20 20 to 40 and so on
the shape of this might change slightly
so it's important to note that and it's
important to note that this plot here
helps us visualize a lot about the
variable age right we can kind of see
what's the center so later we'll
summarize those using things like means
or medians but we can see the center of
the distribution looks somewhere around
here how spread out are things right are
ages highly variable or are they pretty
narrow again we'll find ways to
numerically summarize Center and spread
very soon it also tells us the shape of
the distribution does it look fairly
symmetric or does it look a little bit
skewed in one direction these are all
things we'll tighten up the wording and
definitions on soon and it's also worth
mentioning another plot for summarizing
a numeric variable the distribution of
a numeric variable is a box plot which
is similar to a histogram well similar
in what it tries to show but a
different way of trying to summarize a
numeric variable later we're going to
kind of build on all these concepts just
mentioned here another important related
concept is the idea of a kernel density
often the word just density gets thrown
around and the kernel gets left out of it
what a kernel density is without getting
into the kind of mathematical and
technical details of it what it is is
a sort of smoothing out of this
histogram here and again it's trying to
get
around the idea that I've mentioned if
we change the bins slightly the shape
of this is going to change a little bit
okay so rather than bumping around so
much it tries to smooth this out a way
to think about it without getting into
the technical details of it is it's sort
of a smoothed-out version of the
histogram
okay so again the histogram or the
kernel density plot they help us summarize
well they're useful for helping us
summarise the distribution of a sample
for a numeric or continuous variable and
they give us a sort of estimate of what
the probability distribution will look
like and again probability distribution
is another concept we're going to build
on and expand on in following videos
guys like the video subscribe to our
channel cuz we got lots more almost as
beautiful as a unicorn