Histograms and Density Plots for Numeric Variables | Statistics Tutorial | MarinStatsLectures

MarinStatsLectures-R Programming & Statistics
20 Aug 2019 · 07:34

Summary

TLDR: The video explains the importance of histograms and kernel density plots in visualizing the distribution of numerical data. Using the example of age distribution, it walks through creating bins, calculating frequencies, and representing them visually through histograms. The speaker highlights how changing bin sizes affects the distribution and introduces kernel density plots as a smoother alternative. It also touches on interpreting key distribution features like center, spread, and symmetry, setting the stage for deeper discussions on probability distributions in future videos.

Takeaways

  • 📊 Histograms and kernel density plots are used to visualize the distribution of a sample, especially for continuous or numerical variables.
  • 🧮 The first step in creating a histogram is to make a frequency table by dividing the data into bins (e.g., 0-10, 10-20, etc.) and counting how many observations fall into each bin.
  • 🔢 Bins can vary in size, and different software might choose different bin ranges (e.g., 0-10 vs. 0-15). What matters is handling values that fall on a bin border consistently.
  • 📉 Histograms visually represent how data is distributed across bins, with the x-axis showing the variable and the y-axis showing frequency, percentage, or proportion.
  • 📏 Bars in a histogram should be of equal width, with no gaps between them when displaying continuous data like age.
  • 🔄 Changing bin sizes can alter the shape of the histogram, so the visual appearance of the distribution may change slightly based on how the bins are set.
  • 📐 Histograms help identify the center, spread, and shape of a distribution, showing whether it's symmetric or skewed.
  • 📦 Another way to summarize the distribution of a numeric variable is through a box plot, which offers a different visual approach.
  • 🔀 Kernel density plots smooth out histograms by reducing the impact of bin selection, creating a more continuous estimate of the probability distribution.
  • 🎯 Both histograms and kernel density plots are essential for summarizing sample data distributions, and they provide insights into the underlying probability distribution.
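The binning-and-counting step described in the takeaways can be sketched in a few lines of Python. This is a minimal illustration rather than the video's own procedure: the `ages` list is invented, `frequency_table` is a hypothetical helper name, and the border rule used here (a value on a border goes to the bin above) is just one of the consistent choices the video allows.

```python
from collections import Counter

# Hypothetical ages, invented purely for illustration (not the video's data).
ages = [3, 7, 12, 18, 20, 21, 25, 29, 30, 34, 35, 38, 41, 47, 52]

BIN_WIDTH = 10  # bins are 0 to 10, 10 to 20, 20 to 30, ...


def frequency_table(values, width):
    """Count how many values fall in each bin [lo, lo + width).

    A value exactly on a border (e.g. 20) is assigned to the bin above.
    Bins that contain no observations are simply left out of the result.
    """
    counts = Counter((v // width) * width for v in values)
    return {(lo, lo + width): counts[lo] for lo in sorted(counts)}


table = frequency_table(ages, BIN_WIDTH)
for (lo, hi), count in table.items():
    print(f"{lo:>2} to {hi:<2}  {count}")
```

Each row of the printed table corresponds to one bar of the histogram; the counts always add up to the sample size.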

Q & A

  • What are histograms and why are they useful?

    -Histograms are graphical representations used to visualize the distribution of numerical, quantitative, or continuous variables by dividing the data into bins. They help us understand the frequency of values within each bin and provide insight into the shape, center, and spread of the data distribution.

  • How do you create a frequency table for a dataset?

    -To create a frequency table, you divide the range of the dataset into bins or categories, such as 0-10, 10-20, etc., and then count how many data points fall into each bin. You can also convert these frequencies into proportions or percentages for a clearer understanding of the distribution.

  • What is the significance of choosing different bin sizes in histograms?

    -Choosing different bin sizes can change the shape of the histogram and affect how the distribution appears. For example, using larger bins can smooth out fluctuations, while smaller bins might show more detail but may also appear more erratic.

  • What are the key differences between a histogram and a bar chart?

    -A histogram is used for continuous data, and its bars touch because neighboring bins share a border on a continuous scale. In contrast, a bar chart is used for categorical data, and the bars do not touch, indicating that each category is distinct and separate.

  • What do histograms tell us about the center and spread of a distribution?

    -Histograms provide a visual representation of the center of the distribution (which can later be summarized numerically using measures like the mean or median) and show how spread out or clustered the data is. They also indicate the variability and symmetry or skewness of the data.

  • How does changing bin sizes impact the shape of a histogram?

    -Changing the bin sizes can alter the appearance of the histogram. Larger bins may result in a smoother, more generalized shape, while smaller bins can reveal more detail but may introduce more variability and noise in the plot.

  • What is the kernel density plot and how does it differ from a histogram?

    -A kernel density plot is a smooth version of a histogram that estimates the probability distribution of a continuous variable. Instead of relying on fixed bins, it smooths the data to reduce the impact of arbitrary bin selection and provides a continuous estimate of the distribution.

  • Why is it important to be consistent with handling border values in histograms?

    -Consistency with border values (e.g., deciding whether an age of 20 belongs in the 10-20 or 20-30 bin) is important so that every observation is counted in exactly one bin under the same rule. Software typically has a default setting for handling border values, but you can adjust it for your specific needs.

  • What insights can we gain from a histogram's shape?

    -The shape of a histogram can reveal whether the data is symmetric, skewed (to the left or right), or has multiple peaks. This helps in identifying patterns such as normal distribution or the presence of outliers and anomalies in the data.

  • What is a box plot and how is it related to histograms?

    -A box plot is another way to visualize the distribution of a numeric variable. It summarizes data using the median, quartiles, and potential outliers, and it gives a more compact representation compared to histograms, which show the full distribution with bars. Both histograms and box plots are useful for summarizing numerical data.
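The border-value question from the Q&A above can be made concrete with a small sketch. The two function names are hypothetical, and the special handling of an age of exactly 0 in the second rule is an assumption made here so that no observation falls outside the table.

```python
import math


def bin_left_closed(age, width=10):
    """Border values go to the bin above: an age of 20 lands in 20 to 30."""
    lo = (age // width) * width
    return (lo, lo + width)


def bin_right_closed(age, width=10):
    """Border values go to the bin below: an age of 20 lands in 10 to 20."""
    # max(..., 1) keeps an age of exactly 0 in the first bin (an assumption
    # made for this sketch; real software has its own rule for the lowest value).
    hi = max(math.ceil(age / width), 1) * width
    return (hi - width, hi)


# Only the border value 20 is treated differently by the two rules.
for age in (19, 20, 21):
    print(age, bin_left_closed(age), bin_right_closed(age))
```

Either rule is fine on its own; what matters is that one rule is applied to every border value.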

Outlines

00:00

📊 Understanding Histograms and Frequency Tables

The first paragraph introduces histograms and kernel density plots as tools for visualizing the distribution of numerical variables like age. The speaker walks through an example of collecting a sample of 50 individuals and categorizing their ages into bins (e.g., 0–10, 10–20). It explains how to create a frequency table manually, calculate the frequencies for each bin, and convert them into percentages or proportions. The importance of bin choices and handling border cases (like an age of 20) is discussed. Finally, it emphasizes that a histogram visually represents the distribution of the variable, highlighting the equal width of bars and the continuity of age.

05:01

📉 The Role of Bin Sizes and Kernel Density Plots

The second paragraph explains how changing bin sizes in a histogram can alter the shape of the distribution and what this means visually. It covers key ideas like the center, spread, and shape of distributions, mentioning how histograms reveal skewness and symmetry. The concept of box plots as a different visualization method is introduced, followed by an explanation of kernel density plots, which smooth the bumps of a histogram to create a more fluid representation of the distribution. The paragraph closes by linking histograms and kernel density plots to probability distributions, which will be expanded in future videos.
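To see why changing the bins changes the picture, the same sample can be counted twice with different bin widths. The sketch below uses an invented `ages` list and a hypothetical helper `bin_counts`; the widths 10 and 20 echo the examples mentioned in the video.

```python
from collections import Counter

# Hypothetical sample, invented for illustration only.
ages = [2, 5, 8, 11, 14, 22, 23, 26, 27, 29,
        31, 33, 36, 38, 39, 44, 48, 55, 61, 67]


def bin_counts(values, width):
    """Map each bin's lower edge to the number of values in [edge, edge + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


narrow = bin_counts(ages, 10)  # bins 0-10, 10-20, ...
wide = bin_counts(ages, 20)    # bins 0-20, 20-40, ...

print("width 10:", narrow)
print("width 20:", wide)
```

With the narrower bins the counts bounce between 1 and 5 from bin to bin; with the wider bins the same data collapses into four bars with a single clear peak. The data has not changed, only the summary of it.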

Keywords

💡Histogram

A histogram is a bar-style plot used to represent the distribution of numerical data. In the video, it is used to visually display the frequency distribution of ages in different intervals, or 'bins'. The bars in a histogram touch each other, which signifies that the data is continuous, as opposed to a bar chart, where the categories are distinct.

💡Kernel Density Plot

A kernel density plot is a way to estimate the probability density function of a continuous variable, smoothing out the bumps in a histogram. In the video, it's described as a 'smoothed version' of a histogram that can help us see a more general pattern in the data without the jaggedness caused by the choice of bins.
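The video deliberately skips the mathematics, but one common way to build such a smooth curve is to place a small bell-shaped bump on every observation and average the bumps. The sketch below assumes a Gaussian kernel and a bandwidth of 5 years; both are choices made here for illustration and are not taken from the video, and the `ages` list is invented.

```python
import math

# Hypothetical sample, invented for illustration only.
ages = [3, 8, 14, 19, 22, 25, 27, 31, 33, 34, 36, 38, 41, 45, 52, 60]


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, each centered on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


density = gaussian_kde(ages, bandwidth=5.0)

# Evaluate the smooth curve on a coarse grid of ages.
for x in range(0, 71, 10):
    print(f"age {x:>2}: density {density(x):.4f}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, so the bandwidth plays a role similar to the bin width in a histogram.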

💡Frequency Table

A frequency table lists data values or ranges (bins) and the count of how many times each value or range appears in the dataset. In the video, the instructor uses this to organize the age data into groups (e.g., 0-10, 10-20) and counts how many individuals fall into each bin, which forms the basis of the histogram.

💡Bins

Bins, also called 'buckets' or 'categories,' are intervals used in histograms to group continuous data. In the video, the ages of individuals are grouped into bins like 0-10, 10-20, etc. The number and size of bins can be adjusted, and this will affect the appearance of the histogram.

💡Distribution

Distribution refers to how the values of a variable are spread out or arranged. In the context of the video, the distribution of the ages is illustrated through the histogram and kernel density plot, showing where the majority of values (ages) are concentrated and how they are spread across different bins.

💡Proportion

Proportion refers to the fraction or percentage of the total number of data points that fall into a certain category. In the video, the instructor converts the frequency of individuals in each age bin into proportions (e.g., 10% of individuals are in the 0-10 age group), which helps in comparing the relative size of different groups.
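The conversion is simple arithmetic, sketched here with the four frequencies that are stated aloud in the video (5, 4, 8 and 10 out of a sample of 50). The remaining bins are not given in the transcript, so they are left out rather than guessed; the variable names are illustrative.

```python
TOTAL = 50  # sample size stated in the video

# Frequencies stated in the video; the remaining bins are not given.
frequencies = {"0 to 10": 5, "10 to 20": 4, "20 to 30": 8, "30 to 40": 10}

# proportion = count / total, percentage = 100 * count / total
proportions = {label: count / TOTAL for label, count in frequencies.items()}
percentages = {label: 100 * count / TOTAL for label, count in frequencies.items()}

for label in frequencies:
    print(f"{label:>8}: {frequencies[label]:>2}  "
          f"{proportions[label]:.2f}  {percentages[label]:.0f}%")
```

The first three rows reproduce the 10%, 8% and 16% quoted in the video.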

💡Continuous Variable

A continuous variable is a variable that can take any numerical value within a range. In the video, 'age' is an example of a continuous variable because it can take on any value (e.g., 20.5 years). This is why a histogram is suitable for visualizing its distribution, as it shows how the ages are spread across the bins.

💡Center of Distribution

The center of distribution refers to the point around which the data is concentrated. In the video, the instructor notes that the histogram and kernel density plot can help us see where the 'center' of the age data lies, which could be summarized later using the mean or median.

💡Skewness

Skewness describes the asymmetry of a distribution. In the video, the instructor mentions that we can observe whether the distribution of ages is symmetric or skewed (i.e., more data points are concentrated on one side of the center). This concept will later help in understanding the shape of the distribution.
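One rough numerical hint of skewness, before any formal definition, is to compare the mean with the median: a long tail on one side tends to pull the mean toward it. The sketch below uses an invented sample and a hypothetical helper name; the comparison is a rule of thumb only and can mislead for unusual distributions.

```python
import statistics

# Hypothetical sample with a few much older individuals (a long right tail).
ages = [21, 22, 23, 24, 25, 26, 27, 30, 45, 70]


def rough_skew_direction(values):
    """Rule of thumb only: compare the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "roughly symmetric"


mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
print(f"mean {mean_age}, median {median_age}, "
      f"skew hint: {rough_skew_direction(ages)}")
```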

💡Box Plot

A box plot is another way to visually represent the distribution of a numerical variable. It shows the median, quartiles, and potential outliers in the data. In the video, it is briefly mentioned as a tool similar to a histogram, providing a summary of the distribution but in a different visual format.
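The numbers a box plot draws can be computed directly. This is a minimal sketch using Python's standard `statistics.quantiles`; the `ages` list is invented and `five_number_summary` is a hypothetical helper name. Different tools use slightly different quartile conventions, so the quartiles here may not match another package exactly.

```python
import statistics

# Hypothetical sample, invented for illustration only.
ages = [4, 9, 15, 22, 27, 31, 33, 38, 46, 58]


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
    }


summary = five_number_summary(ages)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

The box spans the two quartiles with a line at the median, which is why a box plot is more compact than a histogram but hides details such as multiple peaks.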

Highlights

Introduction to histograms and kernel density plots for visualizing data distributions.

Example of recording ages for a sample size of 50 and the importance of visualizing the distribution for continuous variables.

Discussion on manually calculating frequency tables and converting frequencies into proportions or percentages.

The choice of bins (or categories) is flexible, such as 0-10, 10-20 or 0-15, 15-30, and how that affects the resulting distribution.

The importance of consistency in assigning values on bin borders, e.g., should an age of 20 go in the 10-20 or 20-30 bin?

Histograms allow for visual representation of data distributions, with continuous variables shown as bars that touch.

The number of bins in a histogram is typically between 5 and 10 for an effective visualization.

Changing the bin sizes in a histogram can alter the shape of the distribution, making it important to choose bin sizes carefully.

Histograms reveal the center and spread of a data set and help identify if the distribution is symmetric or skewed.

Box plots offer another way to summarize a numeric variable, providing a different visualization compared to histograms.

Kernel density plots provide a smoothed-out version of histograms to avoid the fluctuations caused by different bin choices.

Kernel density plots are particularly useful for approximating the probability distribution of a numeric variable.

Histograms and kernel density plots are essential for summarizing the distribution of numeric or continuous variables in data.

These visual tools give an estimate, based on a sample, of the underlying probability distribution, laying the foundation for understanding probability.

The video sets the stage for further exploration of probability distributions and statistical concepts in upcoming lessons.
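As a rough stand-in for the hand-drawn plot in the video, the frequency table can be rendered as a text histogram. The sketch uses only the four frequencies spoken in the video (5, 4, 8 and 10); the bars run sideways, and `render_histogram` is a hypothetical helper name.

```python
# Frequencies stated in the video for the first four bins; the rest are not given.
frequencies = [("0-10", 5), ("10-20", 4), ("20-30", 8), ("30-40", 10)]


def render_histogram(rows, symbol="#"):
    """Draw one bar per bin; the bar length equals the bin's frequency."""
    lines = []
    for label, count in rows:
        lines.append(f"{label:>5} | {symbol * count} {count}")
    return "\n".join(lines)


print(render_histogram(frequencies))
```

Because every character has the same width, the bars are automatically equal in width and adjacent, mirroring the two properties of a histogram that the video stresses.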

Transcripts

play00:00

so let's talk a little bit about

play00:02

histograms as well as kernel density

play00:05

plots so what are they why are they

play00:08

useful we'll start with a simple example

play00:10

supposing we've collected a sample of

play00:13

size 50 and recorded the ages for a

play00:15

bunch of individuals so histograms as

play00:18

well as density plots can be used to

play00:21

help visualize the distribution of our

play00:23

sample and for a numerical or

play00:25

quantitative or continuous variable so

play00:28

we're going to go through and do some of

play00:30

this stuff by hand

play00:31

in reality we won't ever do them by hand

play00:33

we'll have a piece of software or a

play00:35

computer do them for us but we're going

play00:37

to go through and do it by hand for the

play00:38

sake of discussion and exploring

play00:40

what exactly these plots are the first

play00:43

thing we need to do is start by talking

play00:44

about a frequency table what we can do

play00:47

is we can when we're going to want to

play00:50

look at the distribution for our

play00:51

variable age and we can go through and

play00:54

create a bunch of bins or buckets or

play00:57

whatever we want to call them so I'll

play00:59

look at 0 to 10 10 to 20 20 to 30 30 to

play01:06

40 and so on then what we're doing next

play01:10

is count how many people fall into each

play01:12

of these so I guess one thing worth

play01:13

mentioning before we get into that I've

play01:15

chosen these bins or categories to be 0

play01:18

to 10 10 to 20 and so on you or a

play01:21

computer might choose slightly different

play01:23

bins so they might choose 0 to 15 15 to

play01:27

30 30 to 45 and so on okay so this is

play01:31

just a choice I've made here for the

play01:33

sake of discussion at the moment so the

play01:36

next thing we can do is calculate the

play01:38

frequency or how many people fall into

play01:41

each of these different bins so let's

play01:44

just suppose that we had 5 people

play01:46

falling in the 0 to 10 four falling into

play01:50

10 to 20 and I'll just fill in the rest

play01:52

we can fill this to be the total all

play01:56

right and in total we've got 50

play01:59

individuals so the next thing we can do

play02:01

is convert these frequencies into either

play02:04

proportions or percentages and so I'm

play02:06

going to go through and write the

play02:07

percentage but it doesn't really matter

play02:09

so this frequency of 5 ends up

play02:13

being 10% and if you want to record as a

play02:15

proportion be 0.10 this four is 8% you

play02:21

get a proportion of 0.08 16% for a

play02:26

total of 100% as we said this keyword

play02:29

distribution this shows the distribution

play02:34

for the variable age right so again

play02:38

how are people distributed amongst the

play02:40

different bins or categories that we've

play02:42

created now a few things to mention what

play02:44

do we do with observations that fall on

play02:45

the border and what I mean by that is an

play02:47

age of 20 does it go in this bin here or

play02:50

this bin here doesn't really matter as

play02:52

long as you're consistent so if you

play02:54

decide that an age of 20 is going to go

play02:56

to the 20 to 30 bin okay or the one

play02:58

above then any ones that fall on the

play03:00

border should always go to the category

play03:02

above again a software when it does this

play03:05

it will have a default value you can

play03:07

change that if you want when doing with

play03:09

a piece of software it will choose the

play03:10

bins for you as well as the number you

play03:12

can have it change and do more or less

play03:15

bins you can modify those if you want

play03:16

usually it's good to create somewhere

play03:19

between five to ten bins

play03:20

it's worth noting that if we change the

play03:23

bins this frequency distribution is

play03:25

going to change slightly so let's talk a

play03:27

bit about how we can make a picture or a

play03:29

plot of this right again

play03:30

it's a bit well not complicated but it's

play03:33

a bit messy to look at this a visual is

play03:35

going to help us out a lot the visual we

play03:37

can create is called the histogram

play03:39

essentially what we do is create a plot

play03:42

of this table so along the x-axis goes

play03:47

the variable age 0 to 10 20 and then on

play03:53

the y-axis we can either put the

play03:55

frequency or we can put the proportion

play03:58

or the percentage here I'm going to

play04:00

choose to put the frequency but again it

play04:02

doesn't matter the plot will look the

play04:03

same it's really about which one you

play04:05

prefer to put down here is 0 up to 10 1

play04:10

2 3 4 6 7 8 9 now we can see in the 0 to

play04:17

10 bin we have 5 people

play04:22

so I'll shade that in here in the 10 to

play04:26

20 we have four 20 to 30 we've got 8 30

play04:31

to 40 we've got 10 now again another

play04:34

keyword this here is a nice visual right

play04:37

again it shows us the distribution

play04:39

how are people distributed for the

play04:41

variable age now when doing this plot

play04:44

then each of the bars should be equal

play04:46

width okay they are for the most part I

play04:49

studied statistics not art so maybe it's

play04:53

a little bit off here you will notice

play04:55

that each of the bars are touching right

play04:57

and again that's because age is this

play04:58

continuous measurement there's no space

play05:01

between the categories or groupings

play05:03

where when making a bar chart we left

play05:06

space between to indicate they're

play05:07

separate distinct categories as

play05:09

mentioned before if we were to change

play05:11

the bins right if we were to make them

play05:13

bigger say 0 to 20 20 to 40 and so on

play05:16

the shape of this might change slightly

play05:18

so it's important to note that and it's

play05:22

important to note that this plot here

play05:23

helps us visualize a lot about the

play05:26

variable age right we can kind of see

play05:28

what's the center so later we'll

play05:31

summarize those using things like means

play05:33

or medians but we can see the center of

play05:35

the distribution looks somewhere around

play05:36

here how spread out are things right are

play05:39

ages highly variable or is it pretty

play05:41

narrow again we'll find ways to

play05:43

numerically summarize Center and spread

play05:46

very soon it also tells us the shape of

play05:49

the distribution does it look fairly

play05:50

symmetric or does it look a little bit

play05:51

skewed in one direction these are all

play05:54

things we'll tighten up the wording and

play05:56

definitions on soon and it's also worth

play05:58

mentioning another plot for summarizing

play06:02

a numeric variable the distribution of

play06:04

a numeric variable is a box plot which

play06:06

is similar to a histogram well similar

play06:08

in what it tries to show but a

play06:09

different way of trying to summarize a

play06:11

numeric variable later we're going to

play06:13

kind of build on all these concepts just

play06:15

mentioned here another important related

play06:18

concept is the idea of a kernel density

play06:21

often the word just density gets thrown

play06:23

around and the kernel gets left out of it

play06:26

what a kernel density without getting

play06:27

into the kind of mathematical and

play06:29

technical details of it what it is is

play06:31

this sort of a smoothing out of this

play06:33

histogram here and again it's trying to

play06:35

get

play06:36

around the idea that I've mentioned if

play06:37

we change the bins slightly the shape

play06:39

of this is going to change a little bit

play06:41

okay so rather than bumping around so

play06:43

much it tries to smooth this out a way

play06:49

to think about it without getting into

play06:51

the technical details of it is it's sort

play06:54

of a smoothed-out version of the

play06:55

histogram

play06:56

okay so again the histogram or the

play06:59

kernel density plot they help us summer

play07:01

well they're useful for helping us

play07:03

summarise the distribution of a sample

play07:06

for a numeric or continuous variable and

play07:08

they give us a sort of estimate of what

play07:10

the probability distribution will look

play07:13

like and again probability distribution

play07:15

is another concept we're going to build

play07:16

on and expand on in following videos

play07:19

guys like the video subscribe to our

play07:23

channel cuz we got lots more almost as

play07:28

beautiful as a unicorn
