Boxplots in Statistics | Statistics Tutorial | MarinStatsLectures

MarinStatsLectures-R Programming & Statistics
27 Aug 2019 | 08:05

Summary

TLDR: This video explains the boxplot, a statistical tool that visualizes a data distribution through Tukey's five-number summary: minimum, first quartile, median, third quartile, and maximum. The speaker uses an example of 50 individuals' heights to show how the boxplot displays the data's median, interquartile range (IQR), and outliers. The video also covers the calculation of 'fences' that define outliers and mentions related visualization tools such as variable-width box plots, notched box plots, and violin plots. It emphasizes understanding these elements rather than calculating them by hand.

Takeaways

  • 📊 A boxplot visually displays the distribution of a dataset and is useful for summarizing the data's spread.
  • 🧼 The boxplot shows Tukey’s five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
  • 📏 The median, marked inside the box, represents the middle value, splitting the dataset into two equal halves.
  • 📉 Q1 represents the 25th percentile, meaning 25% of the data falls below this value.
  • 📈 Q3 represents the 75th percentile, meaning 75% of the data falls below this value.
  • 📐 The interquartile range (IQR) is the range between Q3 and Q1, showing the spread of the middle 50% of the data.
  • 🚫 The whiskers extend to the minimum and maximum values, excluding outliers, which are represented as individual points.
  • 🔍 Outliers are defined as values outside the upper and lower 'fences,' calculated as 1.5 times the IQR beyond Q3 and Q1.
  • 🎻 Violin plots and notched boxplots are alternative visualizations, combining density estimates or adding notches around the median.
  • 📉 Boxplots help identify the shape of the distribution, whether it is symmetric or skewed, based on the data layout.
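
As a rough illustration of the five-number summary listed above, the sketch below computes it with Python's standard library. The nine heights are hypothetical (they are not the 50 values from the video) and were chosen so the quartiles land near the video's readings of 63, 66, and 70 inches. The `method="inclusive"` choice is also an assumption: software packages use several quartile conventions that can give slightly different answers.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    # "inclusive" is one of several quartile conventions; others may differ slightly.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical heights in inches, for illustration only.
heights = [58, 61, 63, 65, 66, 68, 70, 72, 75]
summary = five_number_summary(heights)
print(summary)
```

The box of a boxplot spans the second and fourth values of this tuple (Q1 and Q3), with the median drawn as a line inside it.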

Q & A

  • What does a boxplot show?

    -A boxplot shows the distribution of a dataset, visually displaying key summary statistics like the minimum, first quartile, median, third quartile, and maximum values. It helps in understanding the spread and skewness of the data.

  • What is Tukey’s five-number summary?

    -Tukey’s five-number summary consists of the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. These are used to describe the spread and center of the dataset.

  • What does the median represent in a boxplot?

    -The median represents the middle value of the dataset, where 50% of the data is below it and 50% is above. In the given example, the median height is approximately 66 inches.

  • What does the first quartile (Q1) represent?

    -The first quartile (Q1) represents the value below which 25% of the data lies. In the example, Q1 is around 63 inches, meaning 25% of the individuals have a height of 63 inches or less.

  • What is the third quartile (Q3) and what does it indicate?

    -The third quartile (Q3) is the value below which 75% of the data lies. In the example, Q3 is approximately 70 inches, indicating that 75% of individuals are 70 inches or shorter.

  • What is the interquartile range (IQR) in a boxplot?

    -The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1), representing the range of the middle 50% of the data. In this example, the IQR is 7 inches (70 - 63).

  • How are outliers represented in a boxplot?

    -Outliers are represented as individual points outside the 'fences,' which are calculated using 1.5 times the IQR added to Q3 for the upper fence and subtracted from Q1 for the lower fence.

  • How is the upper fence calculated in a boxplot?

    -The upper fence is calculated by adding 1.5 times the IQR to the third quartile (Q3). For example, with Q3 at 70 inches and the IQR at 7 inches, the upper fence is at 80.5 inches (70 + 1.5 * 7).

  • What is the lower fence and how is it calculated?

    -The lower fence is calculated by subtracting 1.5 times the IQR from the first quartile (Q1). In the example, with Q1 at 63 inches and the IQR at 7 inches, the lower fence is at 52.5 inches (63 - 1.5 * 7).

  • What are variable width boxplots and how are they used?

    -Variable width boxplots are used to compare multiple distributions, like the heights of males versus females. The width of each boxplot is proportional to the sample size, offering a comparison not only of the distribution but also of the sample size.
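
The fence arithmetic in the answers above can be checked with a few lines of Python. This is a minimal sketch: the function name `fences` and the parameter `k` are illustrative choices, not anything named in the video, and the inputs are the video's approximate quartiles (Q1 = 63, Q3 = 70).

```python
def fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data
    return q1 - k * iqr, q3 + k * iqr

# Quartiles read off the video's example boxplot.
lower, upper = fences(63, 70)
print(lower, upper)  # 52.5 80.5
```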

Outlines

00:00

📊 Understanding Boxplots and Their Components

In this paragraph, the speaker introduces the concept of a boxplot, explaining its usefulness in visualizing the distribution of a variable, in this case, height. The boxplot represents what is known as Tukey's five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. These values help to show the central tendency and spread of data. The paragraph details how the median divides the data in half and explains the meaning of Q1 (25th percentile) and Q3 (75th percentile), emphasizing the significance of these points in understanding the distribution of the sample.

05:03

📏 Calculating Outliers Using Fences

This section focuses on calculating outliers using the concept of 'fences' in a boxplot. The upper fence is determined by adding 1.5 times the interquartile range (IQR) to Q3, while the lower fence is calculated by subtracting 1.5 times the IQR from Q1. Any value beyond these fences is considered an outlier, and the boxplot visually separates outliers from the rest of the data. The paragraph walks through an example of calculating the fences based on the IQR and points out that outliers are marked individually on the plot. It also notes that software typically performs these calculations.
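
To make the separation between whiskers and outliers concrete, here is a small sketch of what plotting software does with the fences. The data are hypothetical, and the quartiles are passed in as the video's approximate readings (63 and 70) rather than computed from the sample, so the numbers line up with the worked example.

```python
def whiskers_and_outliers(values, q1, q3, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) under the k * IQR rule."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme observations that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Anything beyond a fence is drawn as an individual point.
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return min(inside), max(inside), outliers

# Hypothetical heights in inches; fences are 52.5 and 80.5 for Q1=63, Q3=70.
heights = [55, 61, 63, 65, 66, 68, 70, 74, 79, 83, 86]
low, high, out = whiskers_and_outliers(heights, q1=63, q3=70)
print(low, high, out)  # 55 79 [83, 86]
```

Note that the whiskers stop at actual data values (55 and 79 here), not at the fences themselves.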

Keywords

💡Boxplot

A boxplot is a visual representation of a data distribution, used to display the spread and central tendency of a dataset. It shows the minimum, first quartile (Q1), median, third quartile (Q3), and maximum values. In this video, the boxplot illustrates the distribution of heights for 50 individuals.

💡Median

The median is the middle value in a dataset, with 50% of the data below it and 50% above it. In the context of the video, the median height is shown at 66 inches, splitting the sample of individuals into two equal halves based on their height.

💡First Quartile (Q1)

The first quartile (Q1) is the value below which 25% of the data falls. In the boxplot shown in the video, Q1 represents the height at which a quarter of the sample is shorter, around 63 inches in this case.

💡Third Quartile (Q3)

The third quartile (Q3) marks the value below which 75% of the data lies. In the video example, Q3 corresponds to 70 inches, meaning three-quarters of the individuals are shorter than this height.

💡Interquartile Range (IQR)

The interquartile range (IQR) is the range between the first and third quartiles (Q3 - Q1), which shows the spread of the middle 50% of the data. In the video, the IQR is calculated as 7 inches, representing the variation in heights between Q1 (63 inches) and Q3 (70 inches).

💡Outlier

Outliers are data points that fall outside the typical range, defined by specific thresholds known as 'fences' in the boxplot. In the video, the presenter explains that outliers are values lying more than 1.5 times the IQR beyond the quartiles, and shows that they are drawn as individual points in the plot.

💡Upper and Lower Fences

The upper and lower fences define the boundaries for identifying outliers. The upper fence is calculated as Q3 + 1.5 * IQR, while the lower fence is Q1 - 1.5 * IQR. Values beyond these fences are considered outliers, as illustrated in the boxplot example from the video.

💡Minimum and Maximum (Excluding Outliers)

The minimum and maximum values in a boxplot represent the lowest and highest data points that are not classified as outliers. In the video, the presenter demonstrates how these values are marked by the end of the 'whiskers' of the boxplot, excluding outliers beyond the fences.

💡Skewness

Skewness describes the asymmetry in the distribution of data. In the video, the presenter draws a skewed boxplot to show how heights can drag out to the right, indicating a right-skewed (positively skewed) distribution where the tail extends toward higher values.
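
The video judges skewness by eye from the layout of the box. One way to put a number on the same idea is the quartile (Bowley) skewness coefficient, which compares the upper and lower halves of the box. This measure is not mentioned in the video; it is offered here only as a hedged sketch, and the second and third sets of quartiles are hypothetical.

```python
def quartile_skewness(q1, median, q3):
    """Quartile (Bowley) skewness: 0 if the median is centered in the box,
    positive if the upper half of the box is longer (right skew),
    negative if the lower half is longer (left skew)."""
    return ((q3 - median) - (median - q1)) / (q3 - q1)

# Video's approximate readings: Q1=63, median=66, Q3=70.
video_example = quartile_skewness(63, 66, 70)   # slightly positive
symmetric = quartile_skewness(60, 65, 70)       # hypothetical, centered median
right_skewed = quartile_skewness(55, 60, 80)    # hypothetical, long upper half
print(video_example, symmetric, right_skewed)
```

This only looks at the box, so it ignores whisker lengths and outliers, which also inform the visual impression of skew.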

💡Violin Plot

A violin plot is a variation of the boxplot that combines a boxplot with a density plot, creating a symmetrical shape resembling a violin. It shows the distribution of the data across different values, with a smoother visual display. The presenter briefly mentions this plot as a related concept for visualizing distributions.

Highlights

Introduction to box plots and their usefulness in visualizing data distributions.

The example box plot summarizes the recorded heights of a sample of 50 individuals.

The box plot shows the distribution of the variable height.

Box plots are based on Tukey's five-number summary: minimum, first quartile, median, third quartile, and maximum.

The median, shown by the tick inside the box, represents the middle value of the dataset.

The first quartile (Q1) represents the height below which 25% of the sample falls.

The third quartile (Q3) represents the height below which 75% of the sample falls.

The interquartile range (IQR) is the range of the middle 50% of the data, calculated as Q3 minus Q1.

The 'whiskers' in the box plot extend to the minimum and maximum values, excluding outliers.

Outliers are identified using the upper and lower fences, which sit 1.5 times the IQR beyond the quartiles.

Software typically handles the calculation of fences and outliers automatically.

Upper fence = Q3 + 1.5 * IQR; lower fence = Q1 - 1.5 * IQR.

Example: For Q3=70 and Q1=63, the IQR is 7, making the upper fence 80.5 and lower fence 52.5.

Box plots also help in visualizing data symmetry and skewness, showing whether the data is balanced or skewed.

Related visualizations include variable-width box plots, notched box plots, and violin plots, which add more details.

Transcripts

play00:00

so let's talk a little bit about what a

play00:01

boxplot is exactly what it's showing and

play00:04

what it's useful for we're gonna use

play00:07

this example here of having collected a

play00:09

sample of 50 individuals and recording

play00:11

their heights and over here I've already

play00:13

drawn in a box plot so let's discuss

play00:16

exactly what info is being shown in this

play00:18

plot or this picture here as noted in a

play00:21

number of these videos this here shows

play00:22

the distribution for the variable height

play00:25

and again we'll talk about exactly what

play00:27

we mean by that as we go through and

play00:29

discuss this plot it's also worth

play00:31

mentioning that this plot is a visual

play00:33

display of what gets called Tukey's

play00:35

five-number summary the minimum the

play00:38

first quartile median third quartile and

play00:40

maximum so we'll show those where they

play00:43

are on the plot as well as in separate

play00:45

videos talk about what is the first

play00:47

quartile and so on so the first thing to

play00:49

note is that the tick in the box here is

play00:52

showing us the median and in this

play00:56

example the median looks like it's about

play01:00

66 inches so the median which we

play01:03

formally defined in a separate video is

play01:05

the point that cuts the data set in half

play01:09

50% below 50% above so 50% of the

play01:13

individuals in our sample have a height

play01:15

of 66 inches or less 50% 66 inches or

play01:20

more the bottom of the box here shows

play01:23

what gets called the first quartile and

play01:25

abbreviated q1 q1 or the first quartile

play01:30

and again this is 25% or 1/4 below it

play01:37

and in our example here it looks like

play01:41

it's at about 63 inches so again 1/4 or

play01:46

25% of individuals in our sample have a

play01:50

height of 63 inches or less right and

play01:52

3/4 are greater than that the top of the

play01:55

box here shows what gets called the

play01:57

third quartile or abbreviated q3 so

play02:01

let's write that here this is the third

play02:04

quartile and it looks like it's about 70

play02:09

inches right so again in our sample

play02:12

75% or three-quarters have a height of

play02:16

70 inches or less let's write that here

play02:20

for completeness sake 75% or

play02:24

three-quarters are below this value and

play02:27

so the size of the box here gets

play02:31

abbreviated the IQR or what's called the

play02:35

interquartile range and we'll talk a bit

play02:37

more about this in more detail in a

play02:39

separate video but it's the third

play02:42

quartile minus the first quartile or the

play02:46

range of the middle 50% of data right so

play02:51

what's the range of the 50% sitting in

play02:54

the middle cutting off the bottom

play02:55

quarter and the top quarter now these

play02:57

lines here extend right in the line and

play03:01

where the tick is here is what's known

play03:03

as the minimum value excluding outliers

play03:08

and the top up here that is the maximum

play03:12

value excluding outliers and any

play03:17

outliers get drawn in individually and

play03:21

in a moment we'll get to talking about

play03:22

how do we decide what's defined as an

play03:24

outlier versus what's the maximum value

play03:27

that's not an outlier so this plot here

play03:30

helps us visualize the distribution for

play03:33

the variable height so I'm going to draw

play03:35

another version of it over here slightly

play03:37

different exaggerated version and I'm

play03:39

going to tip it on its side just so we

play03:40

can see exactly what I'm trying to say

play03:43

so suppose we had the box plot looking

play03:46

like this again here's the height and

play03:49

the lower end about 50 and the upper end

play03:53

say 100 so again I've taken this box

play03:56

plot and just tipped it on its side here

play03:58

and we're looking at this we can see

play04:01

that it drags out to the right side or

play04:03

the positive side okay so the

play04:05

distribution is a little bit skewed out

play04:08

to the right here okay so again it helps

play04:10

us visualize is the distribution of

play04:12

heights roughly symmetric as shown here

play04:15

or is it a little bit more skewed as

play04:16

I've shown in this exaggerated example

play04:18

here so that's what I mean by helping us

play04:20

visualize the shape of the distribution

play04:22

so in order to decide what's an outlier

play04:25

versus

play04:26

the maximum or minimum values that's not

play04:28

an outlier what we can do is calculate

play04:31

what we call an upper fence as well as a

play04:33

lower fence anything within the fence is

play04:36

not considered an outlier anything

play04:38

beyond the fence is an outlier now all

play04:40

this stuff will be done using software

play04:42

we won't generally do it by hand but

play04:44

we're gonna do the calculations here

play04:45

once by hand so we can get a better

play04:46

understanding of how is the plot

play04:48

produced to define the upper fence we'll

play04:50

look at that first the way that's

play04:52

calculated is we go from q3 or the third

play04:56

quartile and we add on one and a half

play04:59

times the interquartile range so in our

play05:02

example we saw the third quartile at 70

play05:05

and if we add one and a half iqrs if the

play05:09

interquartile range is 70 minus the 63

play05:15

or seven right the range between q3 and

play05:19

q1 or these quartiles is seven so we

play05:22

take 70 Plus one-and-a-half times seven

play05:25

that gives us 80.5 okay so

play05:29

going here from 80.5 this is

play05:32

where the upper fence gets drawn in so

play05:36

suppose we had values sitting here here

play05:38

and here this is the largest value

play05:40

that's within the fence

play05:42

okay not an outlier so the tick gets

play05:44

drawn in there that's the maximum value

play05:46

excluding outliers and then any outliers

play05:48

get drawn in as individual points

play05:50

creating the lower fence again is pretty

play05:53

similar in order to do that we go from

play05:56

q1 minus 1.5 interquartile ranges so

play06:01

again we saw q1 the first quartile is 63

play06:05

and minus 1.5 times the

play06:08

interquartile range which was seven

play06:10

gives us a value of 52.5 so

play06:14

roughly right around here again this is

play06:17

the lower fence so the tick okay at the

play06:20

bottom of the whisker here goes to what

play06:23

was the smallest value that is not

play06:26

considered an outlier not outside the

play06:28

fence so this would be the minimum value

play06:29

within the fence and then

play06:32

and a tick mark gets drawn there here

play06:34

we can see at the bottom there are no

play06:36

values sitting beyond the fence or as

play06:38

outliers one final thing to mention

play06:40

before we wrap this up is that other

play06:42

definitions or other ways to define

play06:44

outliers exist this is probably the most

play06:46

commonly used one and this is what's

play06:48

used to produce the box plot so we're

play06:50

mentioning that here again you shouldn't

play06:52

focus on doing these calculations you

play06:53

probably should never be asked to do

play06:55

these by hand but it helps us understand

play06:57

how do we decide what's an outlier as

play06:59

well as what's the maximum value that's

play07:01

not considered an outlier when producing

play07:03

a box plot it's also worth mentioning

play07:05

some closely related plots to the box

play07:08

plot are variable width box plots so

play07:11

what those are is when we have multiple

play07:13

box plots say you know heights of males

play07:15

versus heights of females we may have

play07:17

side by side box plots and the width of

play07:19

the box plot is proportional to the

play07:21

sample size there's also notched box plots

play07:24

that often draw a little notch at where

play07:27

the median is as well as violin plots

play07:30

those look sort of like this here we

play07:33

drew this nice smoothing it does it on

play07:35

each side and tilt it so rather than

play07:38

being a block it has sort of a violin

play07:40

shape that's a combination of a box plot

play07:43

as well as a density plot in there the

play07:45

box plots are the most commonly viewed ones but

play07:47

you can explore those other ones on your

play07:48

own if you want to learn a little bit

play07:50

more
