Bar Chart, Pie Chart, Frequency Tables | Statistics Tutorial | MarinStatsLectures

MarinStatsLectures-R Programming & Statistics
13 Aug 2019 (07:35)

Summary

TLDR: This script discusses summarizing categorical variables graphically and numerically. It uses smoking status as an example, with categories like never, past, and current smokers. The script explains creating frequency tables, converting frequencies to proportions or percentages, and emphasizes the importance of distribution. It also covers visual representations like bar charts and pie charts, recommending against 3D pie charts for clarity. The key takeaway is summarizing categorical data by counting occurrences and converting them to proportions or percentages for better understanding.

Takeaways

  • 📊 To summarize a categorical variable, count the frequency of individuals in each category and convert these counts into proportions or percentages.
  • 🔱 For larger sample sizes, it's more meaningful to report proportions or percentages rather than raw frequencies.
  • 📈 A frequency table or distribution is a fundamental way to organize and display the data for categorical variables.
  • 📋 The distribution of cases among different categories is a key concept in statistics, often visualized through graphical representations.
  • 📊 Bar charts are effective for visualizing the distribution of categorical variables, with the x-axis representing categories and the y-axis representing frequencies, proportions, or percentages.
  • 🍕 Pie charts provide another visual representation where each slice of the pie corresponds to a category's proportion of the total sample.
  • ⚠ With smaller sample sizes, reporting frequencies might be more meaningful and easier to interpret than proportions or percentages, which could be misleading.
  • 🎹 When creating pie charts, ensure that the slices are proportional to the data they represent; avoid 3D pie charts as they can distort perceptions of size.
  • 📝 It's important to label charts clearly, including the percentages or proportions within pie chart slices for better understanding.
  • 💡 Bar charts and pie charts both display the same distribution; whether to show frequencies, proportions, or percentages depends mainly on the sample size.
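
The counting-and-converting procedure in these takeaways can be sketched in a few lines of Python. This is a minimal illustration, not code from the video: the list `responses` and the helper `frequency_table` are hypothetical names, and the raw data are simply constructed to match the video's counts (110 never, 50 past, 40 current).

```python
from collections import Counter

# Hypothetical raw data: one label per person, built to match the video's
# counts (110 never, 50 past, 40 current; n = 200).
responses = ["never"] * 110 + ["past"] * 50 + ["current"] * 40


def frequency_table(labels):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }


table = frequency_table(responses)
for category, (freq, prop, pct) in table.items():
    # One row per category: frequency, proportion, percentage.
    print(f"{category:<8} {freq:>4} {prop:>6.2f} {pct:>5.0f}%")
```

Each row holds the three summaries side by side, so the proportions add up to 1 and the percentages to 100 across the categories.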

Q & A

  • What are the three methods discussed for summarizing a categorical variable?

    -The three methods discussed for summarizing a categorical variable are using a frequency table, converting frequencies into proportions or relative frequencies, and reporting these as percentages.

  • What is the significance of a frequency table in summarizing categorical data?

    -A frequency table is significant as it counts how many individuals fall into each category of the variable, providing a clear distribution of the data across different categories.

  • Why might proportions or percentages be more meaningful than frequencies with larger sample sizes?

    -Proportions or percentages might be more meaningful with larger sample sizes because they provide a relative measure that is independent of the sample size, making it easier to compare distributions across different samples.

  • What is the difference between a proportion and a percentage in the context of categorical data?

    -In the context of categorical data, a proportion is the ratio of the number of observations in a category to the total number of observations, while a percentage is that proportion multiplied by 100, so the ratio is expressed out of 100 rather than out of 1.

  • Why is it recommended to avoid 3D pie charts when summarizing categorical data?

    -3D pie charts are recommended to be avoided because they can distort the perception of the data's distribution, making some slices appear larger than they actually are, which can mislead the interpretation of the data.

  • What is the importance of visual representations like bar charts and pie charts in data summary?

    -Visual representations like bar charts and pie charts are important because they provide a quick and intuitive way to understand the distribution of data across different categories, making complex data easier to interpret.

  • How does the choice of visual representation (bar chart or pie chart) affect the perception of data distribution?

    -The choice of visual representation can significantly affect the perception of data distribution. Bar charts clearly separate categories and are good for comparing proportions, while pie charts show parts of a whole but can be misleading if not presented in 2D.

  • What is the recommended approach when dealing with smaller sample sizes in categorical data?

    -When dealing with smaller sample sizes, it is recommended to report frequencies instead of proportions or percentages, as they provide a more direct and less potentially misleading representation of the data.

  • Can you provide an example of how to calculate the proportion for a category from the transcript?

    -Yes, for the category 'never smokers' with 110 individuals out of a sample size of 200, the proportion is calculated as 110/200, which equals 0.55.

  • What is the main principle of producing a plot that 3D pie charts might violate?

    -3D pie charts might violate the principle of accurately representing data proportions, as the added depth can distort the visual perception of the size of the slices, leading to a misleading representation of the data.
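
The point about comparing samples of different sizes can be illustrated with the two sets of counts mentioned in the video (n = 200 and n = 20). The sketch below is only an illustration; the helper `proportions` and the variable names are assumptions, not part of the original material.

```python
def proportions(counts):
    """Convert a {category: frequency} mapping into {category: proportion}."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


large_sample = {"never": 110, "past": 50, "current": 40}  # n = 200, from the video
small_sample = {"never": 11, "past": 5, "current": 4}     # n = 20, from the video

large_props = proportions(large_sample)
small_props = proportions(small_sample)

for category in large_sample:
    # Frequencies differ by a factor of ten; proportions line up.
    print(
        category,
        large_sample[category],
        small_sample[category],
        round(large_props[category], 2),
        round(small_props[category], 2),
    )
```

The raw frequencies are not directly comparable between the two samples, whereas the proportions are on the same scale regardless of sample size.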

Outlines

00:00

📊 Summarizing Categorical Data

This paragraph discusses methods for summarizing a categorical variable both graphically and numerically. The example given is the smoking status of individuals, recorded as 'never,' 'past,' or 'current' smoker, in a sample of 200. The primary method of summarization is to count how many individuals fall into each category and report those counts as frequencies, relative frequencies (proportions), or percentages. A frequency table, also called a frequency distribution, is introduced as a way to record these counts. The narrative then shifts to visual representations, presenting bar charts and pie charts as graphical tools for displaying the distribution of categorical data. The paragraph notes that proportions or percentages tend to be more meaningful for larger samples, while raw frequencies are easier to interpret for smaller ones.

05:01

📈 Visual Representations of Categorical Data

The second paragraph delves into the specifics of creating bar charts and pie charts for visualizing categorical data. It explains that a bar chart should have the variable categories along the x-axis and the frequency, proportion, or percentage along the y-axis. The example provided uses proportions for the smoking status categories, with 'never smokers' at 55%, 'past smokers' at 25%, and 'current smokers' at 20%. The paragraph also addresses the creation of pie charts, where each category's slice of the pie is proportional to its representation in the sample. A cautionary note is sounded against the use of 3D pie charts, as they can mislead by making smaller portions appear larger due to the added depth. The paragraph concludes with a recommendation to avoid 3D pie charts in favor of clearer, more accurate 2D representations.

Keywords

💡Categorical Variable

A categorical variable is a type of data that represents a group name or a category to which an individual belongs. In the video, the smoking status of individuals is used as an example of a categorical variable, with categories like 'never smoker', 'past smoker', and 'current smoker'. This variable is essential for understanding the theme of the video, which is about summarizing categorical data.

💡Frequency

Frequency refers to the count of how many times each category occurs within a dataset. The video script uses the example of a sample size of 200 individuals, with 110 classified as 'never smokers', 50 as 'past smokers', and 40 as 'current smokers' to illustrate frequency. This concept is fundamental to the video's discussion on summarizing categorical data.

💡Relative Frequency

Relative frequency, also known as a proportion, is the ratio of the number of observations in a particular category to the total number of observations. The video explains how to calculate relative frequencies by dividing the frequency of each category by the total sample size, as shown with the smoking status example. This is a key concept in the video for numerically summarizing categorical data.

💡Percentage

Percentage is a way of expressing a proportion as a part of a whole, represented out of 100. In the video, the script converts the relative frequencies of smoking statuses into percentages (55%, 25%, and 20%) to provide a clear and easily interpretable summary of the data. Percentages are highlighted as a useful tool for summarizing categorical variables, especially with larger sample sizes.

💡Frequency Table

A frequency table is a statistical table that displays the frequency distribution of a categorical variable. The video script describes creating a frequency table for the smoking status variable, where each category's frequency is listed. This table is a central tool in the video for visually organizing and summarizing categorical data.

💡Bar Chart

A bar chart is a graphical representation where data is presented using rectangular bars, with lengths proportional to the values they represent. The video suggests using a bar chart to visualize the distribution of smoking statuses, with the x-axis representing the categories and the y-axis representing the proportions. Bar charts are emphasized in the video as an effective way to graphically summarize categorical data.
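
As a rough sketch of the idea, a bar chart can be approximated in plain text. The function `text_bar_chart` and its `width` parameter are hypothetical names chosen for this example, and the chart is drawn sideways (categories run down the page instead of along the x-axis) purely because that is easier to print.

```python
def text_bar_chart(proportions, width=40):
    """Return one line per category; bar length is proportional to the value."""
    lines = []
    for category, value in proportions.items():
        bar = "#" * round(value * width)
        lines.append(f"{category:<8} | {bar} {value:.2f}")
    return lines


# Proportions from the video's smoking status example.
smoking = {"never": 0.55, "past": 0.25, "current": 0.20}
chart = text_bar_chart(smoking)
print("\n".join(chart))
```

Each category gets its own separate bar, mirroring the gaps between bars that signal distinct categories with no continuity between them.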

💡Pie Chart

A pie chart is a circular chart divided into sectors, each representing a proportion of the whole. The video describes how to create a pie chart for the smoking status data, where each sector's size corresponds to the percentage of individuals in each category. Pie charts are presented as a visual tool for summarizing categorical data, with a caution against using 3D pie charts due to potential misinterpretation.
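
One way to make "each sector's size corresponds to the percentage" concrete is to convert the counts into slice angles, since the whole circle is 360 degrees. This is a small sketch under that assumption; `slice_angles` is a hypothetical helper name.

```python
def slice_angles(counts):
    """Return {category: angle in degrees}; the angles sum to 360."""
    total = sum(counts.values())
    return {category: 360 * count / total for category, count in counts.items()}


# Frequencies from the video's smoking status example (n = 200).
angles = slice_angles({"never": 110, "past": 50, "current": 40})
for category, angle in angles.items():
    print(f"{category:<8} {angle:6.1f} degrees")
```

The past smokers, at 25% of the sample, get a 90-degree slice, which is the quarter of the pie described in the video.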

💡Distribution

Distribution in statistics refers to the way data points are spread across different categories or groups. The video script frequently uses the term 'distribution' to discuss how individuals are spread among the smoking status categories. Understanding the distribution is crucial for summarizing and interpreting categorical data.

💡Proportion

Proportion is a measure of the share that each category represents in the total dataset. The video script uses the term 'proportion' interchangeably with 'relative frequency' when discussing the smoking status data. Proportions are calculated and used to summarize the data, showing the relative size of each category within the sample.

💡Sample Size

Sample size refers to the number of observations or individuals included in a study. The video script mentions a sample size of 200 individuals for the smoking status data. The concept of sample size is important because it affects the choice of summarization method, with proportions or percentages being more meaningful for larger samples, while frequencies might be more appropriate for smaller samples.
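
The video does not spell out why percentages can mislead in small samples. One possible illustration, offered here as an assumption and not as the video's own argument, is to look at how many percentage points a single individual contributes. The helper name `points_per_individual` is hypothetical.

```python
def points_per_individual(sample_size):
    """Percentage points contributed by a single individual in the sample."""
    return 100 / sample_size


for n in (200, 20):
    print(f"n = {n:>3}: one person = {points_per_individual(n):.1f} percentage points")
```

With 20 individuals, one person changing category moves a percentage by 5 points, so a figure like "55%" can suggest more precision than 11 out of 20 actually carries.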

Highlights

Discussing how to summarize a categorical or qualitative variable both graphically and numerically.

Using a sample size of 200 to record smoking status as never, past, or current smoker.

Summarizing categorical variables by counting individuals in each category and calculating frequencies.

Converting frequencies into proportions or relative frequencies for better understanding.

Reporting proportions or percentages interchangeably for categorical data summary.

The importance of distribution in statistics and how it relates to categorical variables.

Suggesting that proportions or percentages are more meaningful for larger sample sizes.

Advising that frequencies might be more interpretable than proportions with smaller sample sizes.

Visualizing categorical data through bar charts or pie charts.

Creating a bar chart with the x-axis representing the variable and the y-axis showing frequency, proportion, or percentage.

Spacing bars in a bar chart to indicate separate categories without continuity.

Using pie charts to represent the entire sample with slices proportional to the sample's percentage in each category.

Writing percentages or proportions inside pie chart slices for clarity.

Recommending against 3D pie charts due to their potential to mislead by distorting the perceived size of slices.

Emphasizing the simplicity of summarizing categorical variables by counting and converting to proportions or percentages.

Encouraging viewers to stay tuned for more content on the topic.

Transcripts

play00:00

so let's talk a bit about how to

play00:02

summarize a categorical or qualitative

play00:04

variable both graphically as well as

play00:07

numerically so here for example we'll

play00:10

suppose that we've taken a sample and

play00:12

recorded the smoking status of

play00:15

individuals recorded as never past or

play00:18

current smoker and we'll assume we've

play00:21

taken a sample size of 200 so here we

play00:23

like to use a simple example just for

play00:25

the sake of discussion so the most

play00:28

relevant way to summarize a categorical

play00:30

variable is to count how many people

play00:33

fall into each of the categories or

play00:35

levels of the variable and then

play00:37

summarize that either using a frequency

play00:39

a relative frequency which also gets

play00:42

called a proportion or a percentage so

play00:44

let's take a look at doing that the

play00:46

first thing we need to do is start by

play00:47

talking about a frequency table or what

play00:50

sometimes gets called a frequency

play00:51

distribution and so we have the smoking

play00:54

status and that again we've recorded as

play00:59

never as past or current and again here

play01:07

I'll put down the total so here we can

play01:12

think of recording the frequency or the

play01:15

number that fall into each of these

play01:17

groupings for the categorical variable

play01:19

so we've got a sample size of 200 and

play01:22

let's suppose that 110 responded as

play01:26

never smokers 50 as past and 40 as

play01:31

current then rather than recording the

play01:34

frequencies we can convert this into a

play01:36

proportion or what also gets reported as

play01:39

a relative frequency sometimes so the

play01:44

110 out of the 200 is 0.55 right the 50

play01:49

out of the 200 is 0.25 and the 40 out of

play01:54

the 200 is 0.20 for a total of 1.0 or

play02:00

we can also report these as percentages

play02:05

55% 25% and 20% out of the total 100% ok

play02:12

so for the most part

play02:13

proportion or percentage while there are

play02:15

slight technical differences we'll use

play02:18

the two for the most part

play02:19

interchangeably when we talk about

play02:20

things now an important note about these

play02:23

is that this table here again shows the

play02:28

distribution and that's a key word in

play02:31

statistics you're gonna hear that word

play02:32

thrown around a lot

play02:33

how are cases or individuals distributed

play02:36

amongst the different levels or

play02:38

categories of this categorical variable

play02:40

so one suggestion when you have larger

play02:43

sample sizes it's often a bit more

play02:46

meaningful to report the proportion or

play02:48

the percentage falling in each category

play02:50

if you had smaller sample sizes suppose

play02:53

we only have 20 individuals and we had

play02:56

11 falling as never smokers 5 as past

play02:59

and 4 as current reporting those

play03:01

frequencies is going to be a bit more

play03:02

meaningful or easier to interpret rather

play03:05

than reporting the percentages or

play03:07

proportions which can be a bit

play03:08

misleading with smaller sample sizes now

play03:11

if we want to make a plot of these right

play03:13

it's nice if we can make a visual of

play03:15

this table rather than just looking at a

play03:17

table of numbers especially when we have

play03:19

lots of categories or the table gets

play03:20

bigger we can make either a bar chart or

play03:24

a pie chart so first let's start by

play03:26

talking about the bar chart a bar chart

play03:30

has along the x-axis the variable so

play03:35

here we're looking at smoking status

play03:38

again this was recorded as never past or

play03:42

current and along the y-axis we can put

play03:46

the frequency the proportion or the

play03:48

percentage right the plot's gonna look the

play03:50

same I'm going to choose to put the

play03:52

proportion here since our sample size is

play03:56

not very small I think it's more

play03:58

meaningful to report proportions or

play03:59

percentages and I'll just choose the

play04:01

proportion down here zero up here 0.5

play04:07

0.25 and it's important to mention here

play04:10

you probably will never create any of

play04:12

these by hand you'd use a computer or a

play04:15

piece of software to do these we're

play04:16

going through and looking at doing them

play04:18

by hand for the sake of discussing the

play04:19

concepts and what they are for the never

play04:22

smokers they have a proportion of 0.55

play04:25

roughly up here for the past smokers a

play04:33

proportion of 0.25 and the current

play04:36

smokers the proportion of 0.20

play04:39

in this plot these bars are separated or

play04:42

space between them again to indicate

play04:44

that these are separate categories

play04:46

there's no continuity between the two

play04:47

and as noted before this here also helps

play04:52

show the distribution for this variable

play04:54

right how are people distributed amongst

play04:56

the different categories or levels the

play04:58

one other plot that we can make for this

play05:00

table or for a categorical variable is a

play05:02

pie chart the way a pie chart works is

play05:05

they start with a pie all right all

play05:08

right circle and again this pie or the

play05:11

circle represents the entire sample then

play05:14

what we do is for each category or for

play05:17

each level or category of this

play05:20

variable we draw a slice of the pie and

play05:23

the slice of the pie should be

play05:25

proportional to the percentage of the

play05:28

sample they represent so let's start

play05:30

with the past smokers they're a

play05:34

proportion of 0.25 or 25% of our sample

play05:37

so I'm starting with that because that's

play05:38

the easiest one to draw right it's 1/4

play05:40

of the pie and I'll label this here as

play05:44

being past these are the past smokers

play05:48

and it's also nice if the percentages or

play05:52

proportions are written in there the

play05:54

next are the never smokers they

play05:56

represented 0.55 or 55% roughly here

play06:00

these are the never smokers again 55%

play06:04

and the current are 20% this here shows

play06:11

the distribution for a sample so another

play06:13

visual way of showing this so one

play06:15

personal preference I want to mention

play06:16

here while you often see these 3D

play06:20

pie charts shown because they look kind

play06:21

of cool I'm going to really recommend

play06:23

that you don't do those and it's because

play06:25

they violate one of the main principles

play06:27

of producing a plot and I'm going to

play06:29

show you that here and well my drawings

play06:31

not perfect here but the slice for the

play06:33

past smokers should be a little bit

play06:35

larger than the current smokers right 25

play06:37

percent versus 20%

play06:39

now when you draw these 3D pie

play06:41

charts they kind of look something like

play06:45

this and they end up looking a little

play06:48

bit cooler but part of the problem that

play06:50

they can cause as you can see looking at

play06:52

the slice for current smokers it

play06:54

actually looks a little bit bigger than

play06:55

the past smokers right and that's

play06:57

because your eye attaches all this extra

play06:59

area to the current smokers the

play07:01

proportion of the pie they take up

play07:03

actually looks larger than it should be

play07:04

okay so I'm going to really suggest that

play07:07

you don't do these even though they look

play07:09

kind of cool

play07:09

they tend to be a little bit misleading

play07:11

one of the key takeaways here is the

play07:13

most simple summary for a categorical

play07:15

variable is to count how many people

play07:17

fall into each of the categories and

play07:19

then convert that to a proportion or a

play07:21

percentage stick around guys because we

play07:24

darling

play07:24

lots more hope you guys like the video 6

play07:29

is hard to say


Related Tags
Data Analysis, Categorical Data, Frequency Tables, Proportions, Bar Charts, Pie Charts, Statistical Methods, Data Visualization, Sample Size, Descriptive Statistics