ETC1000 Topic 2b

Brett Inder
17 Feb 202226:35

Summary

TLDRThis instructional video script delves into the nuances of data analysis, emphasizing the importance of understanding and visualizing data effectively. It covers the generation of descriptive statistics in Excel, the creation of frequency distributions and histograms, and the use of box and whisker plots to illustrate data spread and central tendency. The presenter also discusses the significance of probability distributions, particularly the normal distribution, and how to calculate probabilities and identify corresponding data points using Excel functions. The script underscores the necessity of clear and accurate data visualization for effective communication.

Takeaways

  • πŸ“š Watch the first video for Topic Two before this one for better understanding of the material.
  • πŸ“ˆ Excel can generate various descriptive statistics like mean, median, mode, standard deviation, and range with a single step.
  • πŸ“Š Visualizations like histograms and box plots are crucial for effectively communicating data distribution and should be clear and well-presented.
  • πŸ“‰ Numbers alone may not be enough; visualizations help provide a clearer picture of data distribution and are more memorable.
  • πŸ—‚ Creating a frequency distribution table helps in understanding the categorization and range of data, such as income levels.
  • πŸ“ Customizing histogram appearance, including labels and bin ranges, is essential for accurate and effective data representation.
  • πŸ“‰ Box and whisker plots are useful for visualizing quartiles and understanding the spread and central tendency of data.
  • πŸ“Š The choice of bin ranges in histograms can significantly affect the interpretation of data, as demonstrated with grade distributions.
  • πŸ€” Probability distributions, including the normal distribution, are foundational concepts for understanding how random data is distributed.
  • πŸ“ The normal distribution is common in various fields because it represents data that is symmetrically distributed around the mean with probabilities decreasing as values move away from the mean.
  • πŸ”’ Excel's norm.dist function can calculate probabilities for a normal distribution or find the x value for a given probability.

Q & A

  • What is the main focus of the second video in the series?

    -The main focus of the second video is to complete the discussion on the second topic, which involves finishing off the rest of the material after introducing the concept of standardizing data and summarizing its characteristics using measures with quantitative data.

  • Why is it important to watch the first video before the second one?

    -It is important to watch the first video before the second one because the second video continues from where the first left off, and concepts introduced in the first video are built upon in the second, ensuring that the content makes sense and is understood in the correct sequence.

  • What are some of the descriptive statistics measures discussed in the video?

    -The descriptive statistics measures discussed in the video include mean, median, mode, standard deviation, variance, range, minimum, and maximum.

  • How can one obtain all the mentioned descriptive statistics in Excel with a single step?

    -One can obtain all the mentioned descriptive statistics in Excel with a single step by using the 'Descriptive Statistics' tool, which provides all these measures at once, although it may not include all measures like quartiles and interquartile range.

  • What is a frequency distribution and why is it useful?

    -A frequency distribution is a table that shows the number of data points that fall within certain ranges or categories. It is useful because it provides a detailed view of the data, showing how many observations fall into each defined range, which helps in understanding the distribution of the data.

  • How does the video script emphasize the importance of visualization in data analysis?

    -The script emphasizes the importance of visualization by stating that visualizations can be more memorable and clearer than numbers alone. It also mentions that the presentation of data through graphs can be misleading if not done properly, highlighting the need for careful and accurate visualization to communicate the data effectively.

  • What is a histogram and how does it help in understanding data distribution?

    -A histogram is a graphical representation of the distribution of data. It groups data into intervals, or 'bins', and shows the frequency of observations within each bin. It helps in understanding the data distribution by visually showing the concentration of data points, the spread, and any patterns or outliers.

  • What is a box and whisker plot and how does it represent data?

    -A box and whisker plot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It provides a clear visual representation of the spread and skewness of the data, including potential outliers.

  • Why is the normal distribution important in statistics and what does it represent?

    -The normal distribution is important in statistics because it is a symmetrical bell-shaped curve that represents the distribution of many real-world phenomena. It is widely used in statistical inference and hypothesis testing due to its properties and the central limit theorem, which states that the sum of independent and identically distributed variables will be approximately normally distributed regardless of the original distribution.

  • How can Excel be used to calculate probabilities associated with a normal distribution?

    -Excel can be used to calculate probabilities associated with a normal distribution using the NORM.DIST function, which calculates the cumulative probability for a specified value in a normal distribution with a given mean and standard deviation. The function can also be used in reverse to find the value (x) that corresponds to a given probability.

  • What is the significance of standard deviation in the context of the normal distribution?

    -In the context of the normal distribution, the standard deviation indicates the spread of the data around the mean. A smaller standard deviation means the data points are closer to the mean, while a larger standard deviation indicates a greater spread. It is also used to define the ranges within which a certain percentage of data falls, such as within one or two standard deviations from the mean.

Outlines

00:00

πŸ“ˆ Descriptive Statistics and Data Visualization

This paragraph introduces the importance of watching the first video on topic two before proceeding and promises a faster pace for the current video. It discusses the process of standardizing data and using Excel for descriptive statistics such as mean, median, mode, standard deviation, variance, range, minimum, and maximum. The speaker also touches on the limitations of Excel for certain measures like quartiles and interquartile range, suggesting that these will be covered in future lessons. The focus then shifts to the value of visualizations, such as histograms, for better communication and understanding of data distribution, emphasizing the need for clear and accurate presentation to avoid misleading interpretations.

05:02

πŸ“Š Histograms and Box and Whisker Plots for Data Distribution

The speaker explains how to create a frequency distribution table and a histogram using Excel to visualize income data. They highlight the need for adjusting the histogram's appearance for better clarity and communication, including closing gaps between bars and adding labels. The paragraph also introduces the concept of cumulative percentages and percentages to provide a clearer picture of income distribution. Additionally, the speaker mentions the box and whisker plot, which uses quartiles to summarize data distribution, and emphasizes the importance of visual presentation in data analysis.

10:03

πŸ“‰ Box and Whisker Plots and Histogram Improvements

This paragraph delves deeper into the box and whisker plot, discussing its components like the median, quartiles, and outliers. The speaker uses an example of student grades to illustrate how the plot can convey a lot of information about data distribution in a single graph. They also address the limitations of Excel's automatic histogram generation, demonstrating how to improve the histogram by specifying bin ranges and adjusting the presentation to better communicate the distribution of grades, such as aligning bins with grade boundaries.

15:04

πŸ“š Probability Distributions and the Normal Distribution

The speaker transitions to theoretical concepts, starting with the idea of random data and probability distributions. They explain how to represent data as probabilities and the characteristics of a probability distribution, such as being mutually exclusive and exhaustive. The paragraph introduces the normal distribution, a bell-shaped curve that is common in various fields due to its symmetrical nature and the likelihood of values close to the mean. The speaker also discusses the standard normal distribution and how to calculate probabilities associated with it using Excel functions.

20:04

πŸ”’ Understanding the Normal Distribution in Excel

The paragraph focuses on how to work with the normal distribution in Excel, using the NORM.DIST function to calculate probabilities for given x values and the NORM.INV function to find x values for given probabilities. The speaker illustrates this with examples, such as finding the probability of a value being less than zero in a standard normal distribution and determining the x value that corresponds to a 97.5% probability in a distribution with a mean of 0 and a standard deviation of 1. The importance of understanding these calculations for future topics is emphasized.

25:06

πŸ“˜ Conclusion and Encouragement for Tutorial Practice

In the final paragraph, the speaker thanks the audience for their patience and perseverance, expressing hope that they enjoyed the topic and will learn a lot from the tutorial and other exercises. The speaker encourages continued practice with Excel and the normal distribution, setting the stage for future topics where these concepts will be applied.

Mindmap

Keywords

πŸ’‘Standardizing

Standardizing refers to the process of transforming data into a common format to make it comparable. In the context of the video, it is the initial step taken to prepare data for analysis. The script mentions that before any analysis is done, data should be standardized to ensure it is in a comparable form.

πŸ’‘Descriptive Statistics

Descriptive statistics are numerical measures that summarize and describe the characteristics of a data set, such as the mean, median, mode, standard deviation, and range. The video script explains how Excel can be used to calculate these measures in one step, providing a quick overview of the data's central tendency and dispersion.

πŸ’‘Quantitative Data

Quantitative data represents values that can be measured numerically, such as income or age. The video script discusses measures used with quantitative data to summarize its characteristics, emphasizing the importance of understanding where the data is centered and how spread out it is.

πŸ’‘Frequency Distribution

A frequency distribution is a table that shows the number of data points that fall within specified ranges or categories. The script describes how to set up income ranges and use Excel to determine how many people fall into each range, providing a detailed breakdown of the data.

πŸ’‘Histogram

A histogram is a graphical representation of the distribution of data, with bars representing the frequency of data points within certain ranges. The video script explains how to create a histogram in Excel, emphasizing the importance of presentation to make the data clear and visually appealing.

πŸ’‘Visualization

Visualization refers to the use of graphical representations to communicate information effectively. The video script stresses the importance of visualization in data analysis, noting that a well-presented graph can be more memorable and informative than raw numbers.

πŸ’‘Box and Whisker Plot

A box and whisker plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The script introduces this plot as a useful visualization tool that can effectively communicate the spread and skewness of a data set.

πŸ’‘Quartiles

Quartiles divide a data set into four equal parts, with the first quartile (Q1) representing the 25th percentile, and the third quartile (Q3) representing the 75th percentile. The video script discusses how quartiles are used in box and whisker plots to summarize the distribution of data.

πŸ’‘Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is characterized by its bell-shaped curve. The script explains the characteristics of the normal distribution, such as its symmetry and the likelihood of values occurring close to the mean, and how it is commonly found in natural and financial phenomena.

πŸ’‘Probability Distribution

A probability distribution is a statistical description that shows the likelihood of different possible outcomes within a data set. The script describes how the table of percentages for income ranges can be considered a probability distribution, where each category represents the likelihood of a randomly chosen individual falling into that income bracket.

Highlights

The importance of watching the first video for topic two before the second for better understanding.

Excel can generate descriptive statistics like mean, median, mode, standard deviation, variance, range, minimum, and maximum in one step.

Some statistical measures like quartiles and interquartile range are missing in Excel's one-step descriptive statistics.

Visualizations are as important as numbers for summarizing and communicating data effectively.

Creating a frequency distribution table to understand income ranges and the number of people in each range.

Transforming raw numbers into a histogram for better visualization of data distribution.

The necessity of refining initial visualizations for clarity and accuracy in communication.

Using percentages to provide a clearer picture of income distribution among people.

The introduction of box and whisker plots for visual representation of quartiles and median.

Interpreting box and whisker plots to understand the distribution of student scores in a class.

The impact of choosing the right bin ranges in histograms to effectively communicate information.

Calculating probabilities using the normal distribution and Excel's NORM.DIST function.

Understanding the concept of standard normal distribution with a mean of zero and a standard deviation of one.

Using Excel to find the x value that corresponds to a given probability in a normal distribution.

The significance of probability distributions in representing the likelihood of different outcomes.

The characteristics of a probability distribution: mutually exclusive and exhaustive.

The prevalence of the bell-shaped normal distribution in various aspects of life and its importance in statistics.

Transcripts

play00:03

hello everybody

play00:05

topic two second video

play00:07

make sure you make sure you watch the

play00:09

first video for topic two

play00:11

uh before you

play00:13

watch this one otherwise things won't

play00:15

make quite so much sense to you

play00:17

we're going to finish off the rest of

play00:19

the second topic

play00:21

now uh hopefully i will go a little bit

play00:23

quicker apologies for the last video

play00:25

being a bit too long for you but

play00:26

hopefully this one will be zooming along

play00:28

a bit faster

play00:30

we spent most of the last video after we

play00:33

talked about standardizing getting the

play00:35

data in a comparable form before you do

play00:37

anything we talk most of the time about

play00:39

measures which are used with

play00:41

quantitative data to sort of summarize

play00:43

the characteristics of the data where is

play00:44

it centered how spread out it is and

play00:47

what kind of shape it takes

play00:48

as it turns out we can get uh

play00:52

almost all of the measures we're

play00:53

interested in with one step in excel and

play00:55

i'll show you the steps for how to

play00:56

produce this descriptive statistics

play00:58

thing uh later but essentially you can

play01:00

get the mean and the median and the mode

play01:02

and the standard deviation the variance

play01:03

all those things that we talked about

play01:05

the range the minimum max all those

play01:06

things we talked about

play01:08

uh in one go and that's kind of

play01:10

convenient and i'll show you in a little

play01:12

while how to do that there's a few that

play01:13

are missing like the quartiles in the

play01:15

interquartile range and there's a few

play01:17

that we haven't taught you which you can

play01:20

worry don't worry about until future

play01:22

subjects if you learn some more

play01:25

so that's just a little reminder or a

play01:27

footnote for you

play01:28

all of these things though as the others

play01:30

are that i introduced in the previous

play01:32

video are

play01:33

numbers that summarize things and

play01:35

numbers are pretty useful way of

play01:36

summarizing things if you want to say

play01:37

what's the average income of people in

play01:39

this area thirty thousand dollars per

play01:40

year is a pretty useful number to give

play01:42

people uh it's for example what are the

play01:44

range of incomes

play01:46

dollars zero dollars up to six hundred

play01:48

one thousand so numbers are valuable but

play01:49

often visualizations are also good

play01:52

so for example we might want to produce

play01:54

some kind of uh

play01:56

distribution some picture that gives us

play01:58

an idea about the distribution of the

play02:00

data so to do that

play02:02

first of all we might start with that

play02:04

with a table of numbers so instead of

play02:06

just giving us for example in this case

play02:07

here we've got uh information about

play02:10

incomes and we've got it we know where

play02:12

the data is centered we know how spread

play02:13

out it is according to the standard

play02:15

deviation of 35 000 per year etc but

play02:18

what we'd like perhaps is is sort of a a

play02:22

bit more detail about the sort of

play02:24

categories and the ranges so we set it

play02:26

up in what's called a frequency

play02:27

distribution so we define some income

play02:29

ranges

play02:30

naught to ten thousand ten to twenty

play02:32

thousand and so on and then we ask excel

play02:34

to tell us how many people earn

play02:37

an income in this particular range so

play02:40

1509 people earned less than ten

play02:43

thousand dollars 939

play02:46

people earned between 10 and 20 000 and

play02:49

so on okay

play02:51

and 162 people earned more than a

play02:53

hundred thousand dollars that's

play02:54

basically what this table gives us

play02:57

of course this is a pretty ugly

play02:59

distribution if you wanted to put this

play03:01

in a nice report you'd give it some

play03:03

better labels than i've given there and

play03:05

put dollar signs on the

play03:06

bins and so on but

play03:09

this is how it's dumped to us when excel

play03:11

gives it to us it's a starting point of

play03:12

the raw numbers

play03:14

but better still let's show those

play03:15

numbers in some kind of graph for

play03:17

histogram

play03:18

and so i'll show you in a couple of

play03:20

minutes how i produced this table in

play03:21

this graph but for now

play03:24

i'll just show you the output of it i've

play03:25

created a histogram and i've played

play03:28

around with the presentation of the

play03:30

histogram to make it look nice i've

play03:31

given it nice headings and labels and

play03:34

i've also closed up the gaps between the

play03:36

bars for reasons

play03:38

but i'll explain in a moment

play03:40

important stuff that you have to do in

play03:42

order for this final product here to

play03:44

make sense visualization

play03:47

is about communication

play03:49

you might think as a data analytics nerd

play03:51

that

play03:52

how it looks is not that important it's

play03:54

the substance well it couldn't be

play03:56

further from the truth

play03:58

how things look

play04:00

is extremely important because firstly

play04:03

people will remember it if it's

play04:05

if it's clear secondly you can make it

play04:07

look

play04:08

in such a way that it's misleading to

play04:10

people you can present things in a

play04:12

certain way that's not giving a true

play04:13

picture

play04:14

or you can do the opposite you can

play04:16

present it in a way that gives a clear

play04:18

picture about what's going on people

play04:20

will remember the message of a picture

play04:22

far better than they'll remember a whole

play04:24

bunch of numbers

play04:25

so we spend quite a lot of time getting

play04:27

this histogram right not just dumping

play04:29

the first histogram that comes out of

play04:31

exception so i really want to emphasize

play04:32

that

play04:33

that if you're ever asked to produce any

play04:35

kind of visualizations

play04:38

the first visualization you get out of

play04:40

your software is going to be pretty

play04:42

might have the right numbers in it but

play04:43

it's going to be pretty ugly and you're

play04:45

going to need to spend a good amount of

play04:46

time

play04:47

going through all the other steps of

play04:49

making it look right before you finish

play04:51

the task

play04:52

and that might be 80 of the time making

play04:54

it look right 20 of the time getting the

play04:56

first version so i can't underestimate

play04:59

that

play04:59

so you see here that we've got most of

play05:02

the people earning less than ten

play05:03

thousand dollars not actually most of

play05:05

the people but the biggest most common

play05:07

category here is that and then as you

play05:09

get further and further up in the income

play05:10

range less and less people are in each

play05:12

of those ranges and then there's a bit

play05:14

of a blip at the end of people who are

play05:16

into more than a hundred thousand

play05:17

dollars okay so now i've got a bit more

play05:19

of a picture about the sort of

play05:20

distribution of incomes it's quite a

play05:22

different picture to what i learned from

play05:23

just looking at the average the mean i

play05:25

think was thirty seven thousand dollars

play05:26

somewhere in around about here so sure

play05:28

the average income is thirty seven

play05:30

thousand but a whole bunch of people are

play05:31

earning less than that including some

play05:33

people that are earning less than ten

play05:34

thousand so not very much at all this is

play05:36

not uh australian data by the way this

play05:38

is from from the us

play05:40

unfortunately in certain parts of the us

play05:43

there's a large number of very poor

play05:44

people who are in very little

play05:47

one of the

play05:48

richest developing countries in the

play05:50

world

play05:52

strange thing to say but think of that

play05:54

now sometimes it's good for us to not

play05:57

just give the numbers but actually to

play05:59

give them percentage terms so i've asked

play06:01

excel to produce the

play06:03

cumulative percentages and then i've

play06:05

turned those into actual percentages by

play06:07

subtracting the difference you'll get a

play06:08

chance to play with this sort of thing

play06:10

in your tutorial work and so now instead

play06:13

of saying there are 1509 people who are

play06:15

in this range i can say

play06:18

nearly percent of people earn less than

play06:20

ten thousand dollars you see that's

play06:22

pretty neat or

play06:24

just over three percent of people earned

play06:25

more than a hundred thousand dollars

play06:28

or five point six six percent of people

play06:30

earned between fifty and sixty thousand

play06:32

dollars that's perhaps not as

play06:34

interesting or i can look at the

play06:35

cumulative numbers and i can

play06:38

for example say

play06:39

okay so let me just highlight what i'm

play06:41

looking at here

play06:43

uh

play06:44

so i've i've talked about this group

play06:46

here 29 and less than 10 000 3

play06:48

earning more than 100 000 or i might go

play06:51

over here and say 91 percent of people

play06:54

earn less than 70 000

play06:56

okay so all of these numbers are quite

play06:58

useful they're giving us a bit of a

play06:59

picture about how the distribution

play07:02

actually looks and then i might want to

play07:04

also perhaps

play07:05

change my histogram that i had on the

play07:07

previous thing and instead of the

play07:09

bars up here being number of people they

play07:12

might be percentages based on that

play07:14

column there okay so there's a few ways

play07:15

in which i'm buried to make it better

play07:17

and some other suggestions there making

play07:19

sure your boundaries are right and so on

play07:21

so i won't go through that in detail

play07:22

i'll let you read that and absorb it but

play07:25

i think i've got the message across it

play07:27

really has to look good if it doesn't

play07:29

look good then it's bad okay and it's

play07:31

not acceptable

play07:34

um

play07:36

one other graph of visualization i'm

play07:37

going to

play07:39

introduce you to is the box and whisker

play07:41

plot and we'll have a look at it and how

play07:42

we present that in a moment and the

play07:44

reason i like the box and whisker plot

play07:46

is because it works off those quartiles

play07:49

the median the mean the q1 and the q3

play07:52

that we talked about in the previous

play07:54

class

play07:55

remember in the previous

play07:57

video i introduced you to some results

play07:59

from a first-year stats

play08:01

unit

play08:02

one of

play08:04

your units that you're studying from a

play08:05

previous year and here we had the

play08:07

summary data for that particular unit so

play08:09

63.8 was the mean the median was 65 so

play08:13

half the students got less than 65 half

play08:15

the students got more

play08:17

the standard deviation was 15.3 so that

play08:19

was so roughly speaking the average

play08:21

variation from the mean

play08:23

the first quartile was 58 core of the

play08:26

students got less than 58.

play08:28

a quarter of the students got more than

play08:30

74

play08:31

so the middle 50 percent got somewhere

play08:33

between 58 and 74 a interquartile range

play08:36

of 16

play08:38

that's pretty neat way of describing it

play08:41

but even better if i could describe that

play08:42

with something visual so let me show you

play08:44

how i produce

play08:46

okay so there's the data that i've just

play08:47

been showing you there the ranges i'm

play08:49

going to produce something called a box

play08:50

and whisker plot which is what this

play08:52

thing here looks like excel gives you

play08:54

one of these and i and uh

play08:57

i explained briefly in the notes there

play08:59

how you

play09:00

find it under the menu um

play09:09

insert the excel

play09:11

charts and uh look under the histogram

play09:13

button you'll see box and whisker plots

play09:15

okay

play09:19

now the box and whisker basically this

play09:21

solid blue bit is the the middle 50 so

play09:24

q1 the first quartile

play09:27

is

play09:28

58

play09:29

up to 74 that's q1 to q3 and 65 is your

play09:33

median so those three numbers there are

play09:35

the bottom 25 percent cut off the 50 cut

play09:39

off in the 75 cup so that's pretty

play09:41

useful that's where

play09:43

if i look at my class half my students

play09:46

get a score in that range there okay

play09:48

i've got a few really top students who

play09:50

score a better than that and my

play09:53

top number is 91. now that's not the

play09:56

very top mark

play09:58

uh

play09:59

in every case in this case it is

play10:01

actually the very top mark

play10:02

the box and whisker plot produced by

play10:04

excel

play10:06

make some decisions using some

play10:07

algorithms which we won't go to the

play10:09

details of as to whether it's shooting

play10:12

that the the

play10:13

range from 91 to the top sort of

play10:17

line to the bottom line should be from

play10:20

their maximum to the minimum or whether

play10:22

it should treat some of the numbers as

play10:23

what's referred to as outliers now this

play10:26

is an interesting example

play10:28

it didn't find any outliers at the top

play10:30

but there's no totally weird student who

play10:32

got 98 percent here the top mark was 91.

play10:36

at the other end according to excel's

play10:38

algorithm it said look

play10:40

a reasonable range of the data is 34 but

play10:43

there's actually sort of about eight

play10:44

students who got really low marks who

play10:47

sort of don't count shouldn't be treated

play10:49

in the data they're outlines

play10:51

now that's excel's judgment and we don't

play10:52

get any control over that because this

play10:54

is a fairly simple piece of statistical

play10:56

software but if when you learn

play10:58

programming uh in r and future units

play11:00

you'll be able to produce much better

play11:02

box and whisker plots than this but

play11:04

basically this is telling us

play11:06

half the data got half the students got

play11:08

between 58 and 74. the top mark was 91.

play11:12

so

play11:13

the range of almost all the students was

play11:15

between 91 and 34

play11:17

but there was a very small number

play11:19

eight or so students who got actually

play11:22

even further below that so but out of

play11:24

the 600 or so students in this class

play11:25

that's a very small sample so it gets

play11:27

wrapped down the bottom

play11:28

and so that's how i interpret that box

play11:30

and whisper plot and i think that in one

play11:32

graph there i've got quite a lot of

play11:33

information if i had my choice i

play11:35

probably wouldn't do those little little

play11:36

dots down the bottom uh i'd make my plot

play11:39

look a bit neater but it's good enough

play11:41

why is it called a box and whisker well

play11:42

that's the box and these little lines

play11:45

the 91 and the 34 they're the whiskers

play11:47

you know if you're someone with a

play11:48

moustache and little whiskers on the end

play11:50

that's what that is referred to if you

play11:51

try to figure out the jargon

play11:55

all right so

play11:57

uh

play11:58

there's one okay i might want to produce

play12:00

a histogram of this data but actually if

play12:02

i do a histogram

play12:05

it's actually not a very good one let me

play12:06

let me do it quickly now to give you an

play12:08

idea about what i mean so i'm going to

play12:10

go to the data menu

play12:13

and i go to data analysis so in the data

play12:16

menu up here i go to something called

play12:17

data analysis and hopefully you've got

play12:19

that and i ask for it if you don't then

play12:21

you need to do some add-ins

play12:23

for analysis full pack which you can

play12:25

learn about in your tutorial

play12:27

uh then i go to histogram and i say okay

play12:30

and i'm going to send the output to some

play12:32

new worksheet just so that i get mess

play12:33

and i'm going to make sure i ask it for

play12:35

a chart

play12:36

let's see what comes about what's i

play12:37

better tell it where my data is there it

play12:39

is there so it's uh row

play12:42

eight two

play12:47

a five eight six i just happen to know

play12:50

that from memory

play12:52

okay it's calculating the histogram

play12:55

oops

play12:56

here it is okay

play13:00

what an absolutely ugly histogram

play13:03

okay

play13:04

it's terrible

play13:06

that's exactly what you'll get if you

play13:07

just ask excel to give you a histogram

play13:09

what a load of rubbish there's so many

play13:11

things wrong with that histogram

play13:15

so i really don't like it at all okay

play13:18

and you can see why it's pretty obvious

play13:20

okay the data ranges are all stupid um

play13:24

yeah we don't even need to waste our

play13:26

time talking about why that's bad we

play13:28

need to make a better job of it than

play13:29

that and so read the notes there and

play13:31

figure out how to do it and in

play13:32

particular in this particular case i

play13:35

actually

play13:37

decided that well first of all the first

play13:39

thing i need to do is actually

play13:41

i need to specify the ranges of the bins

play13:43

so when i did my histogram i was

play13:45

actually i had a choice then i chose not

play13:47

to take it if i go to data analysis

play13:50

histogram again instead of just

play13:52

leaving that bin range if i leave the

play13:54

bin range blank

play13:55

it will just choose ranges for me and it

play13:58

chose stupid ones so instead of that

play14:00

maybe a simple improvement i can make is

play14:01

let's let's use these numbers here

play14:04

and use those ranges and see what

play14:06

it looks like well now it's looking a

play14:08

lot better isn't it that's much nicer

play14:10

looking histogram i've sort of got and

play14:12

in fact uh

play14:14

for reasons that i'll get to in a moment

play14:15

i've chosen these ranges very carefully

play14:17

okay so that's an immediate improvement

play14:19

i can make and then there's still more

play14:20

that i need to be able to do to that

play14:22

but actually the choice of this range

play14:24

i'll go back to the other

play14:26

sheet here this is

play14:28

the final histogram i produced after

play14:30

mucking around a bit further

play14:32

and i've done a few things there

play14:34

first of all i chose this particular

play14:36

range of bins for a very good reason

play14:38

namely this is the grades that you

play14:40

achieve in this unit if you get 49 or

play14:42

less you get an n if you get 59 or less

play14:45

you get a p

play14:46

and 69 is a credit 79 or less is a

play14:49

distinction and above 79 is a higher

play14:51

stitch okay

play14:52

so i chose to do my histogram not with

play14:56

equal space bars or anything but

play14:58

actually to communicate what is most

play15:00

interesting and relevant to students

play15:02

namely what grade are you going to get

play15:03

here or what's the distribution of the

play15:05

grades

play15:07

okay so that's what i ended up doing

play15:09

there

play15:11

in order to produce histogram this

play15:12

histogram i changed the the default is

play15:15

the bin ranges would have come out as

play15:17

49.59 but if i just change the type

play15:20

over those with np etc then i can end up

play15:23

with much nicer labels in my histogram

play15:25

likewise the frequencies that were here

play15:28

were

play15:28

the original data

play15:31

the account of how many people are in

play15:32

each case each category like over here

play15:35

you'll see here here we are you see the

play15:37

bin the values are 89

play15:39

but i don't want that i want the

play15:41

percentage of people who got or the

play15:42

proportion of people who got each

play15:44

value well that's fine i just took the

play15:47

89 and i divided by however many data

play15:49

points i had 583 and the next one 72 and

play15:52

i divided by 583 so i can change these

play15:54

numbers once excel's produced them for

play15:56

me and get a nicer looking histogram so

play15:58

on this axis i've now got the proportion

play16:00

of students who got that score so you

play16:02

see here 34 35 34 of students got a

play16:06

credit

play16:06

etc uh 13 because 12 of students got a

play16:10

high distinction 15 of students got to

play16:12

fail etc okay so i worked quite hard 80

play16:16

of my time was spent making

play16:19

that ugly looking histogram there into

play16:22

that nice looking histogram there okay

play16:25

and importantly i chose to do it in a

play16:28

for the purpose okay

play16:30

what's the question i'm trying to answer

play16:32

and how does my

play16:33

visualization best answer that question

play16:36

and what communicate that information

play16:38

and in this example that specifically

play16:41

had to do with choosing the categories

play16:42

so that they corresponded to the grades

play16:44

that students achieve in the unit but

play16:46

obviously in other contexts it's going

play16:47

to be a different set of issues you have

play16:49

to address to get the best possible

play16:51

visualization

play16:53

okay

play16:54

now

play16:55

switch gears that's all been practical

play16:57

how to show stuff in data and

play16:59

particularly today how to visualize i'm

play17:02

going to take us now to a little bit of

play17:03

what you might describe as theory and

play17:05

the reason i'm going to do that is

play17:06

because we want to have a deeper

play17:08

understanding of the underlying

play17:10

principles behind all of these methods

play17:11

that we use and the deeper understanding

play17:14

is with the idea that we have random

play17:16

data and the random data is distributed

play17:18

according to probability distributions

play17:20

so you did a little taste of probability

play17:22

in topic one we're going to do a little

play17:24

bit more on probability here in topic

play17:25

two and more in later weeks just so that

play17:28

you get more comfortable with thinking

play17:29

about random data and about probability

play17:32

associated with outcomes

play17:35

we can think about the the table of

play17:37

numbers that we've produced the

play17:39

percentages of people in each

play17:41

income range for example as a

play17:42

probability the probability if i chose

play17:45

someone randomly this is the same table

play17:47

that i had earlier in the present in the

play17:49

in the topic but now i've just turned

play17:51

these numbers here instead of how many

play17:53

people earn between north and ten

play17:54

thousand dollars i've turned it into a

play17:56

probability a proportion

play17:58

now i can say if i chose someone

play18:00

randomly

play18:01

from this suburb what's the chances that

play18:04

they'll earn between north and ten

play18:05

thousand dollars answer 0.29 600. that's

play18:08

a probability that's what we mean by

play18:09

probability the probability of something

play18:11

happening

play18:12

and you'll see the probability that

play18:13

they'll learn more than a hundred

play18:14

thousand is far less okay so think about

play18:16

these numbers not just as however many

play18:19

people are in that code agreement think

play18:20

about them as probabilities

play18:21

notice there's a couple of

play18:22

characteristics of this thing here that

play18:25

make it a probability

play18:26

distribution they're mutually exclusive

play18:29

you can't earn between 30 and 40 and

play18:32

between 40 and 50. everybody is there

play18:35

only once

play18:37

so you can't be in more than one cabin

play18:39

and secondly

play18:41

it's exhaustive everybody's there if i

play18:43

add these probabilities up i get one

play18:46

somebody has to earn

play18:48

something between naught and a million

play18:50

dollars or an infinite bolts okay

play18:53

so it's mutually exclusive and it's

play18:55

exhaustive and that's essential for this

play18:57

to be classified as a probability

play18:58

distribution

play19:01

with some data

play19:03

you can present the probability

play19:04

distribution in a table because there's

play19:06

only a limited number of values it can

play19:08

take in this case we've categorized it

play19:10

so there's only like 10 or 12 categories

play19:11

here

play19:12

10 or 11 categories other data you might

play19:15

have it in the form of that of a

play19:16

probability density function which is

play19:18

like a smooth curve and we're going to

play19:19

see an example of that now to finish off

play19:21

this topic and that's called the normal

play19:23

distribution

play19:24

what's the normal distribution about

play19:26

well you've probably come across this

play19:27

before in high school

play19:28

and it's a sort of a bell-shaped

play19:30

distribution that looks like this and

play19:32

the reason it's so common and so popular

play19:34

is because think about this as a range

play19:36

of possible values of some variable x

play19:39

could be how much income people earn or

play19:40

price of houses or

play19:42

you know how many

play19:43

children a person has

play19:45

that's probably not a good example

play19:46

because the range of that's pretty small

play19:49

but you know

play19:50

things that can vary a lot it's a whole

play19:52

possible set of values that that

play19:54

variable can take

play19:55

it's very common for

play19:58

the probabilities associated with

play19:59

different possible values of that

play20:01

variable to follow this bell shape the

play20:04

reason it's common is that the mean is

play20:06

in the middle

play20:07

and if

play20:09

values that are close to the mean

play20:12

are much more likely to occur you know

play20:14

the probability is given by the height

play20:15

of the bar just like a histogram

play20:18

the high histogram points are the ones

play20:20

that are more common so these are the

play20:22

range of values of x which are much more

play20:24

likely to occur and they're close to the

play20:25

mean

play20:26

the further you get from the mean the

play20:28

muscle much less likely it is that

play20:30

you'll get those kinds of values

play20:34

in either the negative direction or the

play20:36

positive direction and that's

play20:38

essentially what describes a bell-shaped

play20:40

curve

play20:42

in addition to that it's symmetric

play20:44

whether you're going out in the positive

play20:46

direction or the negative direction you

play20:48

get about the same sort of decrease in

play20:51

probabilities of it occurring

play20:53

and that's not true of every set of data

play20:55

in the world i've already given examples

play20:57

of skewed data like house prices and so

play20:59

on but actually a lot of data is pretty

play21:01

close to normally distributed

play21:02

it's got those characteristics have been

play21:04

symmetric most values close to the

play21:07

values close to the mean are much more

play21:08

likely to occur and as you get further

play21:09

and further from the mean

play21:12

so that's why we study it because in

play21:14

natural world and in financial markets

play21:17

and in

play21:18

business

play21:19

any economic development and economies

play21:23

data often ends up looking quite

play21:25

normally distributed

play21:28

okay uh this is a particular type of

play21:30

normal distribution it's it's it's

play21:32

what's called the standard normal

play21:33

because it has a mean of zero

play21:36

so the average is zero and it has a

play21:37

standard deviation of one

play21:39

which means that if as i go out a couple

play21:42

of standard deviations it becomes very

play21:43

unlikely you'll get values

play21:45

out here and when i get to three

play21:47

standard deviations above the mean i've

play21:49

got virtually no chance of that

play21:51

occurring and by four standard

play21:52

deviations it's infinitesimally small

play21:55

and likewise so these can be sort of

play21:57

thought about as like numbers of

play21:58

standard deviations above or below the

play22:00

mean

play22:02

let's take the heights of adults okay

play22:04

you know in a country okay there'll be

play22:06

an average height here for males of 1.8

play22:10

meters or something like that and then

play22:12

as you go

play22:13

one standard deviation the standard

play22:15

deviation of heights might be about five

play22:16

centimeters so as you go up one one

play22:19

standard deviation to 1.85 meters

play22:22

it's less likely you'll meet someone

play22:24

that tall it's even less likely you meet

play22:25

someone 1.9 meters tall 1.95 meters 2

play22:29

meters etc

play22:30

and likewise 1.75 meters is less likely

play22:34

than 1.8 1.7 meters is even less likely

play22:37

1.65 is even less likely 1.6 metres even

play22:40

less since we're talking particularly

play22:42

about male adults here okay so

play22:45

that's exactly the sort of a very

play22:47

commonplace distribution that's occurred

play22:50

now

play22:51

i can work out probabilities associated

play22:53

with normal distributions using excel

play22:56

and

play22:57

in normal distribution it's it's applied

play23:00

in cases where data can take all sorts

play23:01

of values so we don't normally talk

play23:03

about what's the chances of someone

play23:04

being exactly 1.83 centimeters tall we

play23:07

actually prefer to talk in ranges what's

play23:09

the chances of somewhere between being

play23:11

between 1.8 and 1.85

play23:13

or above 1.83 or something so those are

play23:16

the kind of probabilities we can work

play23:18

out and we do that by calculating the

play23:20

area under the curve of this nice

play23:22

bell-shaped thing so the probability of

play23:25

of someone uh

play23:27

of a height more than two standard

play23:29

deviations above the mean would be this

play23:31

shaded area here

play23:33

and so we can calculate that probability

play23:35

or the probability of for a distribution

play23:38

which has got a mean of 10 and a

play23:40

standard deviation of 3 the probability

play23:42

of someone

play23:43

having a value of less than 5 is this

play23:45

shaded area here ok

play23:49

how do i calculate those probabilities

play23:52

well if i want

play23:53

to work out the probabilities i just

play23:55

have to use a function in excel called

play23:56

norm.dist so if i've got a normal and

play23:59

the idea is this is let me just show you

play24:01

what these different components of it is

play24:03

that you'll get to do this in your

play24:05

tutorial work so i won't go into the

play24:06

detail here that's the mean so i've got

play24:09

a normal distribution with a mean of

play24:10

zero

play24:12

that's the standard deviation it's got

play24:14

us

play24:16

ah sorry no i'm wrong that's the normal

play24:18

that's the probability for which i want

play24:21

to calculate the x value i want to

play24:23

calculate for so i want to calculate the

play24:25

chances of x being less than zero

play24:27

okay

play24:28

and i've got

play24:29

a normal distribution with a mean of

play24:31

zero so that's the mean

play24:33

and that's the standard deviation

play24:35

and true just means i want the

play24:37

probability of being less than it

play24:39

so that's actually precisely

play24:42

excuse my terrible normal curve that's

play24:45

the standard normal and i want to work

play24:46

out the probability being less than zero

play24:48

so i'm actually working out that

play24:49

probably there

play24:51

surprise surprise it's a half

play24:53

remember it's symmetric the probability

play24:56

of being less than naught

play24:58

is exactly equal to the probability of

play25:00

being greater than not 50 or 0.5

play25:03

or i could take my normal distribution

play25:05

with a mean of 0 and a standard

play25:07

deviation of 1 and i could look at say

play25:10

what's the chances of it being less than

play25:11

some number up here like 1.96

play25:14

excuse my i can't write very well with

play25:16

my opinion but you know that's 1.96 then

play25:19

that's why you're going to see

play25:21

well that's a

play25:23

quite a large probability because it's

play25:25

all of this area here from way so where

play25:27

there by now it works out to be about

play25:29

97.5

play25:31

chance

play25:33

okay

play25:34

that's what we can do in excel or we can

play25:36

do the reverse we can say if i give you

play25:38

the probability can you tell me what the

play25:40

x value is

play25:42

so i say

play25:43

i've got a standard normal a normal with

play25:45

a mean of 0 and a standard deviation of

play25:47

1. and i want to know what's the x value

play25:48

that gives me a probability of 0.5 of

play25:51

being below it answer well that's the

play25:54

reverse of the question we just asked

play25:55

before zero half the values

play25:58

are less than zero or if i've got a

play26:01

normal distribution

play26:02

uh with a mean of zero and a standard

play26:04

deviation of one and i want the

play26:05

probability to be 0.975 what's the x

play26:09

value answer 1.96 okay so i'm just doing

play26:12

the same thing as before but in reverse

play26:14

okay so that's how we do those things in

play26:16

excel for now park that practice it a

play26:18

little bit and we'll get to make use of

play26:20

it in future

play26:22

topics when we actually start applying

play26:23

the non-distribution

play26:25

okay

play26:26

thank you for your patience and

play26:27

persevering with that topic i hope you

play26:28

enjoyed it and i hope you can learn lots

play26:30

as you work through the tutorial and

play26:32

other questions

Rate This
β˜…
β˜…
β˜…
β˜…
β˜…

5.0 / 5 (0 votes)

Related Tags
Data AnalysisExcel TutorialHistogramNormal DistributionDescriptive StatisticsVisualizationProbability TheoryEducational ContentIncome DistributionStatistical MeasuresBox Plot