ETC1000 Topic 2b
Summary
TLDR: This instructional video script delves into the nuances of data analysis, emphasizing the importance of understanding and visualizing data effectively. It covers the generation of descriptive statistics in Excel, the creation of frequency distributions and histograms, and the use of box and whisker plots to illustrate data spread and central tendency. The presenter also discusses the significance of probability distributions, particularly the normal distribution, and how to calculate probabilities and identify corresponding data points using Excel functions. The script underscores the necessity of clear and accurate data visualization for effective communication.
Takeaways
- Watch the first video for Topic Two before this one for better understanding of the material.
- Excel can generate various descriptive statistics like mean, median, mode, standard deviation, and range with a single step.
- Visualizations like histograms and box plots are crucial for effectively communicating data distribution and should be clear and well-presented.
- Numbers alone may not be enough; visualizations help provide a clearer picture of data distribution and are more memorable.
- Creating a frequency distribution table helps in understanding the categorization and range of data, such as income levels.
- Customizing histogram appearance, including labels and bin ranges, is essential for accurate and effective data representation.
- Box and whisker plots are useful for visualizing quartiles and understanding the spread and central tendency of data.
- The choice of bin ranges in histograms can significantly affect the interpretation of data, as demonstrated with grade distributions.
- Probability distributions, including the normal distribution, are foundational concepts for understanding how random data is distributed.
- The normal distribution is common in various fields because it represents data that is symmetrically distributed around the mean with probabilities decreasing as values move away from the mean.
- Excel's NORM.DIST function calculates probabilities for a normal distribution, and NORM.INV finds the x value for a given probability.
Q & A
What is the main focus of the second video in the series?
-The main focus of the second video is to complete the discussion on the second topic, which involves finishing off the rest of the material after introducing the concept of standardizing data and summarizing its characteristics using measures with quantitative data.
Why is it important to watch the first video before the second one?
-It is important to watch the first video before the second one because the second video continues from where the first left off, and concepts introduced in the first video are built upon in the second, ensuring that the content makes sense and is understood in the correct sequence.
What are some of the descriptive statistics measures discussed in the video?
-The descriptive statistics measures discussed in the video include mean, median, mode, standard deviation, variance, range, minimum, and maximum.
How can one obtain all the mentioned descriptive statistics in Excel with a single step?
-One can obtain all the mentioned descriptive statistics in Excel with a single step by using the 'Descriptive Statistics' tool, which provides all these measures at once, although it omits some measures, such as the quartiles and the interquartile range.
What is a frequency distribution and why is it useful?
-A frequency distribution is a table that shows the number of data points that fall within certain ranges or categories. It is useful because it provides a detailed view of the data, showing how many observations fall into each defined range, which helps in understanding the distribution of the data.
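As a rough illustration of how such a table is tallied, here is a minimal Python sketch. The function name `frequency_table`, the bin edges, and the sample incomes are all made up for this example; the video itself uses Excel. The sketch assumes each value is counted in the first bin whose upper bound it does not exceed, with one final bin for anything above the last bound.

```python
from bisect import bisect_left


def frequency_table(values, upper_bounds):
    """Count how many values fall in each bin.

    upper_bounds must be sorted ascending. Bin i holds values that are
    greater than the previous bound and less than or equal to
    upper_bounds[i]; one extra bin at the end collects everything above
    the last bound.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        # bisect_left returns the index of the first bound >= value
        counts[bisect_left(upper_bounds, value)] += 1
    return counts


# Hypothetical incomes and bin edges, for illustration only.
incomes = [5000, 10000, 12000, 25000, 150000]
bounds = [10000, 20000, 30000]
counts = frequency_table(incomes, bounds)

for upper, count in zip(bounds + ["more"], counts):
    print(upper, count)
```

Choosing the bounds is the analyst's job, which is the point the video makes about bin ranges.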
How does the video script emphasize the importance of visualization in data analysis?
-The script emphasizes the importance of visualization by stating that visualizations can be more memorable and clearer than numbers alone. It also mentions that the presentation of data through graphs can be misleading if not done properly, highlighting the need for careful and accurate visualization to communicate the data effectively.
What is a histogram and how does it help in understanding data distribution?
-A histogram is a graphical representation of the distribution of data. It groups data into intervals, or 'bins', and shows the frequency of observations within each bin. It helps in understanding the data distribution by visually showing the concentration of data points, the spread, and any patterns or outliers.
What is a box and whisker plot and how does it represent data?
-A box and whisker plot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It provides a clear visual representation of the spread and skewness of the data, including potential outliers.
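The five-number summary behind a box and whisker plot can be sketched in a few lines of Python. This is only an illustration: the function name and the sample marks are hypothetical, and quartile conventions differ between tools, so the values here may not match what a given version of Excel reports for the same data.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


# Hypothetical marks, for illustration only.
marks = [34, 58, 60, 65, 70, 74, 91]
low, q1, median, q3, high = five_number_summary(marks)
iqr = q3 - q1  # interquartile range: the width of the box

print(low, q1, median, q3, high, iqr)
```

The box spans Q1 to Q3, the line inside it is the median, and the whiskers reach toward the minimum and maximum unless some points are treated as outliers.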
Why is the normal distribution important in statistics and what does it represent?
-The normal distribution is important in statistics because it is a symmetrical bell-shaped curve that represents the distribution of many real-world phenomena. It is widely used in statistical inference and hypothesis testing due to its properties and the central limit theorem, which states that the sum (or mean) of a large number of independent and identically distributed variables with finite variance will be approximately normally distributed regardless of the original distribution.
How can Excel be used to calculate probabilities associated with a normal distribution?
-Excel can be used to calculate probabilities associated with a normal distribution using the NORM.DIST function, which calculates the cumulative probability for a specified value in a normal distribution with a given mean and standard deviation. The companion NORM.INV function does the reverse: it finds the value (x) that corresponds to a given cumulative probability.
What is the significance of standard deviation in the context of the normal distribution?
-In the context of the normal distribution, the standard deviation indicates the spread of the data around the mean. A smaller standard deviation means the data points are closer to the mean, while a larger standard deviation indicates a greater spread. It is also used to define the ranges within which a certain percentage of data falls, such as within one or two standard deviations from the mean.
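To put numbers on "within one or two standard deviations", here is a small Python sketch using the standard library's `statistics.NormalDist`. The helper name `prob_within` is made up for this example; it is not something from the video.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The result does not depend on the particular mean and standard deviation chosen, only on k, which is why the standard normal distribution is enough to describe every normal distribution.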
Outlines
Descriptive Statistics and Data Visualization
This paragraph introduces the importance of watching the first video on topic two before proceeding and promises a faster pace for the current video. It discusses the process of standardizing data and using Excel for descriptive statistics such as mean, median, mode, standard deviation, variance, range, minimum, and maximum. The speaker also touches on the limitations of Excel for certain measures like quartiles and interquartile range, suggesting that these will be covered in future lessons. The focus then shifts to the value of visualizations, such as histograms, for better communication and understanding of data distribution, emphasizing the need for clear and accurate presentation to avoid misleading interpretations.
Histograms and Box and Whisker Plots for Data Distribution
The speaker explains how to create a frequency distribution table and a histogram using Excel to visualize income data. They highlight the need for adjusting the histogram's appearance for better clarity and communication, including closing gaps between bars and adding labels. The paragraph also introduces the concept of cumulative percentages and percentages to provide a clearer picture of income distribution. Additionally, the speaker mentions the box and whisker plot, which uses quartiles to summarize data distribution, and emphasizes the importance of visual presentation in data analysis.
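The step from cumulative percentages to per-range percentages is just a running difference. A minimal Python sketch, with a hypothetical function name and made-up percentages rather than the income figures from the video:

```python
from itertools import accumulate


def shares_from_cumulative(cumulative):
    """Turn cumulative percentages into the percentage falling in each bin."""
    shares = []
    previous = 0.0
    for value in cumulative:
        shares.append(value - previous)  # this bin = cumulative so far minus the bins before it
        previous = value
    return shares


# Hypothetical cumulative percentages for four income ranges.
cumulative = [30.0, 55.0, 80.0, 100.0]
shares = shares_from_cumulative(cumulative)
print(shares)

# Going the other way, a running total rebuilds the cumulative column.
print(list(accumulate(shares)))
```

Either column can be plotted: the per-range shares give the histogram bars, while the cumulative column answers questions such as "what percentage earn less than a given amount".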
Box and Whisker Plots and Histogram Improvements
This paragraph delves deeper into the box and whisker plot, discussing its components like the median, quartiles, and outliers. The speaker uses an example of student grades to illustrate how the plot can convey a lot of information about data distribution in a single graph. They also address the limitations of Excel's automatic histogram generation, demonstrating how to improve the histogram by specifying bin ranges and adjusting the presentation to better communicate the distribution of grades, such as aligning bins with grade boundaries.
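Aligning bins with grade boundaries can be sketched as follows. The cut-offs (49, 59, 69, 79) come from the transcript; the labels `C`, `D` and `HD`, the function names, and the sample marks are assumptions made for this illustration, and whole-number marks are assumed.

```python
from bisect import bisect_left
from collections import Counter

# Upper bound of each grade band; anything above the last bound is the top grade.
UPPER_BOUNDS = [49, 59, 69, 79]
LABELS = ["N", "P", "C", "D", "HD"]


def grade(mark):
    """Map a whole-number mark to its grade band."""
    return LABELS[bisect_left(UPPER_BOUNDS, mark)]


def grade_proportions(marks):
    """Proportion of marks in each grade band, in grade order."""
    counts = Counter(grade(m) for m in marks)
    total = len(marks)
    return {label: counts[label] / total for label in LABELS}


# Hypothetical marks, for illustration only.
proportions = grade_proportions([40, 55, 65, 65])
print(proportions)
```

Dividing each count by the total is the same step described in the transcript, where a count of 89 is divided by the 583 data points to give a proportion of roughly 0.15.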
Probability Distributions and the Normal Distribution
The speaker transitions to theoretical concepts, starting with the idea of random data and probability distributions. They explain how to represent data as probabilities and the characteristics of a probability distribution, such as being mutually exclusive and exhaustive. The paragraph introduces the normal distribution, a bell-shaped curve that is common in various fields due to its symmetrical nature and the likelihood of values close to the mean. The speaker also discusses the standard normal distribution and how to calculate probabilities associated with it using Excel functions.
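Turning a frequency table into probabilities, and checking that the result behaves like a probability distribution, might look like this in Python. Both function names and the counts are hypothetical. Mutual exclusivity is assumed to be built into the table, since each observation is counted in exactly one category; the code checks only that the probabilities are non-negative and add up to one.

```python
import math


def to_probabilities(counts):
    """Convert category counts into proportions of the total."""
    total = sum(counts)
    return [count / total for count in counts]


def is_probability_distribution(probabilities):
    """Check that probabilities are non-negative and sum to one (exhaustive)."""
    return all(p >= 0 for p in probabilities) and math.isclose(sum(probabilities), 1.0)


# Hypothetical counts for three categories.
probabilities = to_probabilities([2, 1, 1])
print(probabilities, is_probability_distribution(probabilities))
```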
Understanding the Normal Distribution in Excel
The paragraph focuses on how to work with the normal distribution in Excel, using the NORM.DIST function to calculate probabilities for given x values and the NORM.INV function to find x values for given probabilities. The speaker illustrates this with examples, such as finding the probability of a value being less than zero in a standard normal distribution and determining the x value that corresponds to a 97.5% probability in a distribution with a mean of 0 and a standard deviation of 1. The importance of understanding these calculations for future topics is emphasized.
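For readers without Excel to hand, the same two calculations can be sketched with Python's standard library. This is an analogue rather than the method shown in the video: `NormalDist.cdf` plays the role of a cumulative NORM.DIST, and `NormalDist.inv_cdf` plays the role of NORM.INV.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of a value below 0 (the cumulative probability at x = 0).
p_below_zero = standard_normal.cdf(0)

# The x value with 97.5% of the probability below it.
x_at_975 = standard_normal.inv_cdf(0.975)

print(p_below_zero)        # 0.5, by symmetry around the mean
print(round(x_at_975, 2))  # 1.96
```

The two calls are inverses of each other: feeding the x value back into `cdf` returns the original probability.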
Conclusion and Encouragement for Tutorial Practice
In the final paragraph, the speaker thanks the audience for their patience and perseverance, expressing hope that they enjoyed the topic and will learn a lot from the tutorial and other exercises. The speaker encourages continued practice with Excel and the normal distribution, setting the stage for future topics where these concepts will be applied.
Keywords
Standardizing
Descriptive Statistics
Quantitative Data
Frequency Distribution
Histogram
Visualization
Box and Whisker Plot
Quartiles
Normal Distribution
Probability Distribution
Highlights
The importance of watching the first video for topic two before the second for better understanding.
Excel can generate descriptive statistics like mean, median, mode, standard deviation, variance, range, minimum, and maximum in one step.
Some statistical measures like quartiles and interquartile range are missing in Excel's one-step descriptive statistics.
Visualizations are as important as numbers for summarizing and communicating data effectively.
Creating a frequency distribution table to understand income ranges and the number of people in each range.
Transforming raw numbers into a histogram for better visualization of data distribution.
The necessity of refining initial visualizations for clarity and accuracy in communication.
Using percentages to provide a clearer picture of income distribution among people.
The introduction of box and whisker plots for visual representation of quartiles and median.
Interpreting box and whisker plots to understand the distribution of student scores in a class.
The impact of choosing the right bin ranges in histograms to effectively communicate information.
Calculating probabilities using the normal distribution and Excel's NORM.DIST function.
Understanding the concept of standard normal distribution with a mean of zero and a standard deviation of one.
Using Excel to find the x value that corresponds to a given probability in a normal distribution.
The significance of probability distributions in representing the likelihood of different outcomes.
The characteristics of a probability distribution: mutually exclusive and exhaustive.
The prevalence of the bell-shaped normal distribution in various aspects of life and its importance in statistics.
Transcripts
hello everybody
topic two second video
make sure you make sure you watch the
first video for topic two
uh before you
watch this one otherwise things won't
make quite so much sense to you
we're going to finish off the rest of
the second topic
now uh hopefully i will go a little bit
quicker apologies for the last video
being a bit too long for you but
hopefully this one will be zooming along
a bit faster
we spent most of the last video after we
talked about standardizing getting the
data in a comparable form before you do
anything we talk most of the time about
measures which are used with
quantitative data to sort of summarize
the characteristics of the data where is
it centered how spread out it is and
what kind of shape it takes
as it turns out we can get uh
almost all of the measures we're
interested in with one step in excel and
i'll show you the steps for how to
produce this descriptive statistics
thing uh later but essentially you can
get the mean and the median and the mode
and the standard deviation the variance
all those things that we talked about
the range the minimum max all those
things we talked about
uh in one go and that's kind of
convenient and i'll show you in a little
while how to do that there's a few that
are missing like the quartiles and the
interquartile range and there's a few
that we haven't taught you which you can
worry don't worry about until future
subjects if you learn some more
so that's just a little reminder or a
footnote for you
all of these things though as the others
are that i introduced in the previous
video are
numbers that summarize things and
numbers are pretty useful way of
summarizing things if you want to say
what's the average income of people in
this area thirty thousand dollars per
year is a pretty useful number to give
people uh it's for example what are the
range of incomes
dollars zero dollars up to six hundred
one thousand so numbers are valuable but
often visualizations are also good
so for example we might want to produce
some kind of uh
distribution some picture that gives us
an idea about the distribution of the
data so to do that
first of all we might start with that
with a table of numbers so instead of
just giving us for example in this case
here we've got uh information about
incomes and we've got it we know where
the data is centered we know how spread
out it is according to the standard
deviation of 35 000 per year etc but
what we'd like perhaps is is sort of a a
bit more detail about the sort of
categories and the ranges so we set it
up in what's called a frequency
distribution so we define some income
ranges
naught to ten thousand ten to twenty
thousand and so on and then we ask excel
to tell us how many people earn
an income in this particular range so
1509 people earned less than ten
thousand dollars 939
people earned between 10 and 20 000 and
so on okay
and 162 people earned more than a
hundred thousand dollars that's
basically what this table gives us
of course this is a pretty ugly
distribution if you wanted to put this
in a nice report you'd give it some
better labels than i've given there and
put dollar signs on the
bins and so on but
this is how it's dumped to us when excel
gives it to us it's a starting point of
the raw numbers
but better still let's show those
numbers in some kind of graph or
histogram
and so i'll show you in a couple of
minutes how i produced this table and
this graph but for now
i'll just show you the output of it i've
created a histogram and i've played
around with the presentation of the
histogram to make it look nice i've
given it nice headings and labels and
i've also closed up the gaps between the
bars for reasons
but i'll explain in a moment
important stuff that you have to do in
order for this final product here to
make sense visualization
is about communication
you might think as a data analytics nerd
that
how it looks is not that important it's
the substance well it couldn't be
further from the truth
how things look
is extremely important because firstly
people will remember it if it's
if it's clear secondly you can make it
look
in such a way that it's misleading to
people you can present things in a
certain way that's not giving a true
picture
or you can do the opposite you can
present it in a way that gives a clear
picture about what's going on people
will remember the message of a picture
far better than they'll remember a whole
bunch of numbers
so we spend quite a lot of time getting
this histogram right not just dumping
the first histogram that comes out of
excel so i really want to emphasize
that
that if you're ever asked to produce any
kind of visualizations
the first visualization you get out of
your software is going to be pretty
might have the right numbers in it but
it's going to be pretty ugly and you're
going to need to spend a good amount of
time
going through all the other steps of
making it look right before you finish
the task
and that might be 80 percent of the time making
it look right 20 percent of the time getting the
first version so i can't underestimate
that
so you see here that we've got most of
the people earning less than ten
thousand dollars not actually most of
the people but the biggest most common
category here is that and then as you
get further and further up in the income
range less and less people are in each
of those ranges and then there's a bit
of a blip at the end of people who are
earning more than a hundred thousand
dollars okay so now i've got a bit more
of a picture about the sort of
distribution of incomes it's quite a
different picture to what i learned from
just looking at the average the mean i
think was thirty seven thousand dollars
somewhere in around about here so sure
the average income is thirty seven
thousand but a whole bunch of people are
earning less than that including some
people that are earning less than ten
thousand so not very much at all this is
not uh australian data by the way this
is from from the us
unfortunately in certain parts of the us
there's a large number of very poor
people who earn very little
one of the
richest developing countries in the
world
strange thing to say but think of that
now sometimes it's good for us to not
just give the numbers but actually to
give them percentage terms so i've asked
excel to produce the
cumulative percentages and then i've
turned those into actual percentages by
subtracting the difference you'll get a
chance to play with this sort of thing
in your tutorial work and so now instead
of saying there are 1509 people who are
in this range i can say
nearly 30 percent of people earn less than
ten thousand dollars you see that's
pretty neat or
just over three percent of people earned
more than a hundred thousand dollars
or five point six six percent of people
earned between fifty and sixty thousand
dollars that's perhaps not as
interesting or i can look at the
cumulative numbers and i can
for example say
okay so let me just highlight what i'm
looking at here
uh
so i've i've talked about this group
here 29 percent earning less than 10 000 3 percent
earning more than 100 000 or i might go
over here and say 91 percent of people
earn less than 70 000
okay so all of these numbers are quite
useful they're giving us a bit of a
picture about how the distribution
actually looks and then i might want to
also perhaps
change my histogram that i had on the
previous thing and instead of the
bars up here being number of people they
might be percentages based on that
column there okay so there's a few ways
in which i might vary it to make it better
and some other suggestions there making
sure your boundaries are right and so on
so i won't go through that in detail
i'll let you read that and absorb it but
i think i've got the message across it
really has to look good if it doesn't
look good then it's bad okay and it's
not acceptable
um
one other graph or visualization i'm
going to
introduce you to is the box and whisker
plot and we'll have a look at it and how
we present that in a moment and the
reason i like the box and whisker plot
is because it works off those quartiles
the median the mean the q1 and the q3
that we talked about in the previous
class
remember in the previous
video i introduced you to some results
from a first-year stats
unit
one of
your units that you're studying from a
previous year and here we had the
summary data for that particular unit so
63.8 was the mean the median was 65 so
half the students got less than 65 half
the students got more
the standard deviation was 15.3 so that
was so roughly speaking the average
variation from the mean
the first quartile was 58 so a quarter of the
students got less than 58
a quarter of the students got more than
74
so the middle 50 percent got somewhere
between 58 and 74 an interquartile range
of 16
that's pretty neat way of describing it
but even better if i could describe that
with something visual so let me show you
how i produce
okay so there's the data that i've just
been showing you there the ranges i'm
going to produce something called a box
and whisker plot which is what this
thing here looks like excel gives you
one of these and i and uh
i explained briefly in the notes there
how you
find it under the menu um
insert the excel
charts and uh look under the histogram
button you'll see box and whisker plots
okay
now the box and whisker basically this
solid blue bit is the the middle 50 percent so
q1 the first quartile
is
58
up to 74 that's q1 to q3 and 65 is your
median so those three numbers there are
the bottom 25 percent cut off the 50 percent cut
off and the 75 percent cut off so that's pretty
useful that's where
if i look at my class half my students
get a score in that range there okay
i've got a few really top students who
score a better than that and my
top number is 91. now that's not the
very top mark
uh
in every case in this case it is
actually the very top mark
the box and whisker plot produced by
excel
makes some decisions using some
algorithms which we won't go into the
details of as to whether
the
range from 91 the top sort of
line to the bottom line should be from
the maximum to the minimum or whether
it should treat some of the numbers as
what's referred to as outliers now this
is an interesting example
it didn't find any outliers at the top
but there's no totally weird student who
got 98 percent here the top mark was 91.
at the other end according to excel's
algorithm it said look
a reasonable range of the data is 34 but
there's actually sort of about eight
students who got really low marks who
sort of don't count shouldn't be treated
in the data they're outliers
now that's excel's judgment and we don't
get any control over that because this
is a fairly simple piece of statistical
software but if when you learn
programming uh in r in future units
you'll be able to produce much better
box and whisker plots than this but
basically this is telling us
half the data got half the students got
between 58 and 74. the top mark was 91.
so
the range of almost all the students was
between 91 and 34
but there was a very small number
eight or so students who got actually
even further below that so but out of
the 600 or so students in this class
that's a very small sample so it gets
wrapped down the bottom
and so that's how i interpret that box
and whisker plot and i think that in one
graph there i've got quite a lot of
information if i had my choice i
probably wouldn't do those little little
dots down the bottom uh i'd make my plot
look a bit neater but it's good enough
why is it called a box and whisker well
that's the box and these little lines
the 91 and the 34 they're the whiskers
you know if you're someone with a
moustache and little whiskers on the end
that's what that is referred to if you
try to figure out the jargon
all right so
uh
there's one okay i might want to produce
a histogram of this data but actually if
i do a histogram
it's actually not a very good one let me
let me do it quickly now to give you an
idea about what i mean so i'm going to
go to the data menu
and i go to data analysis so in the data
menu up here i go to something called
data analysis and hopefully you've got
that and i ask for it if you don't then
you need to do some add-ins
for the analysis toolpak which you can
learn about in your tutorial
uh then i go to histogram and i say okay
and i'm going to send the output to some
new worksheet just so that i don't get a mess
and i'm going to make sure i ask it for
a chart
let's see what comes about what's i
better tell it where my data is there it
is there so it's uh row
eight two
a five eight six i just happen to know
that from memory
okay it's calculating the histogram
oops
here it is okay
what an absolutely ugly histogram
okay
it's terrible
that's exactly what you'll get if you
just ask excel to give you a histogram
what a load of rubbish there's so many
things wrong with that histogram
so i really don't like it at all okay
and you can see why it's pretty obvious
okay the data ranges are all stupid um
yeah we don't even need to waste our
time talking about why that's bad we
need to make a better job of it than
that and so read the notes there and
figure out how to do it and in
particular in this particular case i
actually
decided that well first of all the first
thing i need to do is actually
i need to specify the ranges of the bins
so when i did my histogram i was
actually i had a choice then i chose not
to take it if i go to data analysis
histogram again instead of just
leaving that bin range if i leave the
bin range blank
it will just choose ranges for me and it
chose stupid ones so instead of that
maybe a simple improvement i can make is
let's let's use these numbers here
and use those ranges and see what
it looks like well now it's looking a
lot better isn't it that's much nicer
looking histogram i've sort of got and
in fact uh
for reasons that i'll get to in a moment
i've chosen these ranges very carefully
okay so that's an immediate improvement
i can make and then there's still more
that i need to be able to do to that
but actually the choice of this range
i'll go back to the other
sheet here this is
the final histogram i produced after
mucking around a bit further
and i've done a few things there
first of all i chose this particular
range of bins for a very good reason
namely this is the grades that you
achieve in this unit if you get 49 or
less you get an n if you get 59 or less
you get a p
and 69 is a credit 79 or less is a
distinction and above 79 is a high
distinction okay
so i chose to do my histogram not with
equally spaced bars or anything but
actually to communicate what is most
interesting and relevant to students
namely what grade are you going to get
here or what's the distribution of the
grades
okay so that's what i ended up doing
there
in order to produce histogram this
histogram i changed the the default is
the bin ranges would have come out as
49 59 but if i just change them and type
over those with n p etc then i can end up
with much nicer labels in my histogram
likewise the frequencies that were here
were
the original data
the count of how many people are in
each case each category like over here
you'll see here here we are you see the
bin the values are 89
but i don't want that i want the
percentage of people who got or the
proportion of people who got each
value well that's fine i just took the
89 and i divided by however many data
points i had 583 and the next one 72 and
i divided by 583 so i can change these
numbers once excel's produced them for
me and get a nicer looking histogram so
on this axis i've now got the proportion
of students who got that score so you
see here 34 35 34 percent of students got a
credit
etc uh 13 because 12 percent of students got a
high distinction 15 percent of students got a
fail etc okay so i worked quite hard 80 percent
of my time was spent making
that ugly looking histogram there into
that nice looking histogram there okay
and importantly i chose to do it in a
for the purpose okay
what's the question i'm trying to answer
and how does my
visualization best answer that question
and what communicate that information
and in this example that specifically
had to do with choosing the categories
so that they corresponded to the grades
that students achieve in the unit but
obviously in other contexts it's going
to be a different set of issues you have
to address to get the best possible
visualization
okay
now
switch gears that's all been practical
how to show stuff in data and
particularly today how to visualize i'm
going to take us now to a little bit of
what you might describe as theory and
the reason i'm going to do that is
because we want to have a deeper
understanding of the underlying
principles behind all of these methods
that we use and the deeper understanding
is with the idea that we have random
data and the random data is distributed
according to probability distributions
so you did a little taste of probability
in topic one we're going to do a little
bit more on probability here in topic
two and more in later weeks just so that
you get more comfortable with thinking
about random data and about probability
associated with outcomes
we can think about the the table of
numbers that we've produced the
percentages of people in each
income range for example as a
probability the probability if i chose
someone randomly this is the same table
that i had earlier in the present in the
in the topic but now i've just turned
these numbers here instead of how many
people earn between naught and ten
thousand dollars i've turned it into a
probability a proportion
now i can say if i chose someone
randomly
from this suburb what's the chances that
they'll earn between naught and ten
thousand dollars answer 0.296 that's
a probability that's what we mean by
probability the probability of something
happening
and you'll see the probability that
they'll earn more than a hundred
thousand is far less okay so think about
these numbers not just as however many
people are in that category think
about them as probabilities
notice there's a couple of
characteristics of this thing here that
make it a probability
distribution they're mutually exclusive
you can't earn between 30 and 40 and
between 40 and 50. everybody is there
only once
so you can't be in more than one category
and secondly
it's exhaustive everybody's there if i
add these probabilities up i get one
somebody has to earn
something between naught and a million
dollars or an infinite amount okay
so it's mutually exclusive and it's
exhaustive and that's essential for this
to be classified as a probability
distribution
with some data
you can present the probability
distribution in a table because there's
only a limited number of values it can
take in this case we've categorized it
so there's only like 10 or 12 categories
here
10 or 11 categories other data you might
have it in the form of that of a
probability density function which is
like a smooth curve and we're going to
see an example of that now to finish off
this topic and that's called the normal
distribution
what's the normal distribution about
well you've probably come across this
before in high school
and it's a sort of a bell-shaped
distribution that looks like this and
the reason it's so common and so popular
is because think about this as a range
of possible values of some variable x
could be how much income people earn or
price of houses or
you know how many
children a person has
that's probably not a good example
because the range of that's pretty small
but you know
things that can vary a lot it's a whole
possible set of values that that
variable can take
it's very common for
the probabilities associated with
different possible values of that
variable to follow this bell shape the
reason it's common is that the mean is
in the middle
and if
values that are close to the mean
are much more likely to occur you know
the probability is given by the height
of the bar just like a histogram
the high histogram points are the ones
that are more common so these are the
range of values of x which are much more
likely to occur and they're close to the
mean
the further you get from the mean the
less likely it is that
you'll get those kinds of values
in either the negative direction or the
positive direction and that's
essentially what describes a bell-shaped
curve
in addition to that it's symmetric
whether you're going out in the positive
direction or the negative direction you
get about the same sort of decrease in
probabilities of it occurring
and that's not true of every set of data
in the world i've already given examples
of skewed data like house prices and so
on but actually a lot of data is pretty
close to normally distributed
it's got those characteristics of being
symmetric with
values close to the mean much more
likely to occur and values becoming less
likely as you get further from the mean
so that's why we study it because in the
natural world and in financial markets
and in
business
and economic development and economies
data often ends up looking quite
normally distributed
okay this is a particular type of
normal distribution it's
what's called the standard normal
because it has a mean of zero
so the average is zero and it has a
standard deviation of one
which means that as i go out a couple
of standard deviations it becomes very
unlikely you'll get values
out here and when i get to three
standard deviations above the mean i've
got virtually no chance of that
occurring and by four standard
deviations it's infinitesimally small
and likewise so these can be sort of
thought about as like numbers of
standard deviations above or below the
mean
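
A rough sketch of how quickly those chances shrink, using Python's standard library `statistics.NormalDist` rather than Excel (that substitution is an assumption of this example, not something used in the video). It prints the probability of a standard normal value landing more than 1, 2, 3 and 4 standard deviations above the mean.

```python
from statistics import NormalDist

# Standard normal: mean of zero, standard deviation of one.
standard_normal = NormalDist(mu=0, sigma=1)

# Upper-tail probability P(Z > k) for k standard deviations above the mean.
tails = {k: 1 - standard_normal.cdf(k) for k in (1, 2, 3, 4)}

for k, p in tails.items():
    print(f"P(Z > {k}) = {p:.6f}")
```

Because the curve is symmetric, the chance of being more than k standard deviations below the mean is the same as each value printed here.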
let's take the heights of adults okay
you know in a country okay there'll be
an average height here for males of 1.8
meters or something like that and then
as you go
one standard deviation the standard
deviation of heights might be about five
centimeters so as you go up one
standard deviation to 1.85 meters
it's less likely you'll meet someone
that tall it's even less likely you meet
someone 1.9 meters tall 1.95 meters 2
meters etc
and likewise 1.75 meters is less likely
than 1.8 1.7 meters is even less likely
1.65 is even less likely 1.6 meters even
less since we're talking particularly
about male adults here okay so
that's exactly the sort of very
commonplace distribution that occurs
now
i can work out probabilities associated
with normal distributions using excel
and
the normal distribution is applied
in cases where data can take all sorts
of values so we don't normally talk
about what's the chances of someone
being exactly 1.83 meters tall we
actually prefer to talk in ranges what's
the chances of someone being
between 1.8 and 1.85
or above 1.83 or something so those are
the kind of probabilities we can work
out and we do that by calculating the
area under the curve of this nice
bell-shaped thing so the probability of
someone having
a height more than two standard
deviations above the mean would be this
shaded area here
and so we can calculate that probability
or for a distribution
which has got a mean of 10 and a
standard deviation of 3 the probability
of someone
having a value of less than 5 is this
shaded area here ok
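
Those shaded areas can be sketched in code as differences of cumulative probabilities. This example uses Python's `statistics.NormalDist` as a stand-in for Excel, and takes the height figures (mean 1.8 meters, standard deviation 0.05 meters) from the rough numbers quoted above, so treat them as illustrative assumptions.

```python
from statistics import NormalDist

# Assumed height distribution from the example: mean 1.8 m, standard deviation 0.05 m.
heights = NormalDist(mu=1.8, sigma=0.05)

# P(1.8 < height < 1.85): area under the curve between the two values.
p_between = heights.cdf(1.85) - heights.cdf(1.8)

# P(height > 1.9): upper-tail area, two standard deviations above the mean.
p_above = 1 - heights.cdf(1.9)

# Second example: mean 10, standard deviation 3, probability of a value below 5.
p_below_5 = NormalDist(mu=10, sigma=3).cdf(5)

print(f"P(1.8 < height < 1.85) = {p_between:.4f}")
print(f"P(height > 1.9)        = {p_above:.4f}")
print(f"P(X < 5)               = {p_below_5:.4f}")
```

The cumulative function always gives the area to the left of a value, so a range is one cumulative probability minus another and an upper tail is one minus the cumulative probability.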
how do i calculate those probabilities
well if i want
to work out the probabilities i just
have to use a function in excel called
norm.dist let me just show you
what the different components of it are
you'll get to do this in your
tutorial work so i won't go into the
detail here the first one is the x value
i want to calculate the probability for
so i want to calculate the
chances of x being less than zero
okay
and i've got
a normal distribution with a mean of
zero so that's the mean
and that's the standard deviation
and true just means i want the
probability of being less than it
so that's actually precisely
excuse my terrible normal curve that's
the standard normal and i want to work
out the probability of being less than zero
so i'm actually working out that
probability there
surprise surprise it's a half
remember it's symmetric the probability
of being less than naught
is exactly equal to the probability of
being greater than naught 50 percent or 0.5
or i could take my normal distribution
with a mean of 0 and a standard
deviation of 1 and i could look at say
what's the chances of it being less than
some number up here like 1.96
excuse me i can't write very well with
my pen but you know that's 1.96 then
what you're going to see
well that's
quite a large probability because it's
all of this area here from way over
there up to here it works out to be about
a 97.5 percent
chance
okay
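
For checking the two results outside Excel, here is a small Python sketch. The helper `norm_dist` is a hypothetical function written for this example; it copies the argument order of Excel's NORM.DIST (x, mean, standard deviation, cumulative) and relies on the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

def norm_dist(x, mean, standard_dev, cumulative):
    """Hypothetical helper mirroring the argument order of Excel's NORM.DIST."""
    dist = NormalDist(mu=mean, sigma=standard_dev)
    # cumulative=True gives P(X < x); cumulative=False gives the curve's height at x.
    return dist.cdf(x) if cumulative else dist.pdf(x)

# Chance of a standard normal value being less than zero.
p_below_zero = norm_dist(0, 0, 1, True)

# Chance of a standard normal value being less than 1.96.
p_below_196 = norm_dist(1.96, 0, 1, True)

print(f"P(X < 0)    = {p_below_zero:.4f}")
print(f"P(X < 1.96) = {p_below_196:.4f}")
```

The last argument matters: with it set to false the function returns the height of the curve at x, which is not a probability.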
that's what we can do in excel or we can
do the reverse we can say if i give you
the probability can you tell me what the
x value is
so i say
i've got a standard normal a normal with
a mean of 0 and a standard deviation of
1. and i want to know what's the x value
that gives me a probability of 0.5 of
being below it answer well that's the
reverse of the question we just asked
before zero half the values
are less than zero or if i've got a
normal distribution
uh with a mean of zero and a standard
deviation of one and i want the
probability to be 0.975 what's the x
value answer 1.96 okay so i'm just doing
the same thing as before but in reverse
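
The reverse lookup can be sketched the same way. In Excel the matching function is NORM.INV (probability, mean, standard deviation); the helper `norm_inv` below is a hypothetical Python stand-in with the same argument order, built on `statistics.NormalDist.inv_cdf`.

```python
from statistics import NormalDist

def norm_inv(probability, mean, standard_dev):
    """Hypothetical helper mirroring the argument order of Excel's NORM.INV."""
    return NormalDist(mu=mean, sigma=standard_dev).inv_cdf(probability)

# x value with half of the standard normal distribution below it.
x_half = norm_inv(0.5, 0, 1)

# x value with a probability of 0.975 of being below it.
x_975 = norm_inv(0.975, 0, 1)

# Round trip: feeding x back into the cumulative probability recovers 0.975.
p_check = NormalDist(mu=0, sigma=1).cdf(x_975)

print(f"x for p = 0.5:   {x_half:.2f}")
print(f"x for p = 0.975: {x_975:.2f}")
print(f"round trip p:    {p_check:.3f}")
```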
okay so that's how we do those things in
excel for now park that practice it a
little bit and we'll get to make use of
it in future
topics when we actually start applying
the normal distribution
okay
thank you for your patience and
persevering with that topic i hope you
enjoyed it and i hope you can learn lots
as you work through the tutorial and
other questions