ETC1000 Topic 1b
Summary
TLDR: This video continues the exploration of categorical data, focusing on the concepts of probability, marginal and conditional probabilities, and independence. Using examples related to medical conditions and exercise habits, the speaker explains how to calculate and interpret these probabilities. Additionally, the video covers the importance of understanding these concepts for program evaluation, demonstrated through a job search program. The speaker emphasizes the need for statistical tests to validate findings and introduces advanced topics for further study. Practical tips on working with pivot tables and calculating probabilities are also provided.
Takeaways
- 📊 The session covers categorical data, focusing on different medical conditions and amounts of exercise among 5,000 people, presented in a frequency distribution table.
- 🔢 The frequency distribution table is used to calculate probabilities, turning raw counts into marginal probabilities by dividing each count by the total population (5,000).
- 🧮 Marginal probabilities focus on one characteristic of interest, such as the amount of exercise or type of illness.
- 🔀 Joint or intersection probabilities look at the probability of two characteristics occurring together, such as having diabetes and engaging in minimal exercise.
- 🔍 Conditional probabilities are calculated by conditioning on a particular column or row total, providing insights into the likelihood of one characteristic given another.
- 💡 Conditional probabilities are essential for understanding relationships and potential causation between variables, such as the impact of exercise on diabetes.
- 📐 Independence is a crucial concept where two events are independent if the probability of one occurring is unaffected by the outcome of the other.
- 📈 Independence can be tested by comparing conditional probabilities across different groups to see if they are equal.
- 👩‍🏫 The example of a job search program demonstrates the practical application of these concepts, showing how to evaluate the effectiveness of interventions.
- 🔍 Program evaluation involves comparing the success rates of those who participated in a program versus those who didn't, highlighting the importance of conditional probabilities and independence.
- 📉 In real-world applications, statistical tests are necessary to determine if differences in probabilities are significant or due to chance, which will be covered in future videos.
Q & A
What is a frequency distribution table and why is it used in the script?
-A frequency distribution table is a statistical tool used to organize and display data in a tabular form, showing the frequency or count of occurrences for different categories. In the script, it is used to represent the medical conditions and exercise habits of 5,000 people, allowing for a clear visualization of the data.
What is the difference between marginal and joint probabilities?
-Marginal probabilities refer to the probability of a single event or characteristic occurring, regardless of other variables. Joint probabilities, on the other hand, refer to the probability of two or more events or characteristics occurring simultaneously. In the script, marginal probabilities are found in the margins of the table, while joint probabilities are found in the intersection of rows and columns.
How are probabilities calculated from the frequency distribution table?
-Probabilities are calculated by dividing the frequency or count of each category by the total number of observations. In the script, the total number of observations is 5,000, and each cell in the table is divided by this number to convert counts into probabilities.
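The count-to-probability step can be sketched in a few lines of Python. This is only an illustration using two counts quoted in the transcript (146 people with diabetes and minimal exercise, and 1,784 people doing moderate-to-frequent exercise); the function and variable names are hypothetical and not part of the video.

```python
# Counts quoted in the video (out of 5,000 people surveyed).
TOTAL = 5000
diabetes_minimal = 146   # people with diabetes AND minimal exercise
moderate_total = 1784    # people doing moderate-to-frequent exercise, any condition


def to_probability(count, total=TOTAL):
    """Turn a raw count into a probability (a proportion of the total)."""
    return count / total


p_joint = to_probability(diabetes_minimal)   # joint (intersection) probability
p_moderate = to_probability(moderate_total)  # marginal probability

print(round(p_joint, 4))     # 0.0292
print(round(p_moderate, 4))  # 0.3568
```

Both results match the figures read off the probability table in the transcript.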
What is conditional probability and how is it related to the data presented in the script?
-Conditional probability is the probability of an event occurring, given that another event has already occurred. In the script, it is calculated by taking the joint probability of two characteristics and dividing it by the marginal probability of one of the characteristics, which provides insight into the relationship between the two.
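Written out, the rule described in this answer looks like the following. The symbols D (has diabetes) and M (does minimal exercise) are labels chosen here for illustration; the numbers are the ones quoted in the transcript.

```latex
P(D \mid M) = \frac{P(D \cap M)}{P(M)} = \frac{0.0292}{0.6432} \approx 0.0454
```

The vertical bar is read as "given", and the denominator is the marginal probability of the conditioning characteristic.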
Why is the concept of independence important in analyzing the data in the script?
-The concept of independence is crucial as it helps determine whether the occurrence of one event has any impact on the occurrence of another. If two variables are independent, the probability of one does not affect the probability of the other. In the script, the analysis of exercise and diabetes shows that they are not independent, indicating a relationship between exercise levels and the likelihood of having diabetes.
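The comparison behind this conclusion can be sketched as below, assuming the three diabetes probabilities quoted in the transcript. The helper `looks_independent` and its tolerance are hypothetical choices made for this illustration. Comparing against a fixed tolerance is not a statistical test, which the video defers to later topics.

```python
import math

# Probabilities of diabetes quoted in the video.
p_diabetes_given_minimal = 0.0454
p_diabetes_given_moderate = 0.0314
p_diabetes = 0.0404  # marginal probability, ignoring exercise


def looks_independent(conditionals, marginal, tol=1e-3):
    """Return True if every conditional probability is within tol of the marginal.

    The tolerance is an arbitrary assumption for this sketch, not a test.
    """
    return all(math.isclose(p, marginal, abs_tol=tol) for p in conditionals)


result = looks_independent(
    [p_diabetes_given_minimal, p_diabetes_given_moderate], p_diabetes
)
print(result)  # False
```

Because the conditional probabilities differ from the marginal one, the sketch reports that exercise and diabetes do not look independent, in line with the video.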
How does the script illustrate the application of conditional probabilities in real-world scenarios?
-The script uses the example of a job search program to illustrate the application of conditional probabilities. It shows how the probability of finding a job is different for those who participated in the program versus those who did not, demonstrating the effectiveness of the program in improving employment chances.
What is the significance of the pivot table in the script's discussion of probabilities?
-The pivot table is significant as it allows for the easy manipulation and visualization of data. In the script, it is used to convert raw data into probabilities and to calculate conditional probabilities by showing values as percentages of rows or columns.
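The pivot table's "percentage of column" and "percentage of row" views can be imitated in plain Python. The sketch below makes two assumptions: it collapses every condition other than diabetes into a single hypothetical "other" row, and it derives the counts 56, 3,070 and 1,728 from the probabilities quoted in the transcript (0.0112, 0.6432 and 0.3568 of 5,000). The function names are illustrative, not a spreadsheet API.

```python
# Rows = medical condition, columns = exercise level (counts of people).
# "other" is a hypothetical row that lumps together everyone without diabetes.
counts = {
    "diabetes": {"minimal": 146, "moderate": 56},
    "other": {"minimal": 3070, "moderate": 1728},
}


def percent_of_column(table):
    """Divide each cell by its column total: P(row | column)."""
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        r: {c: n / col_totals[c] for c, n in row.items()}
        for r, row in table.items()
    }


def percent_of_row(table):
    """Divide each cell by its row total: P(column | row)."""
    return {
        r: {c: n / sum(row.values()) for c, n in row.items()}
        for r, row in table.items()
    }


by_column = percent_of_column(counts)
by_row = percent_of_row(counts)

print(round(by_column["diabetes"]["minimal"], 4))   # 0.0454
print(round(by_column["diabetes"]["moderate"], 4))  # 0.0314
print(round(by_row["diabetes"]["minimal"], 4))      # 0.7228
```

The first two values reproduce the conditional probabilities of diabetes given each exercise level. The third flips the condition, giving the probability of minimal exercise given diabetes.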
What statistical concept is briefly mentioned at the end of the script and why is it important?
-Statistical testing is briefly mentioned at the end of the script. It is important because it helps determine whether observed differences in probabilities are statistically significant and not due to chance, providing a more robust analysis of the data.
What is the purpose of the advanced section mentioned in the script for those studying at a higher level?
-The advanced section is intended to provide a deeper understanding of probability distributions and to introduce common probability distributions. It offers a more in-depth exploration of the topic for those who wish to gain a more comprehensive knowledge of the subject.
How does the script use the concept of program evaluation to discuss the effectiveness of a job search program?
-The script uses program evaluation to compare the employment outcomes of participants and non-participants of a job search program. By comparing the conditional probabilities of finding a job for both groups, it evaluates the effectiveness of the program in improving employment rates.
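The job search comparison can be reproduced directly from the counts given in the transcript (24 participants, 23 of whom found work; 76 non-participants, 26 of whom found work). The variable names below are hypothetical.

```python
# Counts from the job search example in the video.
participants, participants_employed = 24, 23
non_participants, non_participants_employed = 76, 26

# Conditional probabilities of finding a job, given participation status.
p_job_given_program = participants_employed / participants
p_job_given_no_program = non_participants_employed / non_participants

# Marginal probability of finding a job, ignoring participation.
p_job = (participants_employed + non_participants_employed) / (
    participants + non_participants
)

print(round(p_job_given_program, 2))     # 0.96
print(round(p_job_given_no_program, 2))  # 0.34
print(round(p_job, 2))                   # 0.49
```

The three values are clearly unequal, which is the evidence the video uses to say that finding a job is not independent of participating. As the speaker notes, this alone does not rule out selection effects.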
Outlines
📚 Introduction to Categorical Data and Frequency Distribution
The speaker introduces the continuation of the topic on categorical data and emphasizes the importance of watching the first half. The discussion centers around a two-way frequency distribution table representing medical conditions and exercise amounts among 5,000 people. Probabilities are introduced as a foundational concept for analyzing complex data.
📊 Understanding Marginal and Joint Probabilities
The speaker explains marginal probabilities, which focus on one characteristic of interest, and joint or intersection probabilities, which consider two characteristics simultaneously. Examples from the data table, such as the probability of having diabetes and minimal exercise, are used to illustrate these concepts.
🔄 Conditional Probabilities and Pivot Tables
The concept of conditional probabilities is introduced, which considers the probability of one event given another. The speaker demonstrates how to calculate conditional probabilities using pivot tables, converting percentages to probabilities, and explains the significance of these probabilities in data analysis.
📉 Independence in Probability
The speaker delves into the concept of independence, where two events are independent if the occurrence of one does not affect the probability of the other. Using the example of diabetes and exercise, the speaker explains how to determine independence and its implications for understanding relationships between variables.
📝 Program Evaluation and Practical Applications
The final part covers an example of evaluating a job search program to illustrate the importance of understanding probabilities and independence. The speaker highlights the relevance of these concepts in real-world scenarios, such as public health and program evaluation, and hints at more advanced statistical tests to come.
Keywords
💡Categorical Data
💡Frequency Distribution
💡Probability
💡Marginal Probability
💡Joint Probability
💡Conditional Probability
💡Independence
💡Program Evaluation
💡Statistical Test
💡Pivot Table
💡Uncertainty
Highlights
Introduction to the second half of Topic One, emphasizing the importance of watching the first half for context.
Discussion on categorical data and its presentation in a two-way frequency distribution table.
Explanation of how to represent data about medical conditions and exercise levels for 5,000 people in a table format.
Introduction to the concept of probabilities as a foundational idea for analyzing risk and uncertainty.
Description of how to convert frequency distribution data into probabilities by dividing by the total number of observations.
Definition and explanation of marginal probabilities, focusing on the probability of one characteristic of interest.
Introduction to joint or intersection probabilities, which represent the probability of two events occurring together.
Explanation of conditional probabilities and how they are calculated from joint probabilities and marginal probabilities.
Demonstration of how to calculate conditional probabilities using pivot tables and percentage calculations.
Discussion on the importance of understanding the relationship between variables, such as exercise and diabetes, through conditional probabilities.
Introduction to the concept of independence and its significance in determining whether two variables are linked.
Example illustrating how to evaluate the effectiveness of a job search program using conditional probabilities.
Explanation of how to determine if a training program is independent of getting a job by comparing conditional probabilities.
Discussion on the potential issues with program evaluation, such as selection bias, and the need for more sophisticated statistical tests.
Encouragement for higher-level students to explore advanced sections on probability distributions for a deeper understanding.
Acknowledgment of the challenges of working from home and a light-hearted moment involving the speaker's wife.
Transcripts
hello again
we're on the second half of topic one
make sure you watch the first half first
because otherwise this second half may
not make a lot of sense to you
so if you take a
look at the notes i will share my screen
and we will work our way through the
second half there
we're going to think
as you recall
about categorical data and in the
example we've got on the screen there is
the one we've been looking at so far
we've got different medical conditions
that people have
and also different amounts of exercise
that they're embarking on
and so we can present
that information about these 5 000
people in the form of a table like this
which we call a frequency distribution
this is a two-way frequency distribution
because there are two characteristics
that we're interested in
and so each of the rows represents a
medical condition and each of the
columns represents the amount of
exercise people get
so of our 5 000 people for example 43 of
those people
engage in a moderate to a large amount
of exercise and have a heart disease
okay and uh at the other extreme uh
we've got uh 323 people who do minimal
exercise and who suffer from depression
as their primary medical condition
okay so that's what that data looks like
now we're going to
take
this familiar sort of way of presenting
data and think about it now
as
an idea of probabilities and the reason
we do that is because if we want to make
the world if we want to start analyzing
more complex data we're going to need
some sort of
sort of more proper tools at our
disposal rather than just using pivot
tables and we're going to need some
um
more fancy methodologies and in order to
do that we need a sort of theoretical
foundation to the way in which we look
at data and the likelihood of different
things occurring and the theoretical
foundation we have is a probability
and you know probabilities are pretty
important because the world is full of
uncertainty as you know there's a lot we
don't know
and so the way in which
we capture uncertainty at least one way
in which we capture uncertainty and risk
and all those different things is with
probabilities what's the likelihood of
this or that occurring
and so that's really the foundational
idea
for
actually the way we analyze risk and
uncertainty in the world and people are
always sitting there making judgments
based on an estimate of a probability
so we as data analytics people need to
have a bit of an idea about what
probabilities actually are
and at their most basic most basic level
they're really simple probability is
simply the proportion of times something
happens that's all it is so of these
5 000 people we can divide all of those
numbers in that table there by 5 000
and turn them all into probabilities and
that's what these numbers here are
okay so for example uh let's take an uh
let's take a person who
is engaging in moderate or frequent
amount of exercise and let's just look
at the bottom row here so we'll ignore
the medical condition just look at all
of the people
all
1784 people
who embarked in moderate amount of
exercise
that
1784 out of 5 000 is actually
35.68 percent of the people in other words
in probability sense if you randomly
pick one of these people the probability
that that person
will do a moderate or large amount of
exercise is 0.3568
okay so that's a statement of
probability you just made
or if you randomly picked a person
what's the chances they've got diabetes
0.0404 that number there
okay
the number in the last column for
diabetes
right okay so that's the idea of
uh
probabilities in the most basic sense
now in these two examples you'll notice
i've only looked at one characteristic
of interest i've only looked at the
amount of exercise or at
the type of illness that they have
that's what's called a marginal
probability it's called that because
it's in the margins of the table
that's the simplest way to remember it
anyway but importantly it's saying even
though you might know information about
two characteristics of the people we
just want to know a probability about
one of those two characteristics and so
we call that a marginal probability
often though we're interested in some of
the numbers that are in the middle of
the table and those numbers are our
joint or intersection probabilities
let's take an example of that okay so
what's the probability of
somebody having diabetes
and engaging in minimal exercise
answer
0.0292 go to the diabetes row
and
the
minimal exercise column and you get
0.0292
back to the table up the top here
there's 146 of those people
out of the 5 000 and that's how when you
divide by 5000 that's how you get the
0.0292
so that is a
joint or a intersection probability
and you've come across that in high
school and you've probably seen the
symbol the upside down u symbol to
indicate intersection
both things have to be true
for that probability
for that particular event to occur they
have to have both diabetes and minimal
exercise
okay
and likewise we could choose diabetes
and moderate to high frequency exercise
and we get 0.0112
okay just by looking at the second of
the two numbers in the diabetes row
all right then we can do any one of
those
marginal joint probabilities sorry
intersectional joint
probabilities now there's another type
of view which we think is pretty
interesting because this is where we
start getting a clue about what causes
what in the world and what connections
are between things and that's the idea
of a conditional probability
let's go back to condition to
calculating a percentage of column table
remember how to do that i go to my pivot
table
and uh
i've got one right here okay i just go
to the cell and i click on the right
mouse button and show values as
percentage of column or i can do
percentage of row or percentage of total
if i want conditional probabilities i'm
going to need conditional
um columns or row totals so first of all
let's do percentage of column
that's what we've got here
so these numbers here are just the same
as what's in my table except that
they're expressed not as percentages but
as probabilities that's easy i just
click the right mouse button having
highlighted them all and format the
cells and instead of making it a
percentage i'm going to format it as a
number
then so i get
those different
probabilities there okay so you can
convert pivot tables from percentages to
numbers to probabilities quite easily
what do these numbers mean
well these are what's called conditional
because you're saying we're only going
to look at the people in the first
column who've done minimal exercise if a
person's done minimal exercise what's
the probability they've got diabetes
answer
0.0454 so look at the first column and
the diabetes row 0.0454
okay
so that's a conditional probability
given so we use that phrase that word
given that this is true or
considering a person who is only
doing minimal exercise what's the
chances they'll have depression 0.1084
etc
so what about someone who's done
moderate to frequent exercise that's the
second column for example their
probability of having diabetes is 0.0314
lower than the probability for the
person who's done minimal exercise
okay so these are examples of
conditioning on a particular column and
calculating what's called conditional
probabilities
each column of this table is a
probability distribution in its own
right each person in the minimal
exercise category is in one of these
categories here
has one of those conditions or no
medical condition
we write down the probabilities with
this little vertical line here to say
it's a conditional probability so the
probability of having diabetes given
that you had minimal exercise is 0.0454
so when you see that vertical line you
just replace it with the word given
now
a little bit of maths to show us how we
got from one
set of probabilities to the other
we got the conditional
probabilities by doing percentage of
column so in other words by taking a
particular column in this table and
dividing each of these values by the
total for their column
so diabetes of 0.0292 divided by 0.6432
will give me
0.0454 there
so that's actually what we show you in
this
expression here
so the formula for calculating a
conditional probability
is to calculate the intersection
probability and divide by
the probability the marginal probability
of the conditioning variable
in this case minimal exercise
that's the formula
and
you can see its application but
importantly actually it's reasonably
straightforward intuitively because it
derives exactly from the
structure of the pivot table that's the
initial pivot table and just to get it to
a percentage of column take each value
and divide by the total for their column
and that's essentially dividing by the
marginal
probability
dividing the intersection probabilities
by the marginal probabilities to get the
conditional probabilities
okay so that's the logic of it you might
need to listen to that and think
through that again
but that's the basic idea
everything's the same if i want
percentage of row i go back to my pivot
table and say oops for some reason
instead of that i want you to show me
show values as percentage of row
there and so now all my rows add to one
so now i'm conditioning on having a
particular condition so given that you
have depression
what's the probability that you embark
in minimal exercise answer 0.655
for example okay
so that's what percentage of row is it's
the same idea it's still a conditional
probability but it's flipped around what
are you conditioning on
so now
you've got for example the probability
of minimal exercise given diabetes
rather than the probability of diabetes
given minimal exercise
percentage of row instead of
percentage of column
how do we calculate them
easy just divide the values in the
original table by the total for that
particular row in other words divide the
intersection probability by the marginal
probability and you get
the conditional probability
that's an example given to you right
here so you can go back and have a look
at the tables and confirm that for
yourself
okay that's all fun
well maybe not fun but just to review
we're looking at probability because
it's the way we describe uncertainty in
the world and we've got three basic
types of probability that the pivot
table perfectly illustrates for us we've
got
marginal probabilities the probability
of one particular characteristic of
interest we've got
joint probabilities probability of both
things being true at once
diabetes and minimal exercise for
example
and then we've got conditional
probabilities where we say given that
one thing is true what's the probability
of the other thing being true
okay so those are our three types of
probability
now we're going to use those to
explore a very important phenomenon and
that is this idea of independence so
what's independence all about this is
our main kind of final sort of key point
if you like for topic one so concentrate
hard even if you're
finding it a little bit hard to keep up
with everything so far go back over it
make sure you're on top of it and then
make sure you really nail this stuff
here
let's go back to the condition the
percentage of column probability table
that we have here
and let's just have a look at the
chances of having diabetes that row
there corresponding to diabetes
and so we've got three different
probabilities of getting diabetes
depending on what you condition on
if you condition on minimal exercise
then
you get a probability of getting
diabetes of 0.0454 you'll see that
in this
line here okay given that you do minimal
exercise you're in that first column the
probability of diabetes is 0.0454
the second column given that you do
moderate exercise
the probability of having diabetes is
0.0314
and the last column
is the marginal probability of just
having diabetes if you just don't pay
any
you look at all the people whether they
do minimal or lots of exercise just
what's the chance of having diabetes
answer
just over four percent
so you'll see i've got three different
probabilities of having diabetes
depending upon how much exercise i do so
there's obviously a link between
exercise and having diabetes
now the idea of independence
is that independence will mean
that there's no link between these two
characteristics of interest
an independent thing says it doesn't
matter how much exercise you do
you won't it makes you no more or less
likely to have diabetes
so
if that was true that would be a very
important piece of information because
it would tell us sending people off to
do lots of exercise isn't going to help
it's not going to reduce the diabetes of
the population so we better find out if
it's true or not
so the idea of independence is that two
events are independent if the
probability of one occurring is
unaffected by
the outcome of the other so whether you
exercise a lot or exercise not much
doesn't make any difference to the
probability of having diabetes that's
what would need to be true in order for
this to be independent
so in other words we would need the
probability of having diabetes to be the
same
whether you do minimal exercise or
frequent exercise
okay so the probability of diabetes
given that you do minimal
exercise would need to be the same as the
probability given that you do moderate
exercise and it'll be the same as the
probability of just having diabetes
overall
that's what we want to see occur
for it to be independent
well guess what
they're not equal
you're much more likely a reasonable
amount more likely to have diabetes if
you do minimal exercise four and a half
percent versus three percent for those
that do moderate exercise and four
percent overall for the population
so because those three probabilities are
not equal
then we conclude that the amount of
exercise you get
is not independent
of
having diabetes okay so there's a
connection between these two
and that's important obviously
this is a simple little study but if we
did this more comprehensively and so on
that would be important to public health
messaging
now i'm just going to run quickly
through one more example
a different set of data different just
to illustrate and show you how important
this is and i'm going to go fairly
quickly through this because it's a bit
of a repeat of what you've just done but
as i say go back through it again if you
if you don't follow it all the first
time
now we're trying to evaluate whether or
not
we can do something to help people find
employment so we've got a job search
program that we put people through
and so we've got 100 people
and some of those people participated in
this job search program 24 of them and
76 people didn't participate
okay so that's what we've got
and
all of these people started the year
unemployed
by the end of the program six months
later
hopefully some of these people have
found jobs
now it turns out
that of those hundred people by the end
of the six months
about half of them 49 of the 100 did
find work
and 51 were still unemployed
so there's been
some progress
but
a lot of the people here didn't do the
program and so so we need to figure out
whether or not and in fact quite a few
people who didn't take part in the
program okay
managed to get themselves a job
26 out of the 76 people still managed to
get a job even though they didn't take
part in any training
so do we really need the training
in other words does the training help
people to get work is the training
independent of
getting a job
that's the question you're answering
okay how do we do it well
we can look at the conditional
probabilities
given
so first of all
what's the probability of getting a job
sometime in the next six months
well answer 0.49 okay 49 percent of the people
so that's the marginal probability of
finding a job
okay 49 of people were able to get a job
if you did the program
this is impressive 23 out of the 24
people that did the program got a job
so given so this is a conditional
probability that you participated in the
program your chance of getting a job is
0.96
if you didn't participate in the program
there's 76 of them well some of them did
get a job but only 26 of them so in
other words about 0.34 is the probability
the conditional probability of getting a
job if you didn't do the program so
given that you did not participate
what's the chances of getting a job
so
is doing the training independent of
getting a job well
are the probabilities of employment
given that you participated the same as
the probability of employment given that
you did not participate and the overall
probability of employment
absolutely not 0.96 0.34 and 0.49 are
not equal
okay clearly
doing the training program has made
you much more likely to get a job 96
percent chance of getting a job versus 34 percent for
those that didn't do the training the
training program worked
okay so this is a great example of
examining independence to try and give
us an idea about how the world
works
and in fact this is an example of what
we refer to as program evaluation this
is just a general comment for you to to
help you put it in context you might
feel we're doing something really nitty
gritty and rather uninteresting here but
this is the basic idea behind pretty
much what everybody does whenever we
evaluate anything
does it work or not and hopefully we
evaluate things because you know the
government wants to fix problems whether
it's make people healthier or get people
jobs or
you know help reduce crime or whatever
it might be they're going to introduce
programs to do it you'd like to know
whether the programs work or not
well the basic approach you take to
evaluating a program is exactly what i
just described in this example here
namely
you look at some of the people that did
the program and you look at some people
who didn't do the program and you look
at the probability of success you know
getting a job in this case for those
that did the program versus probability
of success for those that didn't well it
better be better for the ones that did
the program
and if it is
then the program at least partially
works in this case really well you know a 96 percent
success rate is pretty impressive
okay
so that's the idea of program evaluation
it's precisely this but more complicated and
you can probably think for yourself some
of the problems with this particular
example here
which i'll just hint at which is
who chose
who participates in the program or not
maybe all of the highly motivated people
took part in the program and all the
lazy people didn't bother
maybe that's the reason why there was
such high success rate here's a little
clue for you that this is not as neat as
it looks but we'll leave that as a
question for you to ponder and think
about and talk about
okay now just briefly
to finish up this video
so far everything we've done so far has
just been you know calculating
probabilities but actually in the real
world you've only got a sample of people
and these are estimates and so little
differences in probability here may not
actually be real they might just be
flukes so we need something a bit more
fancy a statistical test to decide
whether these differences in conditional
probabilities are are big enough to be
sort of robust and to be kind of
legitimate not just due to chance and
that's what we'll do in later later
videos when you
learn more about statistical tests and
so on
okay my wife's just come to join this
video you can say hello to her
another time okay now lastly these are
the hazards of working from home as you
i'm sure are well aware of if you've
been studying from home for a while
lastly for those of you who are
studying this at a higher level i would
encourage you to have a look through the
advanced section of the topic
this is where we go into a little bit
more about probability distributions and
a couple of very common probability
distributions just to give you a taste
of future things that you'll look at in
times to come okay thank you very much
all the best