p-hacking: What it is and how to avoid it!
Summary
TL;DR: In this StatQuest episode, Josh Starmer explains p-hacking, a practice where researchers manipulate data analysis to achieve statistically significant results, leading to false positives. He uses the example of testing candidate drugs for reducing virus recovery time to illustrate how p-hacking can occur. The video warns against cherry-picking data and emphasizes the importance of proper sample size determination through power analysis to avoid false positives. It also introduces methods like the false discovery rate to compensate for the multiple testing problem.
Takeaways
- P-hacking refers to the misuse of statistical analyses to produce false positives, which can lead to incorrect conclusions in research.
- In the context of drug testing, P-hacking can occur when researchers keep testing different drugs until one appears to work, based on a statistically significant p-value.
- The script illustrates the concept using a normal distribution of recovery times from a virus, showing how selecting specific data points can lead to misleading results.
- Cherry-picking data is discouraged: researchers should not test and report only the data that supports their hypotheses.
- The script explains the multiple testing problem: conducting many tests increases the likelihood of obtaining false positives under the conventional p-value threshold of 0.05.
- To combat P-hacking, the script suggests methods like the false discovery rate, which adjusts p-values to account for multiple comparisons and reduces the chance of false positives.
- Power analysis is introduced as a way to determine the appropriate sample size before conducting an experiment, helping prevent P-hacking by ensuring sufficient data to detect true effects.
- The script warns against the temptation to add more data to a study after observing a p-value close to the significance threshold, as this increases the risk of false positives.
- Any method that adjusts for multiple testing must be given all p-values from all tests to ensure the validity of the statistical analysis.
- The video concludes with a call to action for viewers to learn more about statistical methods to avoid P-hacking, such as through further StatQuest videos on power analysis.
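The multiple-testing takeaway can be checked with a line of arithmetic: if each test on null data has a 5% false positive rate, the chance of at least one false positive across k independent tests is 1 − 0.95^k. A minimal sketch (the function name and the choice of k values are mine, not from the video):

```python
def prob_at_least_one_false_positive(k, alpha=0.05):
    """Chance of at least one false positive across k independent
    tests on null data, each using significance threshold alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 100):
    print(k, round(prob_at_least_one_false_positive(k), 3))
```

With 10 tests the chance of at least one false positive is already about 40%, which is why the video insists on adjusting for multiple comparisons.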
Q & A
What is P-hacking?
- P-hacking refers to the misuse and abuse of analysis techniques, which can lead to being fooled by false positives.
Why is P-hacking a problem in statistical analysis?
- P-hacking is a problem because it can lead to false positives, which means incorrectly rejecting the null hypothesis when there is no actual effect.
What is the significance level typically used in statistical tests?
- The significance level typically used in statistical tests is 0.05, meaning that when the null hypothesis is true, there is a 5% chance of a false positive.
What is the multiple testing problem?
- The multiple testing problem occurs when many tests are performed: the more tests run, the more false positives turn up by chance.
How can the false discovery rate help address the multiple testing problem?
- The false discovery rate adjusts p-values to account for multiple comparisons, usually resulting in larger p-values and reducing the number of false positives.
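The video defers the details to a separate StatQuest, but one widely used procedure for controlling the false discovery rate is Benjamini-Hochberg. The sketch below is a standard textbook implementation, offered as an assumption about what the adjustment might look like rather than as the video's own method:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the original order.

    Each sorted p-value is scaled by m / rank; a running minimum taken
    from the largest rank downward keeps the adjusted values monotone
    and capped at 1.0.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):       # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.005, 0.009, 0.05, 0.5]))
```

As the answer above says, the adjusted p-values are never smaller than the originals, and some borderline results get pushed back above 0.05.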
What is a power analysis and why is it important?
- A power analysis is performed before an experiment to determine the appropriate sample size needed to have a high probability of correctly rejecting the null hypothesis.
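As an illustration of what a power analysis produces, here is the standard normal-approximation formula for the per-group sample size of a two-sample comparison at significance 0.05 and power 0.8. The formula and the default values are textbook conventions, not taken from the video:

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Per-group n for a two-sample test, normal approximation.

    effect_size is Cohen's d: the difference in means divided by the
    common standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)

print(sample_size_per_group(1.0))  # large effect: few people needed
print(sample_size_per_group(0.5))  # medium effect: many more needed
```

Note how quickly the required sample size grows as the expected effect shrinks, which is why three people per group is rarely enough.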
Why is it incorrect to add more data to a test with a p-value close to 0.05?
- Adding more data because a p-value is close to 0.05 means basing the decision on two p-values instead of the single one the 5% threshold assumes, which raises the chance of a false positive above 5%.
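The inflation can be demonstrated by simulation: test two null groups, and whenever the p-value lands just above 0.05, add one measurement per group and test again. The sketch below uses a known-sigma z-test and made-up parameters (mean 10, sigma 2.5) so it needs only the Python standard library; it is an illustration of the idea, not the video's experiment:

```python
import math
import random
from statistics import NormalDist

def z_test_p(group1, group2, sigma):
    """Two-sided p-value comparing two means, sigma assumed known."""
    se = sigma * math.sqrt(1 / len(group1) + 1 / len(group2))
    z = abs((sum(group1) / len(group1) - sum(group2) / len(group2)) / se)
    return 2 * (1 - NormalDist().cdf(z))

def simulate(n_sims=10_000, n=3, sigma=2.5, seed=0):
    """Fraction of null comparisons declared significant when a
    borderline p-value triggers one extra measurement per group."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(10, sigma) for _ in range(n)]
        b = [rng.gauss(10, sigma) for _ in range(n)]
        p = z_test_p(a, b, sigma)
        if 0.05 <= p < 0.10:           # "so close!" -> peek and add data
            a.append(rng.gauss(10, sigma))
            b.append(rng.gauss(10, sigma))
            p = z_test_p(a, b, sigma)
        if p < 0.05:
            hits += 1
    return hits / n_sims

print(simulate())  # tends to come out above the nominal 0.05
```

Without the peeking step the rate sits at the nominal 5%; the single extra look is enough to push it higher.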
What should one do when they get a p-value close to but not less than 0.05?
- When a p-value is close to but not less than 0.05, one should conduct a power analysis to determine the correct sample size rather than adding more data to the existing test.
What is the role of the null hypothesis in the context of P-hacking?
- In the context of P-hacking, the null hypothesis is that there is no difference between groups or conditions. P-hacking can lead to the incorrect rejection of this null hypothesis due to false positives.
How can one avoid P-hacking in their statistical analyses?
- To avoid P-hacking, one should calculate p-values for all tests, adjust for multiple comparisons using methods like the false discovery rate, and conduct power analyses to determine appropriate sample sizes.
Outlines
Understanding P-Hacking
Josh, the host of StatQuest, introduces the concept of P-hacking: the misuse of statistical analysis that can lead to false positives. The video uses the example of testing drugs that might reduce recovery time from a virus. Drugs A and B show no significant effect, but Drug E appears promising with a p-value of 0.02, leading to rejection of the null hypothesis. This is a P-hacking scenario, however, because testing many drugs and reporting only the one with the smallest p-value greatly increases the likelihood of a false positive.
The Multiple Testing Problem
The video delves into the multiple testing problem, illustrating how running many tests leads to false positives. It explains the concept using a normal distribution curve representing recovery times from a virus. By repeatedly drawing small groups of three people and comparing them, a p-value that suggests a significant difference will eventually turn up, even though all the samples come from the same distribution. The video emphasizes not cherry-picking data and using methods like the false discovery rate to adjust p-values and reduce false positives.
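The outline above translates directly into a simulation: repeatedly draw two three-person samples from one distribution and count how often the comparison crosses the 0.05 threshold. The sketch below uses a known-sigma z-test and a Normal(10, 2.5) recovery-time distribution, an assumption chosen to roughly match the 5-to-15-day range described in the video:

```python
import math
import random
from statistics import NormalDist

def z_test_p(group1, group2, sigma):
    """Two-sided p-value for a difference in means, sigma assumed known."""
    se = sigma * math.sqrt(1 / len(group1) + 1 / len(group2))
    z = abs((sum(group1) / len(group1) - sum(group2) / len(group2)) / se)
    return 2 * (1 - NormalDist().cdf(z))

def false_positive_rate(n_tests=10_000, n=3, sigma=2.5, seed=42):
    """Compare many pairs of samples drawn from the SAME distribution
    and report how often p < 0.05 anyway."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        a = [rng.gauss(10, sigma) for _ in range(n)]  # no-drug group
        b = [rng.gauss(10, sigma) for _ in range(n)]  # "drug" group
        if z_test_p(a, b, sigma) < 0.05:
            false_positives += 1
    return false_positives / n_tests

print(false_positive_rate())  # close to 0.05, as the video predicts
```

About 5% of the null comparisons come out "significant", so someone who keeps testing until a small p-value appears is guaranteed to find one eventually.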
Avoiding P-Hacking in Practice
The final paragraph discusses how to avoid P-hacking in practical scenarios. It warns against adding more data to a test just because the initial p-value is close to the significance threshold, as this can lead to false positives. The video suggests conducting a power analysis before the experiment to determine the appropriate sample size, which helps in avoiding being misled by false positives. It concludes by encouraging viewers to use all p-values for adjustments and to plan experiments carefully to prevent P-hacking.
Keywords
- P-hacking
- P-value
- Null Hypothesis
- False Positive
- Multiple Testing Problem
- False Discovery Rate (FDR)
- Power Analysis
- Sample Size
- Statistical Significance
- Type I Error
Highlights
P-hacking is the misuse and abuse of analysis techniques that can lead to false positives.
The video explains how to avoid P-hacking by understanding its implications.
An example of drug testing illustrates the concept of P-hacking.
The Drug A and Drug B examples show how p-values can be misleading if not analyzed properly.
Drug E appears to reduce recovery time, leading to a p-value of 0.02, which is considered significant.
The concept of false positives is introduced, explaining that a 5% significance threshold leads to about 5% false positives on null data.
The multiple testing problem is discussed as a source of false positives when many tests are conducted.
The false discovery rate is introduced as a method to compensate for the multiple testing problem.
The importance of not cherry-picking data and including all p-values for proper analysis is emphasized.
A scenario demonstrates how adding more data to a borderline-significant p-value can lead to P-hacking.
Power analysis is mentioned as a method to determine the correct sample size before conducting an experiment.
The video concludes with a summary of how to avoid P-hacking and the importance of proper statistical analysis.
The presenter encourages viewers to subscribe and to support the channel through Patreon and other means.
The video promises to cover power and power analyses in future episodes to help determine appropriate sample sizes.
P-hacking is compared to a game of chance: the more tests conducted, the higher the chance of false positives.
A practical example of P-hacking is given, where additional data collection leads to a statistically significant but incorrect conclusion.
Transcripts
P-hacking, don't do it. If you do it, it's a shame. StatQuest! Yeah.

Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're gonna talk about p-hacking: what it is and how to avoid it.

Note: this StatQuest assumes that you are already familiar with p-values. If not, check out the Quest.

Imagine there was a virus and we wanted to develop a drug to reduce the time it took to recover from it. So we created a bunch of candidate drugs, and we tested each one to find out if any of them worked. We measured how long it took for three people to recover from the virus without any drugs, and then we gave three people Drug A and measured how long it took them to recover. Just by looking at the graph, it appears that Drug A did not shorten recovery very much, if at all. So we measure how long it takes three more people to recover without any medicine, and then we give three people Drug B and measure how long it takes them to recover. Again, just by looking at the graph, it doesn't look like Drug B helped people recover any faster than people without any medicine. So we just keep testing drugs until we get one that really looks like it does a good job.

And at long last, it looks like Drug E does a great job reducing the amount of time it takes to recover from the virus. So we calculate the means of the two groups and do a statistical test to compare the means, and we get a p-value equal to 0.02. And since 0.02 is less than 0.05, we reject the null hypothesis, which is that there is no difference between not taking a drug and taking Drug E. BAM!

No. No BAM. We just p-hacked. P-hacking refers to the misuse and abuse of analysis techniques and results in being fooled by false positives. However, instead of feeling great shame, let's learn about p-hacking so we don't do it again.
Imagine we measured recovery times for a whole lot of people who did not take any drugs to fight the virus, and then we fit a normal distribution to all of the recovery times. The red area under the curve indicates the percentage of people that recovered from the illness within a range of possible values. For example, 2.5 percent of the area under the curve is for durations less than 5 days, indicating that 2.5 percent of the people recovered in less than 5 days. In contrast, 95 percent of the area under the curve is between 5 and 15 days, indicating that 95 percent of the people recovered between 5 and 15 days.

Now, if we only asked three people, represented by light blue circles, how long it took them to recover from the illness, there's a good chance all three would say something between 5 and 15 days. And if we asked a different set of three people, represented by dark blue circles, there's a good chance all three would also say something between 5 and 15 days. Just like before, we can plot these two groups of people on a graph, calculate the mean values for the two groups, compare those two means, and get a p-value equal to 0.86. And because 0.86 is greater than 0.05, we would fail to see a significant difference between the two groups of observations. In other words, the p-value did not convince us that the observations came from two different distributions. And that makes sense, because both groups of people came from the exact same distribution.
Now imagine we asked another group of three people how long it took them to recover, and we plotted their recovery times and mean value on a graph. Then we asked another group of three people how long it took them to recover, and we plotted their recovery times and mean value on the graph. Again, we do a test to compare the two means, and we get a p-value equal to 0.63. And since 0.63 is greater than 0.05, we would fail to see a significant difference between the two groups of observations. And again, this is good, because both sets of observations came from the exact same distribution.

Now imagine we just keep taking two groups of three from the same distribution and testing to see if they are different. Note: these two groups almost look like they could be different, but the p-value equals 0.06, which is greater than the standard threshold for significance, 0.05. So we just keep going, and sooner or later we will get something like this. When we compare the two means, the p-value equals 0.02, and that tells us that there is a statistically significant difference between the two groups, suggesting that the data came from two different distributions. Which is incorrect, since we know that both samples came from the same distribution: the small p-value is a false positive.

Note: you may remember from the StatQuest on interpreting p-values that setting the threshold for significance to 0.05 means that approximately 5% of the statistical tests we do on data gathered from the same distribution will result in false positives. That means if we did 100 tests, we would expect about 5 false positives, or 5 percent, and if we did 10,000 tests, we would expect about 500 false positives. In other words, the more tests we do, the more false positives we have to deal with.

Oh no, it's the dreaded Terminology Alert! Doing a lot of tests and ending up with false positives is called the multiple testing problem. The good news is that there are many ways to compensate for the multiple testing problem and reduce the number of false positives.
One popular method is called the false discovery rate. Shameless self-promotion: I have a whole StatQuest on the false discovery rate; the link is in the description below. The main idea is that you input the p-values for every single comparison, and then the false discovery rate does some surprisingly simple mathematics, and out come adjusted p-values that are usually larger than the original p-values. Ultimately, some of the tests that were false positives before end up with adjusted p-values greater than 0.05.

Like I said, I have a whole StatQuest on this method if you want to know more details. The important thing to know now, however, is that in order for false discovery rates, or any other method that compensates for multiple testing, to work properly, you have to include all of the p-values for all of the tests, not just the one that looks like it will give you a small p-value. In other words, don't cherry-pick your data and only do tests that look good. BAM!
Now let's talk about a slightly less obvious form of p-hacking. Remember these two groups? The p-value was 0.06. Now, we know that both groups came from the same distribution, but typically, when we are doing experiments, we don't know if they both came from the same distribution or different ones. And let's be honest, we usually hope that the observations come from two different distributions. In this example, we are looking for a new drug to help people, so we want to see an improvement. So when we get data like this, where the p-value is close to 0.05 but not less than it, it is very tempting to think, "Hmm, I wonder if the p-value will get smaller if I add more data." So we add one more measurement to each group, and now, when we calculate the p-value, we get 0.02, which is less than 0.05, so we can report a statistically significant difference. Hooray, we got what we wanted, right?

No! We p-hacked again. Wah wah. When a p-value is close to 0.05, like what we had with the original data, there's a surprisingly high probability that just adding one new measurement to both groups will result in a false positive. In other words, even though using a threshold of 0.05 should only result in 5% of the bogus tests giving us false positives, the theory assumes that we only calculate a single p-value to make a decision. In this case, we calculated two p-values to make our decision: the one at the start, which was 0.06, and then, because the first p-value was close to 0.05, we added more data and calculated a second p-value. In this case, we know all of the measurements came from the exact same distribution, so we know this is a false positive.

So how do we keep from making this mistake? In order to avoid it, we need to determine the proper sample size before doing the experiment, and that means we need to do a power analysis.
A power analysis is performed before doing an experiment and tells us how many replicates we need in order to have a relatively high probability of correctly rejecting the null hypothesis. Cool, where can I learn more about doing a power analysis? In the next StatQuest, we'll talk about power and power analyses to determine the appropriate sample size. BAM!

In summary: if you have a bunch of things you want to test out, like a bunch of different drugs that might help people recover from a virus, don't collect all the data but only calculate a p-value for the one time things look different. Instead, calculate a p-value for each test and adjust all of the p-values with something like the false discovery rate. This will help reduce the probability of reporting a false positive. And when you do a test and get a p-value close to 0.05, but not quite less than 0.05, don't just add more observations to the data you already have. Instead, use the data you have for a power analysis to determine the correct sample size. This will help prevent you from being fooled by a false positive. Double BAM!

Hooray, we've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs, or a t-shirt, or a hoodie, or just donate. The links are in the description below. Alright, until next time, Quest on!