p-hacking: What it is and how to avoid it!

StatQuest with Josh Starmer
3 May 2020 · 13:44

Summary

TL;DR: In this StatQuest episode, Josh Starmer explains p-hacking, a practice where researchers manipulate data analysis to achieve statistically significant results, leading to false positives. He uses the example of testing drugs for virus recovery to illustrate how p-hacking can occur. The video warns against cherry-picking data and emphasizes the importance of proper sample size determination through power analysis to avoid false positives. It also introduces methods like the false discovery rate to compensate for the multiple testing problem.

Takeaways

  • πŸ”¬ P-hacking refers to the misuse of statistical analyses to produce false positives, which can lead to incorrect conclusions in research.
  • πŸ“Š In the context of drug testing, P-hacking can occur when researchers continue to test different drugs until they find one that appears to work, based on a statistically significant p-value.
  • πŸ“ˆ The script illustrates the concept using a normal distribution to show how recovery times from a virus can be analyzed, and how selecting specific data points can lead to misleading results.
  • πŸ€” The importance of not cherry-picking data is emphasized; researchers should not test and report only the data that supports their hypotheses.
  • πŸ“‰ The script explains the multiple testing problem: every test run at the 0.05 threshold carries its own 5% chance of a false positive when no real effect exists, so conducting many tests makes false positives increasingly likely (the short simulation after this list illustrates the effect).
  • πŸ›  To combat P-hacking, the script suggests using methods like the false discovery rate, which adjusts p-values to account for multiple comparisons and reduces the chance of false positives.
  • πŸ”Ž The concept of power analysis is introduced as a way to determine the appropriate sample size before conducting an experiment, which can help prevent P-hacking by ensuring sufficient data to detect true effects.
  • 🚫 The script warns against the temptation to add more data to a study after observing a p-value close to the significance threshold, as this can increase the risk of false positives.
  • πŸ“ The necessity of including all p-values from all tests in any adjustment method is highlighted to ensure the validity of the statistical analysis.
  • πŸŽ“ The video concludes with a call to action for viewers to learn more about statistical methods to avoid P-hacking, such as through further StatQuest videos on power analysis.
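The multiple testing problem mentioned in the takeaways above is easy to demonstrate by simulation. The sketch below (a minimal Python example; the mean of 10 days and standard deviation of 2.5 days are assumptions chosen to roughly match the video's recovery-time curve, where 95% of people recover in 5 to 15 days) repeatedly compares two groups of three drawn from the same distribution:

```python
# Minimal simulation of the multiple testing problem: both groups always
# come from the SAME normal distribution, yet roughly 5% of t-tests still
# come back "significant" at the 0.05 threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, alpha = 10_000, 0.05
false_positives = 0

for _ in range(n_tests):
    # Two groups of 3 "recovery times" (assumed mean 10 days, sd 2.5 days)
    group_a = rng.normal(loc=10, scale=2.5, size=3)
    group_b = rng.normal(loc=10, scale=2.5, size=3)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < alpha:
        false_positives += 1  # "significant" despite no real difference

print(f"False positives: {false_positives}/{n_tests} "
      f"({false_positives / n_tests:.1%})")  # expect roughly 5%
```

Run enough tests and false positives are guaranteed to turn up, which is exactly why reporting only the one "significant" comparison is p-hacking.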

Q & A

  • What is P-hacking?

    -P-hacking refers to the misuse and abuse of analysis techniques, which can lead to being fooled by false positives.

  • Why is P-hacking a problem in statistical analysis?

    -P-hacking is a problem because it can lead to false positives, which means incorrectly rejecting the null hypothesis when there is no actual effect.

  • What is the significance level typically used in statistical tests?

    -The significance level typically used in statistical tests is 0.05, which means accepting a 5% chance of a false positive when the null hypothesis is true.

  • What is the multiple testing problem?

    -The multiple testing problem arises when many statistical tests are performed: the more tests you run, the more likely it is that at least some of them return false positives purely by chance.

  • How can the false discovery rate help address the multiple testing problem?

    -The false discovery rate adjusts p-values to account for multiple comparisons, usually producing larger adjusted p-values and thereby reducing the number of false positives (a short code sketch after this Q&A shows the adjustment in practice).

  • What is a power analysis and why is it important?

    -A power analysis is performed before an experiment to determine the appropriate sample size needed to have a high probability of correctly rejecting the null hypothesis.

  • Why is it incorrect to add more data to a test with a p-value close to 0.05?

    -Adding more data to a test with a p-value close to 0.05 can greatly increase the chance of a false positive, because the 5% false-positive rate assumes that only a single p-value is calculated to make a decision; computing a second p-value after peeking at the first breaks that assumption.

  • What should one do when they get a p-value close to but not less than 0.05?

    -When a p-value is close to but not less than 0.05, one should use the existing data in a power analysis to determine the correct sample size for a new experiment, rather than simply adding more data to the existing test.

  • What is the role of the null hypothesis in the context of P-hacking?

    -In the context of P-hacking, the null hypothesis is that there is no difference between groups or conditions. P-hacking can lead to the incorrect rejection of this null hypothesis due to false positives.

  • How can one avoid P-hacking in their statistical analyses?

    -To avoid P-hacking, one should calculate p-values for all tests, adjust for multiple comparisons using methods like the false discovery rate, and conduct power analyses to determine appropriate sample sizes.
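To make the adjustment step concrete, here is a hedged sketch of the Benjamini-Hochberg false discovery rate correction using statsmodels. The p-values are invented for illustration (a few echo values quoted from the video); the key point is that every test's p-value goes in, not just the promising one:

```python
# Adjusting a full set of p-values with the Benjamini-Hochberg FDR procedure.
from statsmodels.stats.multitest import multipletests

# p-values from ALL tests, including the unpromising ones (values invented).
p_values = [0.02, 0.86, 0.63, 0.06, 0.44, 0.91, 0.11, 0.73]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.2f} -> adjusted p = {p_adj:.2f}, significant: {sig}")
# The raw 0.02 that looked significant on its own is adjusted to 0.16 here,
# so it is no longer significant once all eight tests are accounted for.
```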

Outlines

00:00

πŸ§ͺ Understanding P-Hacking

Josh, the host of StatQuest, introduces the concept of P-hacking, which is the misuse of statistical analysis that can lead to false positives. The video uses the example of testing drugs to reduce recovery time from a virus. Initially, Drugs A and B show no significant effect, but Drug E appears promising with a p-value of 0.02, leading to the rejection of the null hypothesis. However, this is a P-hacking scenario because the process of testing multiple drugs and selecting the one with the lowest p-value increases the likelihood of a false positive.

05:04

πŸ” The Multiple Testing Problem

The video delves into the multiple testing problem, illustrating how testing many samples can lead to false positives. It explains the concept using a normal distribution curve representing recovery times from a virus. By repeatedly drawing small groups of three people, the likelihood of eventually getting a p-value that suggests a significant difference increases, even though all the samples come from the same distribution. The video emphasizes the importance of not cherry-picking data and of using methods like the false discovery rate to adjust p-values and reduce false positives.

10:05

πŸ’Š Avoiding P-Hacking in Practice

The final paragraph discusses how to avoid P-hacking in practical scenarios. It warns against adding more data to a test just because the initial p-value is close to the significance threshold, as this can lead to false positives. The video suggests conducting a power analysis before the experiment to determine the appropriate sample size, which helps in avoiding being misled by false positives. It concludes by encouraging viewers to use all p-values for adjustments and to plan experiments carefully to prevent P-hacking.
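For readers who want to see what the recommended power analysis looks like in practice, here is a minimal sketch using statsmodels. The effect size, significance level, and desired power are assumed inputs (the video does not specify numbers); the output is the per-group sample size to collect before running the experiment:

```python
# Sample-size calculation for a two-sample t-test, done BEFORE the experiment.
from statsmodels.stats.power import TTestIndPower

effect_size = 0.8  # assumed standardized difference (Cohen's d) worth detecting
alpha = 0.05       # significance threshold
power = 0.8        # desired probability of correctly rejecting the null

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Required sample size per group: {n_per_group:.1f}")  # about 26 people
```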

Keywords

πŸ’‘P-hacking

P-hacking refers to the practice of manipulating statistical analyses or selectively reporting results in a way that leads to misleading conclusions, often by capitalizing on chance results. In the video, it is discussed as a misuse of analysis techniques that can result in false positives. The example given is testing multiple drugs for their effectiveness in reducing recovery time from a virus, where the temptation to continue testing until a significant result is found exemplifies p-hacking.

πŸ’‘P-value

A p-value is a statistical measure used to determine the likelihood that an observed result could have occurred by chance alone. It plays a central role in hypothesis testing. In the video, p-values are used to compare the means of different groups, and a p-value of less than 0.05 is typically used to reject the null hypothesis. The video warns against the misuse of p-values in p-hacking.
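As a concrete (and entirely made-up) example of the comparison the video describes, the sketch below computes a p-value for two small groups of recovery times with a two-sample t-test:

```python
# Toy two-group comparison: recovery times in days (numbers invented).
from scipy import stats

no_drug = [11.2, 9.8, 12.5]  # hypothetical recovery times without medicine
drug_e = [6.1, 7.4, 5.9]     # hypothetical recovery times with a drug

t_stat, p_value = stats.ttest_ind(no_drug, drug_e)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 for this toy data
if p_value < 0.05:
    print("Reject the null hypothesis of no difference between the groups.")
```

The danger the video warns about is not this calculation itself, but running it many times and keeping only the result you like.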

πŸ’‘Null Hypothesis

The null hypothesis is a statistical assumption that there is no effect or relationship between variables being tested. It serves as a baseline for comparison in hypothesis testing. The video explains that when a p-value is less than 0.05, the null hypothesis is rejected, suggesting a significant difference between groups, such as the effectiveness of a drug.

πŸ’‘False Positive

A false positive occurs when a test incorrectly indicates that a relationship or effect exists when it does not. The video discusses how p-hacking can lead to false positives, where a statistically significant result is obtained when in fact there is no real effect, as illustrated by the example of finding a significant result after repeatedly testing different drugs.

πŸ’‘Multiple Testing Problem

The multiple testing problem arises when conducting many statistical tests, increasing the likelihood of obtaining false positives. The video explains that as the number of tests increases, so does the chance of finding at least one significant result by chance alone, which is a key issue in p-hacking.

πŸ’‘False Discovery Rate (FDR)

The false discovery rate is a statistical method used to control the expected proportion of false positives among the rejected null hypotheses. The video suggests using FDR to adjust p-values, which can help reduce the number of false positives when conducting multiple tests, as it adjusts for the increased likelihood of Type I errors.

πŸ’‘Power Analysis

A power analysis is a statistical method used to determine the appropriate sample size for an experiment to ensure that it has a high probability of detecting an effect if one truly exists. The video emphasizes the importance of conducting a power analysis before an experiment to avoid the temptation to add more data post hoc, which can lead to p-hacking.

πŸ’‘Sample Size

Sample size refers to the number of observations or subjects included in a study. The video discusses how determining the correct sample size before conducting an experiment is crucial to avoid p-hacking. A proper sample size ensures that the study has enough power to detect a true effect, reducing the need to add more data after observing non-significant results.

πŸ’‘Statistical Significance

Statistical significance describes a result that would be unlikely to occur by chance alone if the null hypothesis were true. In the video, statistical significance is used as a criterion for determining whether a drug is effective, with a p-value less than 0.05 indicating significance. However, the video warns that the pursuit of statistical significance can lead to p-hacking if not properly managed.

πŸ’‘Type I Error

A Type I error occurs when a true null hypothesis is incorrectly rejected, leading to a false positive result. The video discusses the risk of Type I errors in the context of p-hacking: with a significance threshold of 0.05, approximately 5% of tests conducted on data from the same distribution will falsely indicate a significant difference.

Highlights

P-hacking is the misuse and abuse of analysis techniques that can lead to false positives.

The video explains how to avoid P-hacking by understanding its implications.

An example of drug testing illustrates the concept of P-hacking.

Drug A and Drug B examples show how p-values can be misleading if not analyzed properly.

Drug E appears to reduce recovery time, leading to a p-value of 0.02, which is considered significant.

The concept of false positives is introduced, explaining that a 5% significance threshold produces about 5% false positives among tests where no real difference exists.

The multiple testing problem is discussed as a source of false positives when many tests are conducted.

The false discovery rate is introduced as a method to compensate for the multiple testing problem.

The importance of not cherry-picking data and including all p-values for proper analysis is emphasized.

A scenario demonstrates how adding more data to a borderline-significant p-value can lead to P-hacking.

Power analysis is mentioned as a method to determine the correct sample size before conducting an experiment.

The video concludes with a summary of how to avoid P-hacking and the importance of proper statistical analysis.

The presenter encourages viewers to subscribe and to support the channel through Patreon and other means.

The video promises to cover power and power analyses in future episodes to help determine appropriate sample sizes.

The concept of P-hacking is compared to a game of chance: the more tests conducted, the higher the chance of false positives.

A practical example of P-hacking is given, where additional data collection leads to a statistically significant but incorrect conclusion (simulated in the sketch below).
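The "adding more data" trap in that last highlight can also be checked by simulation. The sketch below (assumptions mine: same recovery-time distribution as before, with a "borderline" first p-value taken to mean anything in (0.05, 0.10]) estimates how often adding one measurement per group turns a borderline result into a false positive:

```python
# Optional stopping: re-testing after adding data when the first p-value is
# borderline yields false positives far more often than the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, alpha = 100_000, 0.05
borderline, flipped = 0, 0

for _ in range(n_sims):
    a = rng.normal(10, 2.5, size=3)
    b = rng.normal(10, 2.5, size=3)  # same distribution: any "hit" is false
    _, p1 = stats.ttest_ind(a, b)
    if 0.05 < p1 <= 0.10:            # tempting, borderline first result
        borderline += 1
        a2 = np.append(a, rng.normal(10, 2.5))  # one more person per group
        b2 = np.append(b, rng.normal(10, 2.5))
        _, p2 = stats.ttest_ind(a2, b2)
        if p2 < alpha:
            flipped += 1             # borderline result became "significant"

print(f"Borderline first tests: {borderline}")
print(f"Fraction pushed below 0.05 by one extra point per group: "
      f"{flipped / borderline:.1%}")  # typically well above 5%
```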

Transcripts

00:00

P-hacking! Don't do it! If you do it, it's a shame! StatQuest! Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about p-hacking: what it is and how to avoid it. Note: this StatQuest assumes that you are already familiar with p-values. If not, check out the Quests.

00:28

Imagine there was a virus and we wanted to develop a drug to reduce the time it took to recover from it. So we created a bunch of candidate drugs, and we tested each one to find out if any of them worked. We measured how long it took for three people to recover from the virus without any drugs, and then we gave three people Drug A and measured how long it took them to recover. Just by looking at the graph, it appears that Drug A did not shorten recovery very much, if at all. So we measure how long it takes three more people to recover without any medicine, and then we give three people Drug B and measure how long it takes them to recover. Again, just by looking at the graph, it doesn't look like Drug B helped people recover any faster than people without any medicine. So we just keep testing drugs until we get one that really looks like it does a good job.

01:37

And at long last, it looks like Drug E does a great job reducing the amount of time it takes to recover from the virus. So we calculate the means of the two groups and do a statistical test to compare the means, and we get a p-value equal to 0.02. And since 0.02 is less than 0.05, we reject the null hypothesis, which is that there is no difference between not taking a drug and taking Drug E. BAM! No... no BAM. We just p-hacked. P-hacking refers to the misuse and abuse of analysis techniques and results in being fooled by false positives. However, instead of feeling great shame, let's learn about p-hacking so we don't do it again.

02:38

Imagine we measured recovery times for a whole lot of people who did not take any drugs to fight the virus, and then we fit a normal distribution to all of the recovery times. The red area under the curve indicates the percentage of people that recovered from the illness within a range of possible values. For example, 2.5 percent of the area under the curve is for durations less than 5 days, indicating that 2.5 percent of the people recovered in less than 5 days. In contrast, 95 percent of the area under the curve is between 5 and 15 days, indicating that 95 percent of the people recovered between 5 and 15 days.

03:28

Now, if we only asked three people, represented by light blue circles, how long it took them to recover from the illness, there's a good chance all three would say something between 5 and 15 days. And if we asked a different set of three people, represented by dark blue circles, there's a good chance all three would also say something between 5 and 15 days. Just like before, we can plot these two groups of people on a graph, calculate the mean values for the two groups, compare those two means, and get a p-value equal to 0.86. And because 0.86 is greater than 0.05, we would fail to see a significant difference between the two groups of observations. In other words, the p-value did not convince us that the observations came from two different distributions, and that makes sense, because both groups of people came from the exact same distribution.

04:40

Now imagine we asked another group of three people how long it took them to recover, and we plotted their recovery times and mean value on a graph. Then we asked another group of three people how long it took them to recover, and we plotted their recovery times and mean value on the graph. Again, we do a test to compare the two means, and we get a p-value equal to 0.63. And since 0.63 is greater than 0.05, we would fail to see a significant difference between the two groups of observations. And again, this is good, because both sets of observations came from the exact same distribution.

05:29

Now imagine we just keep taking two groups of three from the same distribution and testing to see if they are different. Note: these two groups almost look like they could be different, but the p-value equals 0.06, which is greater than the standard threshold for significance, 0.05. So we just keep going, and sooner or later we will get something like this: when we compare the two means, the p-value equals 0.02, and that tells us that there is a statistically significant difference between the two groups, suggesting that the data came from two different distributions. Which is incorrect, since we know that both samples came from the same distribution. The small p-value is a false positive.

06:24

Note: you may remember from the StatQuest on interpreting p-values that setting the threshold for significance to 0.05 means that approximately 5 percent of the statistical tests we do on data gathered from the same distribution will result in false positives. That means if we did 100 tests, we would expect about 5 false positives, or 5 percent, and if we did 10,000 tests, we would expect about 500 false positives. In other words, the more tests we do, the more false positives we have to deal with. Oh no, it's the dreaded Terminology Alert! Doing a lot of tests and ending up with false positives is called the multiple testing problem.

07:21

The good news is that there are many ways to compensate for the multiple testing problem and reduce the number of false positives. One popular method is called the false discovery rate. (Shameless self-promotion: I have a whole StatQuest on the false discovery rate; the link is in the description below.) The main idea is that you input the p-values for every single comparison (boop, boop, boop), and then the false discovery rate does some surprisingly simple mathematics, and out come adjusted p-values that are usually larger than the original p-values. Ultimately, some of the tests that were false positives before end up with adjusted p-values greater than 0.05. Like I said, I have a whole StatQuest on this method if you want to know more details. The important thing to know now, however, is that in order for false discovery rates, or any other method that compensates for multiple testing, to work properly, you have to include all of the p-values for all of the tests, not just the one that looks like it will give you a small p-value. In other words, don't cherry-pick your data and only do tests that look good. BAM!

08:53

Now let's talk about a slightly less obvious form of p-hacking. Remember these two groups? The p-value was 0.06. Now, we know that both groups came from the same distribution, but typically, when we are doing experiments, we don't know if they both came from the same distribution or from different ones. And let's be honest: we usually hope that the observations come from two different distributions. In this example, we are looking for a new drug to help people, so we want to see an improvement. So when we get data like this, where the p-value is close to 0.05 but not less than it, it is very tempting to think, "Hmm, I wonder if the p-value will get smaller if I add more data." So we add one more measurement to each group, and now, when we calculate the p-value, we get 0.02, which is less than 0.05, so we can report a statistically significant difference. Hooray, we got what we wanted, right? No! We p-hacked again. Wah wah.

10:18

When a p-value is close to 0.05, like what we had with the original data, there's a surprisingly high probability that just adding one new measurement to both groups will result in a false positive. In other words, even though using a threshold of 0.05 should only result in 5 percent of the bogus tests giving us false positives, the theory assumes that we only calculate a single p-value to make a decision. In this case, we calculated two p-values to make our decision: the one at the start, which was 0.06, and then, because the first p-value was close to 0.05, we added more data and calculated a second p-value. In this case, we know all of the measurements came from the exact same distribution, so we know this is a false positive.

11:20

So how do we keep from making this mistake? In order to avoid it, we need to determine the proper sample size before doing the experiment, and that means we need to do a power analysis. A power analysis is performed before doing an experiment and tells us how many replicates we need in order to have a relatively high probability of correctly rejecting the null hypothesis. Cool! Where can I learn more about doing a power analysis? In the next StatQuests, we'll talk about power and power analyses to determine the appropriate sample size. BAM!

12:13

In summary, if you have a bunch of things you want to test out, like a bunch of different drugs that might help people recover from a virus, don't just collect all the data and only calculate a p-value for the one time things look different. Instead, calculate a p-value for each test and adjust all of the p-values with something like the false discovery rate. This will help reduce the probability of reporting a false positive. And when you do a test and get a p-value close to 0.05 but not quite less than 0.05, don't just add more observations to the data you already have. Instead, use the data you have for a power analysis to determine the correct sample size. This will help prevent you from being fooled by a false positive. DOUBLE BAM!

13:17

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs, or a t-shirt or a hoodie, or just donate. The links are in the description below. All right, until next time, Quest on!


Related Tags
P-hacking, Statistical Analysis, False Positives, Data Science, Research Methods, Drug Testing, Statistical Significance, Multiple Testing, Power Analysis, Data Interpretation