Is Most Published Research Wrong?

Veritasium
11 Aug 2016 · 12:22

Summary

TL;DR: This script examines a 2011 precognition study published in the 'Journal of Personality and Social Psychology' and uses it to question the reliability of scientific research. It critiques the use of p-values, highlighting how they can lead to false positives and drive the reproducibility crisis in science. The script also touches on the incentives for scientists to publish significant findings, the challenges of replication, and recent efforts to improve research integrity, emphasizing the importance of skepticism and rigorous methodology in scientific inquiry.

Takeaways

  • 🔮 In 2011, a study titled 'Feeling the Future' suggested that people might have the ability to see into the future, but the finding later failed to replicate.
  • 🎰 The study involved predicting images behind curtains, with a hit rate of 53% for erotic images, which was statistically significant with a p-value of .01.
  • 📊 Scientists use p-values to determine the significance of results, with a threshold of .05 commonly used to indicate that results are unlikely to be due to chance.
  • 🧐 However, relying on a p-value of .05 can lead to a significant number of false positives, especially when multiple hypotheses are being tested.
  • 🧪 The 'Reproducibility Project' found that only 36% of 100 psychology studies yielded statistically significant results when replicated, raising questions about the reliability of published research.
  • 🍫 A study supposedly showing that eating chocolate helps with weight loss was later revealed to be a case of p-hacking, having been intentionally designed to produce a false positive result.
  • 🔬 The scientific community has recognized the reproducibility crisis and is taking steps to improve research practices, such as pre-registering studies and reducing publication bias.
  • 📈 Publication bias is a significant issue, as journals are more likely to publish studies with statistically significant results, which can skew the scientific record.
  • 📉 The pressure to publish can lead researchers to focus on novel and unexpected hypotheses, which may not hold up under scrutiny, further contributing to the reproducibility crisis.
  • 🌟 Despite the challenges, science remains the most reliable method for understanding the world, even if it is not infallible.

Q & A

  • What was the title of the article published in the Journal of Personality and Social Psychology in 2011?

    -The title of the article was 'Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect'.

  • What was the main claim of the 2011 study regarding the ability to see into the future?

    -The main claim of the study was that people could potentially see into the future, as indicated by participants having a slightly higher hit rate than chance when predicting which curtain hid an erotic image.

  • What is a p-value in the context of statistical significance?

    -A p-value is a statistical measure that indicates the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. A p-value less than .05 is generally considered significant.

  • Why might a 53% hit rate for erotic images in the study not necessarily mean that people can see into the future?

    -A 53% hit rate does not necessarily mean that people can see into the future: it could still be due to chance, and a p-value of .01, while statistically significant, is not strong enough evidence for such an extraordinary claim without independent replication.

  • What is the issue with using a p-value threshold of .05 for determining statistical significance?

    -Using a p-value threshold of .05 can lead to a significant number of false positives, especially when there are many hypotheses being tested or when the studies are underpowered, biased, or suffer from p-hacking.

  • What is meant by 'p-hacking' in the context of scientific research?

    -P-hacking refers to the practice of manipulating the data analysis, such as selecting or modifying variables, sample sizes, or statistical techniques, to achieve a p-value that meets the threshold for statistical significance.

  • What was the result of the Reproducibility Project that attempted to replicate 100 psychology studies?

    -The Reproducibility Project found that only 36% of the 100 psychology studies they attempted to replicate had statistically significant results the second time around.

  • Why are replication studies important in scientific research?

    -Replication studies are important because they help to verify the validity and reliability of initial findings, ensuring that scientific knowledge is built on solid and reproducible evidence.

  • What are some of the recent changes in scientific practices aimed at addressing the reproducibility crisis?

    -Recent changes include conducting large-scale replication studies, establishing sites like Retraction Watch for withdrawn papers, creating repositories for unpublished negative results, and adopting practices like preregistration of hypotheses and methods to reduce publication bias and p-hacking.

  • What is the significance of the pentaquark example in the context of the reproducibility crisis?

    -The pentaquark example illustrates how even with stringent statistical requirements, false discoveries can occur due to biases in data analysis, emphasizing the importance of blinding and replication in scientific research.
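
As a quick check on the '5-sigma' figure from the last answer, the threshold converts to a tail probability as sketched below (a one-sided convention is assumed here):

```python
from scipy.stats import norm

# One-sided tail probability beyond 5 standard deviations of a normal distribution.
p_5sigma = norm.sf(5)
print(f"p = {p_5sigma:.2e}, i.e. about 1 in {1 / p_5sigma:,.0f}")
# ≈ 2.87e-07, roughly one chance in 3.5 million, matching the figure quoted in the video.
```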

Outlines

00:00

🔮 The Illusion of Precognition in Research

The paragraph discusses a 2011 study published in the 'Journal of Personality and Social Psychology' which suggested that people might have the ability to see into the future. The study involved nine experiments, one of which had participants predict which of two curtains on a computer screen concealed an image. The hit rate for erotic images was slightly higher than chance, leading to a p-value of .01, suggesting a 1% likelihood of the result occurring by luck. This led to a broader discussion on the significance of p-values in determining the validity of scientific findings, questioning whether the common threshold of p < .05 is too lenient and how it might lead to a high proportion of false positives in published research.
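
For readers who want to see the arithmetic, a minimal sketch of a one-sided binomial test follows. The trial count is a made-up placeholder (this summary does not give the study's actual number of trials); the point is only how a hit rate modestly above 50% can cross the significance threshold once the sample is large enough.

```python
from scipy.stats import binomtest

# Illustrative numbers only: the trial count below is a placeholder, not the study's actual sample.
n_trials = 1000      # hypothetical number of erotic-image trials
hits = 530           # a 53% hit rate

# One-sided test of the null hypothesis that the hit probability is 0.5 (no precognition).
result = binomtest(hits, n_trials, p=0.5, alternative="greater")
print(f"hit rate = {hits / n_trials:.0%}, one-sided p-value = {result.pvalue:.3f}")
# With ~1,000 trials, a 53% hit rate gives p ≈ 0.03; larger samples push it lower still.
```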

05:03

🔬 The Reproducibility Crisis in Scientific Research

This paragraph delves into the issue of reproducibility in scientific research, highlighting the potential for a large portion of published studies to be false due to factors such as p-hacking, researcher bias, and the low rate of negative results being published. It discusses a hypothetical scenario where only a small fraction of true relationships are identified, and many false hypotheses are mistakenly validated due to the p-value threshold. The paragraph also references the Reproducibility Project, which attempted to replicate psychology studies with only a 36% success rate, and a study on the pentaquark particle that was later debunked, illustrating the prevalence of false findings even with stringent statistical requirements.
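
The outline's bookkeeping can be reproduced directly from the numbers quoted in the video (1,000 hypotheses, 10% true, 80% power, a p < .05 threshold, and only a handful of published negative results); a minimal sketch:

```python
# Bookkeeping for the video's hypothetical field of research.
hypotheses = 1000
true_frac  = 0.10    # 10% of hypotheses reflect real relationships
power      = 0.80    # chance that a real effect is detected
alpha      = 0.05    # false-positive rate applied to each false hypothesis

true_hyps  = hypotheses * true_frac        # 100 true hypotheses
false_hyps = hypotheses - true_hyps        # 900 false hypotheses

true_positives  = power * true_hyps        # 80 real effects correctly detected
false_positives = alpha * false_hyps       # 45 spurious "effects"
published_negatives = 20                   # the few null results that get published

published = true_positives + false_positives + published_negatives
print(f"false positives among published positive results: "
      f"{false_positives / (true_positives + false_positives):.0%}")                   # 36%
print(f"wrong results among everything published: {false_positives / published:.0%}")  # ~31%
```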

10:03

🌟 Towards Improved Research Practices and Transparency

The final paragraph addresses the ongoing efforts to improve the reliability of scientific research. It mentions large-scale replication studies, the establishment of Retraction Watch, and the creation of online repositories for unpublished negative results. There is a shift towards pre-registering hypotheses and methods to reduce publication bias and encourage higher-powered studies. The paragraph concludes with a reflection on the importance of the scientific method, despite its flaws, as a more reliable way of seeking truth compared to other methods of inquiry.

Keywords

💡Anomalous Retroactive Influences

This term refers to the supposed ability of individuals to be influenced by future events, implying a form of precognition or seeing into the future. In the video, it is mentioned in the context of a study that claimed to provide experimental evidence for such influences, challenging the conventional understanding of time and causality.

💡p-value

A p-value is a statistical measure used to determine the likelihood that a result is due to chance. In the script, it is used to evaluate the significance of the hit rate in the experiment, with a p-value of .01 suggesting a low probability that the observed results were due to luck, thus hinting at a potential ability to perceive the future.

💡Statistical Significance

Statistical significance is a concept used to determine if a result is unlikely to have occurred by chance alone. The video discusses the threshold of p < .05 as a criterion for significance, which is a standard used in many scientific studies to decide whether results are noteworthy.

💡False Positives

False positives occur when a test incorrectly indicates that a particular condition is present. The video script discusses how even with a p-value threshold of .05, a significant number of published results can be false positives, highlighting the limitations of relying solely on statistical significance.
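
To see how quickly false positives accumulate when many outcomes are measured, as in the chocolate study described elsewhere on this page, here is a back-of-the-envelope sketch; it assumes, purely for simplicity, that the 18 outcome measures are independent:

```python
alpha = 0.05
measures = 18   # weight, cholesterol, sodium, sleep quality, ... as in the chocolate study

# Chance that at least one of the measures crosses p < .05 by luck alone,
# under the simplifying assumption that the measures are independent.
p_any_false_positive = 1 - (1 - alpha) ** measures
print(f"{p_any_false_positive:.0%}")   # ≈ 60% chance of at least one spurious "finding"
```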

💡Reproducibility

Reproducibility in scientific research refers to the ability of other researchers to obtain the same results when repeating an experiment. The video emphasizes the importance of reproducibility by discussing the Reproducibility Project, which aimed to verify past psychological studies and found that many did not hold up upon retesting.

💡p-hacking

p-hacking is the practice of manipulating data or statistical analyses to achieve a desired outcome, often to achieve statistical significance. The video describes how researchers might engage in p-hacking by making decisions about data collection and analysis that can inflate the likelihood of obtaining significant results.
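
One p-hacking tactic described in the video is optional stopping: peeking at the data and collecting "just a few more" points until p drops below .05. The simulation below is an illustrative sketch (the sample sizes and peeking schedule are arbitrary choices, not taken from the video) showing that this inflates the false-positive rate even when no effect exists:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
trials, false_positives = 2000, 0

for _ in range(trials):
    data = list(rng.normal(0, 1, 20))          # the null is true: the mean really is 0
    # Peek after every 5 extra observations, stopping as soon as p < .05.
    while True:
        p = ttest_1samp(data, 0).pvalue
        if p < 0.05 or len(data) >= 100:
            break
        data.extend(rng.normal(0, 1, 5))
    false_positives += p < 0.05

print(f"false-positive rate with optional stopping: {false_positives / trials:.1%}")
# Well above the nominal 5%, even though no effect exists in any simulated dataset.
```

Pre-registering the sample size in advance, as described later on this page, removes exactly this degree of freedom.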

💡Publication Bias

Publication bias is the tendency of journals to publish studies with positive results over those with null or negative findings. The video points out that this bias can skew the scientific literature, making it appear that more findings are significant than they actually are.

💡Null Hypothesis

The null hypothesis is a statistical assumption that there is no effect or relationship between variables. In the context of the video, the null hypothesis would be that people cannot see into the future, and any significant results would suggest that this hypothesis is false.

💡Replication Studies

Replication studies are experiments conducted to verify the results of previous research. The video discusses the challenges of publishing replication studies, as they often do not yield significant results and are less likely to be accepted by journals, which can hinder the self-correcting process of science.

💡Statistical Power

Statistical power is the probability that a test will correctly reject a false null hypothesis. The script's hypothetical example assumes 80% statistical power, meaning that well-designed experiments would detect about 80% of the true relationships being investigated.
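
Power can also be estimated by simulation. The effect size and group size below are placeholder values, not figures from the video; the sketch just shows what "80% power" means operationally, namely the fraction of repeated experiments with a real effect in which p < .05:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_group, effect_size, sims = 64, 0.5, 5000   # placeholder design values

significant = 0
for _ in range(sims):
    control   = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(effect_size, 1.0, n_per_group)   # a real effect is present
    if ttest_ind(control, treatment).pvalue < 0.05:
        significant += 1

print(f"estimated power: {significant / sims:.0%}")
# With 64 per group and a standardized effect of 0.5, power comes out near 80%.
```

Underpowered studies miss more true effects and, as the video's calculation shows, worsen the ratio of true to false positives in the published literature.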

💡Research Incentives

Research incentives refer to the motivations and pressures that drive scientists to conduct and publish their research. The video discusses how the pressure to publish and the focus on novel and significant results can lead to practices that increase the likelihood of false positives and hinder the overall reliability of scientific findings.

Highlights

An article in the 'Journal of Personality and Social Psychology' suggests experimental evidence for people being able to see into the future.

Nine experiments were conducted, one involving predicting which of two curtains on a computer screen concealed an image.

The hit rate for erotic images was 53%, slightly higher than the expected 50%, suggesting a potential ability to predict the future.

A p-value of .01 indicates a 1% chance of a hit rate of 53% or higher occurring by luck alone, hinting at the study's significance.

The commonly accepted p-value threshold of .05 for statistical significance was set by Ronald Fisher in 1925.

A simple calculation suggests that nearly a third of published results could be false, even when the research system works as intended, due to factors like the p < .05 threshold and publication bias.

The reproducibility of scientific studies is questioned, with only 36% of psychology studies successfully replicated in the Reproducibility Project.

An attempt to replicate 53 landmark studies in the basic science of cancer succeeded for only 6 of them.

A study claiming chocolate aids weight loss was widely publicized despite being intentionally designed to increase false positives.

The concept of 'p-hacking', where researchers manipulate data analysis to achieve a significant p-value, is discussed.

The pentaquark particle discovery in particle physics, initially supported by multiple experiments, was later debunked as a false positive.

Researchers are incentivized to publish novel and statistically significant results, which can lead to a focus on p-hacking and less emphasis on replication.

Efforts to improve scientific integrity include large-scale replication studies, publicizing retracted papers, and repositories for unpublished results.

The scientific method, despite its flaws, is considered more reliable than any other way of knowing.

The video concludes by acknowledging the challenges in scientific research and the importance of continued efforts to improve reproducibility and integrity.

Transcripts

play00:00

In 2011 an article was published in the reputable "Journal of Personality and

play00:05

Social Psychology". It was called "Feeling the Future: Experimental Evidence for

play00:10

Anomalous Retroactive Influences on Cognition and Affect" or, in other words,

play00:15

proof that people can see into the future. The paper reported on nine

play00:20

experiments. In one, participants were shown two curtains on a computer screen

play00:23

and asked to predict which one had an image behind it, the other just covered a

play00:27

blank wall. Once the participant made their selection the computer randomly

play00:30

positioned an image behind one of the curtains, then the selected curtain was

play00:34

pulled back to show either the image or the blank wall

play00:37

the images were randomly selected from one of three categories: neutral, negative,

play00:42

or erotic. If participants selected the curtain covering the image this was

play00:46

considered a hit. Now with there being two curtains and the image positioned

play00:50

randomly behind one of them, you would expect the hit rate to be about fifty

play00:54

percent. And that is exactly what the researchers found, at least for negative

play00:59

and neutral images;

play01:01

however for erotic images the hit rate was fifty-three percent. Does that mean

play01:05

that we can see into the future? Is that slight deviation significant? Well to

play01:09

assess significance scientists usually turn to p-values, a statistic that tells

play01:13

you how likely a result, at least this extreme, is if the null hypothesis is

play01:17

true. In this case the null hypothesis would just be that people couldn't

play01:21

actually see into the future and the 53-percent result was due to lucky

play01:24

guesses. For this study the p-value was .01 meaning there was just a one-percent

play01:29

chance of getting a hit rate of fifty-three percent or higher from

play01:32

simple luck. p-values less than .05 are generally considered significant

play01:36

and worthy of publication but you might want to use a higher bar before you

play01:40

accept that humans can accurately perceive the future and, say, invite the

play01:44

study's author on your news program; but hey, it's your choice. After all, the .05

play01:49

threshold was arbitrarily selected by Ronald Fisher in a book he published in

play01:54

1925. But this raises the question: how much of the published research literature is

play01:59

actually false? The intuitive answer seems to be five percent. I mean if

play02:03

everyone is using p less than .05 as a cut-off for statistical

play02:06

significance, you would expect five of every hundred results to be false positives

play02:11

but that unfortunately grossly underestimates the problem and here's why.

play02:16

Imagine you're a researcher in a field where there are a thousand hypotheses

play02:20

currently being investigated.

play02:22

Let's assume that ten percent of them reflect true relationships and the rest

play02:25

are false, but no one of course knows which are which, that's the whole point

play02:28

of doing the research. Now, assuming the experiments are pretty well designed,

play02:32

they should correctly identify around say 80 of the hundred true relationships

play02:36

this is known as a statistical power of eighty percent, so 20 results are false

play02:42

negatives, perhaps the sample size was too small or the measurements were not

play02:45

sensitive enough. Now consider that of those 900 false hypotheses, using a

play02:50

p-value of .05, forty-five false hypotheses will be incorrectly

play02:55

considered true. As for the rest, they will be correctly identified as false

play02:59

but most journals rarely publish null results: they make up just ten to thirty

play03:03

percent of papers depending on the field, which means that the papers that

play03:07

eventually get published will include 80 true positive results,

play03:10

45 false positive results and maybe 20 true negative results.

play03:15

Nearly a third of published results will be wrong

play03:18

even with the system working normally. Things get even worse if studies are

play03:22

underpowered, and analysis shows they typically are, if there is a higher ratio

play03:26

of false-to-true hypotheses being tested or if the researchers are biased.

play03:32

All of this was pointed out in a 2005 paper entitled "Why Most Published Research Findings Are False".

play03:37

So, recently, researchers in a number of fields have attempted to

play03:40

quantify the problem by replicating some prominent past results.

play03:44

The Reproducibility Project repeated a hundred psychology studies but found only

play03:48

thirty-six percent had a statistically significant result the second time

play03:52

around, and the strength of the measured relationships was on average half that

play03:56

of the original studies. An attempted verification of 53 studies considered

play03:59

landmarks in the basic science of cancer only managed to reproduce six, even

play04:04

working closely with the original studies' authors. These results are even

play04:08

worse than I just calculated. The reason for this is nicely illustrated by a 2015

play04:13

study showing that eating a bar of chocolate every day can help you lose

play04:16

weight faster. In this case the participants were randomly allocated to

play04:20

one of three treatment groups:

play04:22

one went on a low-carb diet, another one on the same low carb diet plus a 1.5 ounce

play04:26

bar of chocolate per day and the third group was the control, instructed

play04:30

just to maintain their regular eating habits. At the end of three weeks the

play04:33

control group had neither lost nor gained weight but both low carb groups

play04:37

had lost an average of five pounds per person

play04:40

The group that ate chocolate, however, lost weight ten percent faster than the

play04:44

non-chocolate eaters. The finding was statistically significant with a p-value less than .05.

play04:50

As you might expect this news spread like wildfire, to the

play04:53

front page of Bild, the most widely circulated daily newspaper in Europe

play04:57

and into the Daily Star, the Irish Examiner, to Huffington Post and even Shape Magazine.

play05:02

Unfortunately, the whole thing had been faked, kind of. I mean researchers did

play05:07

perform the experiment exactly as they described, but they intentionally

play05:11

designed it to increase the likelihood of false positives: the sample size was

play05:15

incredibly small, just five people per treatment group, and for each person 18

play05:20

different measurements were tracked including: weight, cholesterol, sodium,

play05:24

blood protein levels, sleep quality, well-being, and so on; so if weight loss

play05:29

didn't show a significant difference there were plenty of other factors that

play05:32

might have. So the headline could have been "chocolate lowers cholesterol" or

play05:36

"increases sleep quality" or... something.

play05:39

The point is: a p-value is only really valid for a single measure

play05:43

once you're comparing a whole slew of variables the probability that at least

play05:46

one of them gives you a false positive goes way up, and this is known as "p-hacking".

play05:51

Researchers can make a lot of decisions about their analysis that can

play05:54

decrease the p-value, for example let's say you analyze your data and you find

play05:58

it nearly reaches statistical significance, so you decide to collect

play06:01

just a few more data points to be sure

play06:03

then if the p-value drops below .05 you stop collecting data, confident that

play06:08

these additional data points could only have made the result more significant if

play06:11

there were really a true relationship there, but numerical simulations show

play06:15

that relationships can cross the significance threshold by adding more

play06:19

data points even though a much larger sample would show that there really is

play06:23

no relationship. In fact, there are a great number of ways to increase the

play06:27

likelihood of significant results like: having two dependent variables, adding

play06:31

more observations, controlling for gender, or dropping one of three conditions.

play06:36

Combining all three of these strategies together increases the

play06:39

likelihood of a false-positive to over sixty percent, and that is using p less than .05

play06:45

Now if you think this is just a problem for psychology

play06:47

neuroscience or medicine, consider the pentaquark, an exotic particle made

play06:52

up of five quarks, as opposed to the regular three for protons or neutrons.

play06:56

Particle physics employs particularly stringent requirements for statistical

play07:00

significance referred to as 5-sigma or one chance in 3.5 million of getting a

play07:05

false positive, but in 2002 a Japanese experiment found evidence for the

play07:09

Theta-plus pentaquark, and in the two years that followed 11 other independent

play07:13

experiments then looked for and found evidence of that same pentaquark with

play07:17

very high levels of statistical significance. From July 2003 to

play07:22

May 2004 a theoretical paper on pentaquarks was published on average every

play07:26

other day. But alas, it was a false discovery, for further experimental

play07:31

attempts to confirm that theta-plus pentaquark using greater statistical

play07:34

power failed to find any trace of its existence.

play07:37

The problem was those first scientists weren't blind to the data, they knew how

play07:41

the numbers were generated and what answer they expected to get, and the way

play07:45

the data was cut and analyzed, or p-hacked, produced the false finding.

play07:50

Now most scientists aren't p-hacking maliciously, there are legitimate decisions to be

play07:54

made about how to collect, analyze and report data, and these decisions impact

play07:58

on the statistical significance of results. For example, 29 different

play08:02

research groups were given the same data and asked to determine if dark-skinned

play08:05

soccer players are more likely to be given red cards; using identical data

play08:10

some groups found there was no significant effect while others

play08:13

concluded dark-skinned players were three times as likely to receive a red card.

play08:18

The point is that data doesn't speak for itself, it must be interpreted.

play08:22

Looking at those results

play08:23

it seems that dark skinned players are more likely to get red carded but

play08:26

certainly not three times as likely; consensus helps in this case but

play08:31

for most results only one research group provides the analysis and therein lies

play08:35

the problem of incentives: scientists have huge incentives to publish papers,

play08:40

in fact their careers depend on it; as one scientist Brian Nosek puts it:

play08:44

"There is no cost to getting things wrong, the cost is not getting them published".

play08:49

Journals are far more likely to publish

play08:51

results that reach statistical significance so if a method of data

play08:54

analysis results in a p-value less than .05 then you're likely to go with

play08:58

that method. Publication is also more likely if the result is novel and

play09:02

unexpected, this encourages researchers to investigate more and more unlikely

play09:05

hypotheses which further decreases the ratio of true to spurious relationships

play09:10

that are tested; now what about replication? Isn't science meant to

play09:14

self-correct by having other scientists replicate the findings of an initial

play09:18

discovery? In theory yes but in practice it's more complicated, like take the

play09:22

precognition study from the start of this video: three researchers attempted

play09:26

to replicate one of those experiments, and what did they find?

play09:29

well, surprise surprise, the hit rate they obtained was not significantly different

play09:32

from chance. When they tried to publish their findings in the same journal as

play09:36

the original paper they were rejected. The reason? The journal refuses to

play09:41

publish replication studies. So if you're a scientist the successful strategy is

play09:46

clear: don't even attempt replication studies, because few journals will

play09:49

publish them, and there is a very good chance that your results won't be

play09:53

statistically significant anyway, in which case, instead of being able to

play09:57

convince colleagues of the lack of reproducibility of an effect you will be

play10:01

accused of just not doing it right.

play10:03

So a far better approach is to test novel and unexpected hypotheses and then

play10:08

p-hack your way to a statistically significant result. Now I don't want to

play10:13

be too cynical about this because over the past 10 years things have started

play10:16

changing for the better.

play10:17

Many scientists acknowledge the problems I've outlined and are starting to take

play10:21

steps to correct them: there are more large-scale replication studies

play10:25

undertaken in the last 10 years, plus there's a site, Retraction Watch,

play10:28

dedicated to publicizing papers that have been withdrawn, there are online

play10:32

repositories for unpublished negative results and there is a move towards

play10:37

submitting hypotheses and methods for peer review before conducting

play10:40

experiments with the guarantee that research will be published regardless of

play10:43

results so long as the procedure is followed. This eliminates publication

play10:48

bias, promotes higher powered studies and lessens the incentive for p-hacking.

play10:53

The thing I find most striking about the reproducibility crisis in science is not

play10:57

the prevalence of incorrect information in published scientific journals

play11:01

after all, getting to the truth, we know, is hard, and mathematically not everything that

play11:06

is published can be correct.

play11:08

What gets me is the thought that even trying our best to figure out what's

play11:11

true, using our most sophisticated and rigorous mathematical tools: peer review,

play11:16

and the standards of practice, we still get it wrong so often; so how frequently

play11:20

do we delude ourselves when we're not using the scientific method? As flawed as

play11:26

our science may be, it is far and away more reliable than any other way of knowing

play11:31

that we have.

play11:37

This episode of Veritasium was supported in part by these fine

play11:40

people on Patreon and by Audible.com, the leading provider of audiobooks online

play11:45

with hundreds of thousands of titles in all areas of literature including:

play11:48

fiction, nonfiction and periodicals, Audible offers a free 30-day trial to

play11:53

anyone who watches this channel, just go to audible.com/veritasium so they know

play11:57

I sent you. A book I'd recommend is called "The Invention of Nature" by Andrea Wulf,

play12:02

which is a biography of Alexander von Humboldt, an adventurer and naturalist

play12:07

who actually inspired Darwin to board the Beagle; you can download that

play12:11

book or any other of your choosing for a one month free trial at audible.com/veritasium

play12:16

So as always I want to thank Audible for supporting me, and I really

play12:18

want to thank you for watching.


Related Tags
Scientific Validity, Research Reproducibility, P-Value Significance, False Positives, Statistical Analysis, Publication Bias, Replication Crisis, Scientific Method, Data Interpretation, Research Ethics