STAT115 Chapter 5.3 Multiple Hypotheses Testing and False Discovery Rate

Xiaole Shirley Liu
7 Feb 2020
26:26

Summary

TLDR: The transcript discusses the challenge of multiple hypothesis testing in high-throughput experiments, particularly when analyzing differential gene expression across 20,000 genes. It emphasizes the importance of controlling for false positives using techniques like the Bonferroni correction and False Discovery Rate (FDR). The speaker explains how FDR provides a less conservative approach compared to family-wise error rate, making it useful in identifying genuinely differentially expressed genes. The discussion also covers strategies for reducing noise in data and improving the accuracy of differential gene expression analysis.

Takeaways

  • 🧬 High-throughput experiments can involve testing differential expression across thousands of genes, necessitating a method to determine which genes are significantly different.
  • 📉 Relying solely on p-values for multiple hypothesis testing can lead to false positives due to the large number of tests, even if no genes are truly differentially expressed.
  • 🎯 The Bonferroni correction is a conservative approach to control for the family-wise error rate, adjusting the p-value threshold to avoid any false positives, but it may be too strict.
  • 🔍 False Discovery Rate (FDR) offers a less conservative approach than the family-wise error rate, allowing a small percentage of false positives while maintaining overall accuracy.
  • 📊 FDR is calculated as the proportion of false positives among all the genes called significant, which helps in determining an acceptable balance between true and false positives.
  • 🧪 The FDR-adjusted value for a gene is never smaller than its raw p-value, because the adjustment accounts for the false positives expected when many tests are run.
  • 🚫 Noisy data can lead to a uniform p-value distribution, indicating that no genes are significantly differentially expressed; reducing the number of hypotheses can help mitigate this.
  • 🔬 In RNA-seq experiments, genes with low expression levels in too few samples can be filtered out to reduce noise and improve the signal-to-noise ratio.
  • 💡 The key to successful differential expression analysis is to correctly balance the stringency of statistical tests to avoid too many or too few significant results.
  • 📚 The concept of multiple hypothesis testing has become increasingly important with the advent of high-throughput sequencing, emphasizing the need for statistical rigor in biological research.

Q & A

  • What is the primary concern when testing the expression of a large number of genes in a high-throughput experiment?

    -The primary concern is controlling for multiple hypothesis testing, as testing a large number of genes (e.g., 20,000) can lead to a significant number of false positives due to random chance.

  • Why can't a simple p-value cutoff be used to identify differentially expressed genes in large-scale experiments?

    -Using a simple p-value cutoff, such as 0.01, can lead to many false positives in large-scale experiments because, with a large number of tests, some genes will appear to be differentially expressed purely by chance.

  • What is the family-wise error rate (FWER) and how is it controlled?

    -The family-wise error rate is the probability of making even one false positive call among all the hypotheses tested. It is controlled by using methods like the Bonferroni correction, which adjusts the p-value cutoff by dividing it by the number of tests performed.

  • Why might the Bonferroni correction be too conservative in some cases?

    -The Bonferroni correction can be too conservative, especially when the data is noisy or there are few replicates, making it difficult to detect any differentially expressed genes because the p-value cutoff becomes very stringent.

  • What is the false discovery rate (FDR) and why is it preferred over the family-wise error rate in some experiments?

    -The false discovery rate (FDR) is the expected proportion of false positives among all the genes called differentially expressed. It is preferred over the family-wise error rate in some experiments because it allows for some false positives, which is often acceptable in large-scale studies where controlling the FWER would be too stringent.

  • How is the FDR estimated in practice?

    -FDR is estimated by looking at the distribution of p-values across all tests. The Benjamini-Hochberg method is a commonly used approach that estimates FDR by extrapolating the level of noise (false positives) from the right side of the p-value distribution and comparing it to the signal (true positives).

  • What is the significance of the p-value distribution in determining the quality of a statistical test in gene expression analysis?

    -The p-value distribution helps determine the quality of a statistical test. A uniform distribution suggests no true signal, while a distribution skewed toward smaller p-values indicates the presence of differentially expressed genes. A flat right side is the expected contribution of the unchanged (null) genes; when the data are well normalized and the test is appropriate, the histogram shows this flat background plus a peak near zero.

  • What does it mean if the p-value distribution is uniform in an RNA-seq experiment?

    -If the p-value distribution is uniform, it suggests that the data is too noisy and that there are no significant differentially expressed genes in the experiment.

  • How can the number of hypotheses tested in an RNA-seq experiment be reduced to improve the signal-to-noise ratio?

    -The number of hypotheses tested can be reduced by filtering out genes with low expression levels in too few samples, as these genes are unlikely to reach statistical significance. This reduces the noise and increases the likelihood of detecting true signals.

  • What role do fold change and FDR play in reporting differentially expressed genes?

    -Fold change and FDR are often used together in reporting differentially expressed genes. FDR helps control for false positives, while filtering by fold change ensures that only genes with a biologically meaningful change in expression are reported. This approach provides a more accurate and relevant list of differentially expressed genes.

Outlines

00:00

🔬 Multiple Hypothesis Testing in Genomic Studies

The paragraph discusses the challenge of identifying differentially expressed genes in high-throughput experiments, specifically in genome-wide studies with about 20,000 genes. The dilemma lies in determining how many genes are significantly different between two conditions, using p-values derived from statistical tests like the T-distribution or negative binomial distribution. The issue of false positives arises when applying a standard p-value cutoff across many tests, potentially leading to the erroneous identification of genes as differentially expressed.
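The scale of the false-positive problem can be checked with a small simulation. This is a minimal sketch, assuming that none of the 20,000 genes is truly differentially expressed, so that every p-value is a uniform draw on [0, 1]. The function name and the fixed seed are illustrative choices, not part of the lecture.

```python
import random

def count_false_positives(n_genes=20_000, cutoff=0.01, seed=0):
    """Count null genes whose p-value falls below the cutoff by chance."""
    rng = random.Random(seed)
    # Under the null hypothesis a p-value is uniformly distributed on [0, 1].
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < cutoff for p in p_values)

expected = 20_000 * 0.01          # about 200 genes expected by chance alone
observed = count_false_positives()
print(f"expected about {expected:.0f}, simulated {observed}")
```

The simulated count should land close to 200, which matches the figure quoted in the lecture for a 0.01 cutoff.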

05:01

📉 Controlling for Family-Wise Error Rate

This paragraph introduces the concept of controlling the family-wise error rate (FWER) to avoid incorrectly identifying even a single gene as differentially expressed. The Bonferroni correction is presented as a method to adjust p-values by dividing the significance level (alpha) by the number of tests (M). This adjustment is highly conservative, often making it difficult to detect truly differentially expressed genes, especially in noisy datasets. To address this, the concept of the false discovery rate (FDR) is introduced as a less stringent alternative.
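The alpha / M rule described above can be sketched in a few lines. The helper names are hypothetical, and the four p-values in the toy example are made up for illustration.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests

def bonferroni_calls(p_values, alpha=0.05):
    """Return indices of hypotheses rejected under the Bonferroni correction."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p <= cutoff]

# With 20,000 genes and alpha = 0.05 the per-gene cutoff is 0.05 / 20,000.
print(bonferroni_cutoff(0.05, 20_000))

# Toy example with four tests: the cutoff is 0.05 / 4 = 0.0125.
print(bonferroni_calls([0.001, 0.02, 0.0124, 0.5]))
```

In the toy example only the first and third p-values pass, which shows how quickly the cutoff tightens even for a handful of tests.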

10:03

🎯 Understanding False Discovery Rate (FDR)

The paragraph explains the false discovery rate (FDR), which estimates the proportion of false positives among all the genes identified as differentially expressed. The description includes an explanation of type I and type II errors (false positives and false negatives) and how FDR aims to balance sensitivity and specificity in identifying differentially expressed genes. FDR allows for a controlled rate of false positives, making it a more practical approach than FWER in large-scale genomic studies.
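The four outcomes can be tabulated directly when the truth is known, as in a simulation. The sketch below uses the letters from the lecture (U, V, T, S) and computes V / R for one set of calls. Strictly, FDR is the expected value of this ratio; the toy truth and call vectors are invented for illustration.

```python
def error_table(truly_different, called_different):
    """Tabulate U, V, T, S from ground truth and calls (two lists of booleans)."""
    counts = {"U": 0, "V": 0, "T": 0, "S": 0}
    for truth, call in zip(truly_different, called_different):
        if not truth and not call:
            counts["U"] += 1   # true negative
        elif not truth and call:
            counts["V"] += 1   # false positive (type I error)
        elif truth and not call:
            counts["T"] += 1   # false negative (type II error)
        else:
            counts["S"] += 1   # true positive
    return counts

def false_discovery_proportion(counts):
    """V / R, where R = V + S is the number of calls; defined as 0 when R == 0."""
    r = counts["V"] + counts["S"]
    return counts["V"] / r if r else 0.0

truth = [False] * 6 + [True] * 4
calls = [False, False, False, False, False, True, True, True, True, False]
table = error_table(truth, calls)
print(table, false_discovery_proportion(table))
```

Here four genes are called, one of them wrongly, so the false discovery proportion is 1 / 4.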

15:06

🔗 P-Value Distributions and Statistical Testing

This paragraph details the interpretation of p-value distributions in the context of RNA-seq experiments. It discusses how the distribution of p-values can reveal the quality of the data and the presence of differentially expressed genes. The mixture of noise and true signals in large datasets is highlighted, with an emphasis on how a well-conducted statistical test should produce a characteristic distribution. The concept of FDR is further explained by showing how to estimate it using the distribution of p-values.
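The mixture picture can be reproduced with a simulation. The sketch below makes several assumptions that are not in the lecture: it uses a two-sided z-test with known variance instead of a t-test (to stay within the standard library), it fixes 18,000 null genes and 2,000 shifted genes, and it estimates the noise level as the average bar height in the right half of the histogram. That last step is a simplified reading of the dotted-line picture, not the Benjamini-Hochberg procedure itself.

```python
import math
import random

def z_test_p_value(group1, group2, sigma=1.0):
    """Two-sided z-test for a difference in means, assuming known sigma."""
    n1, n2 = len(group1), len(group2)
    diff = sum(group1) / n1 - sum(group2) / n2
    z = diff / (sigma * math.sqrt(1 / n1 + 1 / n2))
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_p_values(n_null=18_000, n_signal=2_000, shift=1.5, n=10, seed=1):
    """One p-value per gene; only the last n_signal genes have a real shift."""
    rng = random.Random(seed)
    p_values = []
    for gene in range(n_null + n_signal):
        mean2 = shift if gene >= n_null else 0.0
        g1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        g2 = [rng.gauss(mean2, 1.0) for _ in range(n)]
        p_values.append(z_test_p_value(g1, g2))
    return p_values

def histogram(p_values, n_bins=20):
    """Count p-values in equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

def estimate_fdr_first_bin(counts):
    """Noise level = mean height of the right half; FDR = noise / first bar."""
    right = counts[len(counts) // 2:]
    noise = sum(right) / len(right)
    return min(1.0, noise / counts[0]) if counts[0] else 1.0

counts = histogram(simulate_p_values())
print(counts)
print(f"estimated FDR at p < 0.05: {estimate_fdr_first_bin(counts):.2f}")
```

The first bar should be much taller than the rest, and the bars on the right should sit near the flat noise level contributed by the null genes.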

20:07

📊 Applying FDR in Differential Expression Analysis

The focus here is on the practical application of FDR in analyzing differential gene expression. The paragraph explains how researchers choose FDR thresholds (e.g., 1%, 5%, 10%) to report significant genes and the importance of also considering fold change as an additional filter. It discusses the balance between detecting true positives and minimizing false positives, with the goal of obtaining a manageable and meaningful set of differentially expressed genes.
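Combining the two filters is straightforward once each gene has an adjusted p-value and a log2 fold change. This is a sketch under stated assumptions: the 5% FDR cutoff and the two-fold (|log2FC| >= 1) threshold are common conventions chosen for illustration, and the gene names and numbers are made up.

```python
def select_genes(results, fdr_cutoff=0.05, min_abs_log2fc=1.0):
    """Keep genes passing both an FDR cutoff and a fold-change filter.

    `results` maps a gene name to (log2 fold change, FDR-adjusted p-value).
    """
    return sorted(
        gene
        for gene, (log2fc, fdr) in results.items()
        if fdr <= fdr_cutoff and abs(log2fc) >= min_abs_log2fc
    )

# Hypothetical results; gene names and numbers are invented for illustration.
toy_results = {
    "geneA": (2.3, 0.001),    # large change, significant: kept
    "geneB": (0.2, 0.0004),   # significant but tiny change: dropped
    "geneC": (-1.8, 0.03),    # strong down-regulation, significant: kept
    "geneD": (3.0, 0.40),     # large change but not significant: dropped
}
print(select_genes(toy_results))
```

Tightening or loosening either threshold is how the reported list is brought to a manageable size.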

25:15

🚧 Reducing Hypotheses in Noisy Datasets

This paragraph addresses the challenge of handling noise in datasets with many genes. It suggests strategies for reducing the number of hypotheses tested, such as filtering out genes with low expression levels or low variability across samples. These approaches help to increase the signal-to-noise ratio, making it easier to identify truly differentially expressed genes. The importance of using heuristic methods like DESeq to focus on more promising candidates is emphasized.
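A low-expression filter of the kind described above might look like the following. The thresholds (at least 10 reads in at least 3 samples) and the toy count table are assumptions for illustration; real pipelines choose these values per experiment.

```python
def filter_low_expression(counts, min_count=10, min_samples=3):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    `counts` maps a gene name to its list of per-sample read counts.
    """
    return {
        gene: values
        for gene, values in counts.items()
        if sum(v >= min_count for v in values) >= min_samples
    }

toy_counts = {
    "geneA": [120, 98, 143, 110, 87, 131],
    "geneB": [0, 1, 0, 0, 2, 0],        # essentially unexpressed
    "geneC": [0, 0, 45, 0, 0, 0],       # expressed in a single sample only
    "geneD": [15, 9, 22, 30, 4, 18],
}
kept = filter_low_expression(toy_counts)
print(sorted(kept), f"{len(kept)} of {len(toy_counts)} hypotheses remain")
```

Because the filter looks only at overall expression and not at which group a sample belongs to, it reduces the number of tests without using the comparison being tested.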

📈 The Impact of High-Throughput Sequencing on Biology

The final paragraph reflects on the transformative impact of high-throughput sequencing technologies on biological research. It notes that before these technologies, biologists were less concerned with statistical rigor. The need for proper statistical methods, like FDR, became apparent with the advent of experiments that generate vast amounts of data, such as gene expression microarrays. The importance of multiple hypothesis testing in these contexts is underscored as a critical advancement in the field.

Keywords

💡Multiple Hypothesis Testing

Multiple hypothesis testing refers to the statistical process of conducting several hypothesis tests simultaneously. In the context of the video, it is discussed in relation to the analysis of gene expression in high-throughput experiments, where thousands of genes are tested for differential expression. The challenge is to control the error rate across all tests, as testing many hypotheses increases the chance of finding significant results just by random chance.

💡P-Value

A p-value is a measure that helps determine the significance of results in hypothesis testing. It represents the probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true. In the video, p-values are discussed in the context of gene expression analysis, where they are used to determine if the expression of a gene is significantly different between two conditions. However, with many tests, p-values alone can be misleading without corrections.

💡Bonferroni Correction

The Bonferroni correction is a method used to adjust p-values when performing multiple hypothesis tests to control the family-wise error rate. This method divides the desired significance level by the number of tests, making it more stringent to claim significance. In the video, it is mentioned as a way to control the probability of making even a single Type I error (false positive) when testing thousands of genes.

💡Family-Wise Error Rate (FWER)

Family-wise error rate (FWER) is the probability of making at least one Type I error across a family of tests. It is important in multiple hypothesis testing because as the number of tests increases, so does the likelihood of making at least one incorrect rejection of the null hypothesis. The video discusses FWER in the context of ensuring that no genes are incorrectly identified as differentially expressed, using corrections like the Bonferroni method.
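How fast the FWER grows can be computed in closed form if the tests are assumed independent and all null hypotheses true, in which case FWER = 1 - (1 - alpha)^m. The independence assumption is made here only to get a simple formula; the Bonferroni bound itself does not require it.

```python
def fwer_independent(alpha, n_tests):
    """P(at least one false positive) for n independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests

for m in (1, 10, 100, 20_000):
    uncorrected = fwer_independent(0.05, m)
    corrected = fwer_independent(0.05 / m, m)   # Bonferroni per-test cutoff
    print(f"m={m:>6}  uncorrected={uncorrected:.4f}  bonferroni={corrected:.4f}")
```

Without correction the FWER approaches 1 after a few hundred tests, while the Bonferroni cutoff keeps it at or below 0.05 for every m.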

💡False Discovery Rate (FDR)

False discovery rate (FDR) is the expected proportion of false positives among the rejected hypotheses. It is a less stringent alternative to controlling the FWER, allowing for some false positives in exchange for greater power to detect true positives. The video highlights FDR as a crucial concept in genomic studies where controlling FWER might be too conservative, leading to a loss of potentially important discoveries.

💡Differential Expression

Differential expression refers to the change in gene expression levels between two or more conditions. In the video, this concept is central to the analysis being discussed, where the goal is to identify genes that show statistically significant differences in expression between experimental groups. Understanding differential expression is key to drawing biological conclusions from the data.

💡Type I Error (False Positive)

A Type I error occurs when a true null hypothesis is incorrectly rejected, leading to a false positive result. In the video, Type I errors are particularly concerning in the context of multiple hypothesis testing, as the risk of making such errors increases with the number of tests. Controlling for these errors is crucial to ensure the reliability of the findings, particularly when identifying differentially expressed genes.

💡Type II Error (False Negative)

A Type II error occurs when a false null hypothesis is not rejected, resulting in a false negative. This means that a real effect or difference is missed. The video touches on Type II errors in the context of gene expression studies, emphasizing the importance of balancing the need to avoid false positives (Type I errors) with the need to minimize false negatives in order to detect true differential expression.

💡Benjamini-Hochberg Method

The Benjamini-Hochberg method is a statistical procedure used to control the false discovery rate (FDR) when performing multiple comparisons. Unlike the Bonferroni correction, which controls the family-wise error rate, the Benjamini-Hochberg method allows for a certain proportion of false discoveries, thereby increasing the power to detect true effects. The video mentions this method as a widely used approach in high-throughput genomic studies to manage the trade-off between identifying true positives and limiting false positives.
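The Benjamini-Hochberg adjustment is short enough to write out. The sketch below follows the standard step-up rule: sort the p-values, scale the one at rank i by m / i, and enforce monotonicity from the largest p-value downward. The function name and the five toy p-values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.039, 0.001, 0.6, 0.008, 0.041]
adj = benjamini_hochberg(p)
print([round(a, 5) for a in adj])
print([i for i, a in enumerate(adj) if a <= 0.05])   # calls at 5% FDR
```

In this toy case the two smallest p-values are called at a 5% FDR. A Bonferroni cutoff of 0.05 / 5 = 0.01 would call the same two here, but the gap between the methods widens as the number of tests grows.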

💡RNA-Seq Experiment

RNA-Seq (RNA sequencing) is a technique used to analyze the quantity and sequences of RNA in a sample. In the context of the video, RNA-Seq experiments are used to measure gene expression levels across thousands of genes, providing data for differential expression analysis. The discussion in the video revolves around the challenges of interpreting RNA-Seq data, particularly in controlling for multiple hypothesis testing to avoid false discoveries.

Highlights

Introduction to multiple hypothesis testing in high throughput experiments and its relevance in genome-wide studies.

Explanation of the challenge in determining the cutoff point for differentially expressed genes in a genome with approximately 20,000 genes.

Discussion on the limitations of using p-value as a cutoff in large datasets, leading to potential false positives.

Introduction of the family-wise error rate (FWER) and its role in ensuring that no single gene is incorrectly identified as differentially expressed.

Explanation of the Bonferroni correction as a method to control the family-wise error rate in multiple hypothesis testing.

Introduction to the concept of the false discovery rate (FDR) as a less conservative alternative to FWER, allowing for a certain percentage of false positives.

Comparison of FWER and FDR, highlighting the trade-off between the stringency of FWER and the flexibility of FDR.

Description of the Benjamini-Hochberg method for estimating FDR in multiple hypothesis testing.

Clarification on the interpretation of p-values and FDR in the context of RNA-seq experiments and their role in determining differential gene expression.

Explanation of how noise and signal are separated in statistical tests, and the importance of good experimental design to achieve reliable results.

Discussion on how the p-value distribution can indicate the quality of an experiment and the presence of true signals.

Emphasis on the importance of filtering genes with low expression levels in order to reduce noise and improve the accuracy of differential expression analysis.

Summary of the role of statistical methods in improving the accuracy of high-throughput experiments in biology, particularly in gene expression studies.

Acknowledgment of the significant contribution of statistics in helping biologists understand and apply multiple hypothesis testing.

Final thoughts on the practical implications of using FDR in biological research, ensuring that a manageable number of genes are identified as differentially expressed.

Transcripts

play00:00

all right so the next question is

play00:04

multiple hypothesis testing so this

play00:10

really is relevant to the question we

play00:14

have a high throughput experiment and in

play00:18

the whole genome there are about 20,000

play00:21

genes and we did this differential

play00:23

expression analysis we tested the

play00:25

expression of every gene between one

play00:28

group of samples and another group of

play00:29

samples and we are asking is it

play00:31

different and at the end you know you

play00:33

can rank the genes by their level of

play00:35

differential expression but where do you

play00:37

cut the line to say between the two

play00:39

conditions I have 200 genes that are

play00:42

different or 500 genes that are

play00:44

different or a thousand genes that are

play00:46

different can we use p-value for

play00:49

something like this because

play00:51

basically if you use the negative

play00:54

binomial distribution or if you convert

play00:57

the original data into a log normal you

play01:00

can also use T distribution to calculate

play01:02

a differential expression each can give

play01:05

you a p-value right can you use the

play01:07

p-value to decide how

play01:11

many genes to cut right yeah that's

play01:28

definitely the right answer so we

play01:32

we mention here that when we try to test

play01:35

a differential expression for every gene

play01:37

with a p-value supposedly on every gene

play01:40

you decide you want to use point zero

play01:43

one as cutoff and if this gene is if you

play01:48

decide this is not differentially

play01:50

expressed that's the null hypothesis or

play01:52

that's the H0 hypothesis if the genes

play01:55

are really differentially expressed it's

play01:57

the alternative hypothesis you are

play02:01

rejecting the null and really calling

play02:03

something to be differentially expressed

play02:04

if something meets the p-value cutoff

play02:07

the situation as just the student

play02:11

mentioned in

play02:12

the genome we have 20,000 genes if you use

play02:15

this point zero one as cutoff a p-value

play02:19

as cutoff there are gonna be 200

play02:22

genes that are being that will be called

play02:25

because you have so many genes and by

play02:31

chance you will have 200 genes that have

play02:34

a p-value less than point zero one and

play02:36

so even if even if you have no genes

play02:41

differentially expressed you will still

play02:43

call about 200 genes and probably all of

play02:45

them are wrong this is similar to a

play02:47

situation like this I'm today in the

play02:51

classroom I'm trying to ask how do

play02:53

students decide whether they are gonna

play02:55

sit in the first half of the classroom

play02:58

or the second half of the classroom you

play03:01

I could try different things I say okay

play03:03

let's first try undergrad students all

play03:05

sit in the front

play03:06

no how about people with glasses all sit

play03:09

in the front no and all male students

play03:11

sit in the front no people who live

play03:15

within the earth who are from the US

play03:19

will sit in the front or who wear sweaters

play03:22

will sit in the front you know like if

play03:23

you try enough times maybe we will see

play03:26

that oh all of those who have a sibling

play03:30

and ate noodles yesterday sit in a

play03:33

front right we just don't know

play03:34

but if you would try it

play03:36

twenty thousand different things maybe

play03:38

one of them will be a hit right or some

play03:41

of those will indeed be

play03:43

able to separate the students in the

play03:44

front versus the students in the back

play03:46

but that's just because I've tried many

play03:48

different options it's the same

play03:50

situation with looking at 20,000 genes

play03:53

and asking whether they are different

play03:54

and so we need to make sure we we

play03:58

control for multiple hypothesis testing

play04:01

and um one way to do this is to make

play04:05

sure that I don't wrongly call even a

play04:08

single gene incorrectly and this is

play04:11

called a family-wise error rate so this

play04:15

is the probability of wrongly rejecting at least

play04:18

one hypothesis at a

play04:23

certain cutoff which means the probability

play04:26

of not even calling one of them wrong is

play04:30

gonna be one minus this number so

play04:33

supposedly um if I want to correct for

play04:35

the family-wise error rate at

play04:38

point zero five would mean that for

play04:43

whatever number of genes I call even one

play04:46

of them is wrong is

play04:48

lower than 0.05 probability then I have

play04:51

to use on each individual gene a much

play04:53

more stringent p-value okay so the way

play04:56

to do this and we won't go into the

play05:01

statistical derivation is to use this

play05:04

Bonferroni correction to really control

play05:07

the family-wise error rate if you have M

play05:11

hypotheses to test and you want to

play05:13

control the final whatever number of

play05:16

genes you call I want not even a single

play05:18

gene to be wrongly called at this level

play05:21

of confidence say point zero five

play05:23

confidence then for each individual gene

play05:26

our p-value kind of need to be alpha

play05:30

divided by this M which means if in

play05:33

say an RNA-seq experiment if the alpha

play05:38

is point zero five and we have

play05:41

20,000 genes to make the prediction the

play05:43

p-value cutoff would be point zero five

play05:46

divided by 20,000 which is a very small

play05:49

number and so sometimes when you don't

play05:52

have so many replicates or you know the

play05:55

data is fairly noisy you may not be able

play05:58

to see any genes to be differentially

play06:01

expressed at least at this cutoff and so

play06:04

this is too conservative in terms of

play06:07

multiple hypothesis testing correction

play06:10

and so what people use is this other

play06:14

concept called false discovery rate and

play06:17

so if we look at the study like this so

play06:24

the rows are the truth okay the truth is

play06:30

H0 is your null hypothesis you

play06:33

say the truth conditions where the two

play06:35

groups of samples are actually similar

play06:37

there's no difference in between the

play06:40

gene expression in the alternative

play06:44

hypothesis you are saying that actually

play06:46

the two groups are different and then if

play06:51

you look at the columns here that mean

play06:54

this indicates for whatever statistical

play06:57

test you are going to really make a call

play06:59

to say they are different or or make a

play07:02

call to say they are not different

play07:04

which means whether you would reject or

play07:06

not reject null hypothesis and so on

play07:10

what is U in this case so U is

play07:16

this group so for a particular gene the

play07:21

two groups are quite similar and you

play07:24

also did not call it as different what

play07:29

is this it's a true negative right this

play07:36

is a true negative that's U what is V

play07:45

it's a false positive right because

play07:49

these two groups are similar but you

play07:50

call them as different that's a false

play07:52

positive T it's a false negative right

play08:01

these two groups are different but you

play08:03

didn't call it that's a false negative

play08:05

and then the S is a true positive right

play08:07

and so yes we mentioned the false

play08:13

positives are what we call type 1 errors

play08:18

in statistics and the T which is the

play08:22

false negatives are the type 2 errors

play08:24

false discovery rate is V / R where R is

play08:31

basically all the things we call so we

play08:34

are trying to ask out of all the calls

play08:37

we make how many are false positives or

play08:40

what percentage are false positives

play08:42

that's the false discovery rate okay so

play08:46

basically when you do a rna-seq

play08:50

experiment you want to see some genes

play08:52

are different and even if there are some

play08:56

mistakes in occasional genes you can

play08:58

tolerate that supposedly if you say I

play09:02

want to call whatever number of genes to

play09:05

be different as long as the

play09:06

total wrong calls are no more than 20%

play09:11

that's okay right if I call 200 genes

play09:14

and 20% is false that's 40 genes that's

play09:18

okay so 160 genes are still correct

play09:21

that's

play09:22

controlling false discovery rate at 20%

play09:26

okay and so you can see family-wise

play09:29

error rate is to make sure that out of

play09:31

those 200 genes not even a single one of

play09:35

them is wrong at

play09:37

this probability whereas FDR just says

play09:40

out of the 200 genes or whatever number

play09:43

of genes you call X percentage could be

play09:45

wrong you don't necessarily know which

play09:48

ones are wrong you just know roughly X

play09:50

percentage of them could be wrong and so

play09:54

so the way to calculate this I stole

play09:57

this from the StatQuest by Joshua Starmer

play10:00

by the way in the course schedule

play10:02

you will see the lecture slides and the

play10:06

videos and at the bottom are other videos I

play10:08

found on YouTube I would say sometimes

play10:11

and probably most of the times some of

play10:14

these other videos are really quite good

play10:16

and you should watch those as well I'd

play10:17

especially like this StatQuest by Joshua

play10:20

Starmer and so I just did a screen dump

play10:23

from his YouTube video to show you so um

play10:27

this is also part of this homework one

play10:30

you can see what we wish we asked you to

play10:35

try this you know you you use a random

play10:37

number generator from say uniform or

play10:40

some same distribution and you just

play10:43

generate some group of samples to look

play10:46

at their difference let's just say

play10:49

supposedly you have a random number

play10:51

generator to generate a differential to

play10:54

generate numbers let's just make them

play10:56

simple if the distribution that you

play11:00

generate this number is a normal

play11:02

distribution and say you randomly

play11:04

generate ten numbers in group one and

play11:07

randomly generate ten numbers in group

play11:09

two and you calculate their differential

play11:11

expression using a t-test that will give

play11:15

you a kind of you know you each will

play11:19

give you a normal distribution and when

play11:21

you look at their difference most of the

play11:23

time you will not see a significant

play11:25

p-value but then occasionally you will

play11:27

see a small p-values but if you look at

play11:30

all the different p-value distribution it

play11:32

should be uniform that's what p-value

play11:34

means because it's the same distribution

play11:37

you just randomly generate numbers you

play11:39

can see whether they are different it

play11:42

should be a uniform distribution in here

play11:44

right so from 0 to 1 and every data has

play11:48

the same probability of hitting a

play11:50

particular p-value and this is all

play11:54

noise but if indeed there are some genes

play11:57

on this experiment that are different

play11:59

because of the treatment it would mean

play12:01

that underlying there are two different

play12:04

distributions mostly one is a normal the

play12:06

other is a drug treatment or

play12:08

normal the other is a disease state

play12:09

then when you are randomly generating

play12:13

numbers from the first distribution

play12:14

versus the second distribution and then

play12:17

test them and look at the p-value you

play12:20

probably see something like this right

play12:22

the p-value is much more skewed towards

play12:25

a smaller p-value because indeed they

play12:27

are coming from different distributions

play12:29

right so the reality of the data we are

play12:33

getting is a mixture of the two because

play12:35

you are dealing with 20,000 genes and

play12:38

this in the genome and most of the genes

play12:42

in the experimental condition I think we

play12:45

can make some safe assumption that

play12:47

anytime you treat with a real cell life

play12:50

cell with drug or or something most of

play12:54

the genes are gonna be staying roughly

play12:56

the same but genes that are changing is

play12:59

only a subset and so most of the genes

play13:02

will give you this type of noise

play13:05

uniform distribution but only a subset

play13:09

of the genes a tiny portion of those

play13:11

genes will give you a true signal that's

play13:14

skewed towards the left but when you mix

play13:17

them together this is what you get if

play13:23

you take a look at a particular

play13:25

experiment in fact um if you want to

play13:28

test anytime you test a lot of

play13:31

hypotheses you might want to draw the

play13:33

p-value distribution to see whether your

play13:36

statistical test is good because if you

play13:38

normalize your data well and you've done

play13:40

your statistical test well you should

play13:43

see a p-value like this okay

play13:46

what happens if after you finish an

play13:49

experiment and you you run the

play13:51

statistical test and then look at a

play13:53

p-value of all the genes in an rna-seq

play13:55

experiment you only see this what does

play13:58

that mean

play14:05

if you see something like this in the

play14:08

p-value distribution it would mean that

play14:12

your data is too noisy

play14:14

there is no differentially expressed

play14:16

genes that are significant from your

play14:18

experiment okay

play14:20

whereas very often we see something like

play14:22

this by the way if there's no

play14:25

differentially expressed genes the

play14:27

uniform distribution should be the

play14:29

y-axis should be at 1 and it should be

play14:32

quite even to cover this area but if

play14:37

there are real signals that some genes

play14:39

are truly differentially expressed you

play14:41

have more area on the small side and

play14:44

that's gonna take the area away from the

play14:46

rest and so this area may not really

play14:49

these parts may not really reach one it

play14:52

might be 0.9 or 0.95 or whatever right

play14:56

so this is a slightly lower number than

play14:59

one remember we're saying that this is

play15:02

kind of an addition like the sum of

play15:06

the noise and the signal therefore if we

play15:09

say we want to do a p-value cutoff at

play15:12

this first bar how

play15:15

much is the signal and how much is the

play15:18

noise we

play15:24

say actually at any p-value cutoff roughly

play15:29

the noise will already give you this

play15:32

level of p-value below that bar right

play15:35

just by random chance even if the two

play15:37

distributions have no difference so these

play15:40

numbers are these genes are gonna be the

play15:43

random noise whereas anything above this

play15:46

dotted line these are the real signals

play15:48

and so what is the FDR if you cut off at

play15:53

this level so FDR is the false positives

play16:01

divided by all the things you call, so it

play16:04

would mean that if you were to cut off

play16:05

and if you are gonna call differential

play16:07

genes at this particular p-value your

play16:10

false positive is how many genes are you

play16:13

gonna call if you're gonna call a

play16:18

p-value at this cut off you're gonna

play16:21

call everything to the left of this as

play16:23

differentially expressed right but out

play16:25

of all of these genes anything below

play16:28

the dotted line these are all random

play16:30

garbage from the noise but everything

play16:34

above it that's the real signal so the

play16:36

FDR would be this little area divided by

play16:39

the whole bar and so then if you were to

play16:42

cut at another p-value let's say let's

play16:45

cut off at here it would be the noise

play16:47

would be everything below this dotted

play16:49

line here and divided by all the genes

play16:54

that are called below this p-value which

play16:56

includes this bar and this bar together

play16:58

right so basically the intuition is at

play17:01

every particular p-value cutoff you know

play17:05

that the A part is coming from random

play17:09

noise and the B part is the real signal

play17:12

and the false discovery rate is roughly

play17:14

A divided by A plus B so A plus B is all

play17:18

the genes you're calling below that

play17:20

p-value and you know that the noise in

play17:22

that, the potential false

play17:25

positives are estimated based on this

play17:29

lower dotted line which is kind of what

play17:32

you actually estimate based on the right

play17:33

side so that's why if you design a good

play17:36

statistical test

play17:38

and if you capture the correct

play17:40

distribution of your data and you do the

play17:42

right statistical test um after you

play17:46

finish testing all of your genes the

play17:48

p-value distribution should give you something so

play17:50

that the right side is pretty flat and

play17:52

you can use that flat level like

play17:55

extrapolate the flat level to the left

play17:57

to estimate roughly how much is your

play18:00

noise and the remaining ones you called

play18:04

are the real signals okay and so as we

play18:08

mentioned in this situation you just

play18:10

know roughly at each p-value cutoff how

play18:16

many genes you're gonna call and

play18:18

roughly out of those how many

play18:20

are false positives you don't know which

play18:23

one is the false positive you just know

play18:25

a percentage of them are gonna be fake

play18:29

okay that's FDR there is a one to one

play18:33

correspondence basically you can imagine

play18:36

at every p-value cutoff you can

play18:39

estimate the A divided by the A plus B

play18:43

okay and so um the FDR value is always

play18:47

higher than the p-value because the

play18:52

p-value is basically whatever cutoff

play18:55

you have

play18:58

um but below that there is always

play19:02

something that is fake okay and so FDR

play19:08

is basically a less conservative way to

play19:11

do multiple hypothesis correction than

play19:14

the family-wise error rate there is a

play19:17

very widely used method called the

play19:20

Benjamini-Hochberg method that estimates this FDR

play19:22

you know the rough idea is like this you

play19:25

estimate by extrapolation from the right

play19:28

side you know all the things on the

play19:30

right that gives you the level of noise

play19:34

and this will tell you how much truth

play19:37

you are gonna have you might see FDR you

play19:41

might see, depending on the algorithm,

play19:43

basically adjusted p-value and sometimes you

play19:46

might see q-value they all mean the same

play19:48

thing
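
The Benjamini-Hochberg adjustment mentioned above is usually computed from ranked p-values rather than by reading a histogram. The following is a minimal sketch of that standard step-up formula (p times the number of tests divided by rank, then made monotone). The function name `bh_adjust` and the ten example p-values are made up for illustration and do not come from the lecture.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices from smallest to largest p
    scaled = p[order] * m / np.arange(1, m + 1)    # p * (number of tests) / rank
    # Enforce monotonicity: running minimum taken from the largest p downwards.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)   # cap at 1
    return adjusted

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
padj = bh_adjust(pvals)
for p, q in zip(pvals, padj):
    print(f"p = {p:.3f}   BH-adjusted = {q:.3f}")
```

Each adjusted value is at least as large as its raw p-value, and the adjusted values keep the same ordering as the raw ones, which matches the two properties stated in the lecture.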

play19:49

adjusted p-value is the multiple

play19:51

hypothesis testing adjusted p-value

play19:55

that's the same as FDR or q-value it's all

play19:59

the same and as we mentioned p-value and

play20:01

FDR are always monotonic so every

play20:04

p-value has its corresponding FDR and

play20:07

it's also bigger than the p-value for

play20:09

the same gene so at the end of

play20:11

differential expression analysis for

play20:14

each gene based on the negative binomial

play20:16

distribution you'll get a p-value and by

play20:18

looking at all of those p-values

play20:20

together you also estimate the FDR and

play20:23

usually people use those FDR values

play20:26

in order to really report genes

play20:28

some people take one percent FDR some

play20:32

people take five percent some people

play20:34

take ten percent at the least when you

play20:36

say I'm reporting FDR the readers would

play20:40

understand that you have already done

play20:42

multiple hypothesis testing correction

play20:44

and sometimes people in addition to

play20:48

the FDR they also filter by fold change

play20:50

say the fold change needs to be

play20:51

1.2 or 1.5 or 2-fold but

play20:56

in general with this FDR estimate you

play20:58

can really get a sense of the signal to

play21:02

noise of the genes you're calling most of

play21:05

the time in reality people are

play21:08

comfortable with calling between

play21:11

say 50 genes and 3,000 genes as

play21:13

different if after all of this you get

play21:16

three genes that are different you're

play21:19

kind of wondering okay maybe my

play21:21

experiment is too noisy you know

play21:23

something is wrong but if you're calling

play21:26

too many genes as different you

play21:28

might want to use more stringent like if

play21:30

you see 5,000 genes as different maybe

play21:33

you want to use a more stringent FDR

play21:35

cutoff or you want to also add a

play21:40

filter for fold change okay um

play21:43

one way to so you can see basically um

play21:47

the level of noise you have in a data

play21:50

versus the signal depending on

play21:52

suppose I have only 300 genes that

play21:56

are differentially expressed but I'm

play21:58

testing 20,000 random other genes then the

play22:03

signal is 300 out of all the 20,000 but

play22:08

if instead of 20,000 genes I only look

play22:12

at 5,000 genes then the noise level

play22:15

would be much much lower actually and

play22:17

the signal would be much higher and so

play22:19

is there a way we can reduce the

play22:22

multiple hypotheses we're testing if you

play22:27

have a lot of noise like if you have to

play22:28

test 10,000 genes

play22:31

well of course in human we

play22:32

don't have 10,000 genes but suppose that

play22:34

you have 10,000 noise versus 300 signal

play22:37

it's gonna be a lot harder to call but

play22:40

if you have only 5,000 genes and 300

play22:43

signal it's easier to call but how do we

play22:45

reduce the number of hypotheses in this

play22:48

case so algorithms such as DESeq will

play22:54

try to ignore a gene if it has too low

play22:57

gene expression level in too few samples

play23:00

this gene is hopeless right what's the

play23:02

point of testing differential expression

play23:04

you know you will never really actually

play23:06

reach p-value significance so with DESeq

play23:09

there are some simple heuristics to say

play23:12

oh if this gene is expressed say you

play23:15

have 6 or 10 samples and you only

play23:22

see a single count in one of the samples

play23:24

and the remaining

play23:24

samples are all zero it's hopeless

play23:27

this will never really reach

play23:29

p-value significance and so DESeq will use

play23:31

some simple heuristics to remove those

play23:33

genes that have too low expression level

play23:37

in too few samples to reduce

play23:40

the total number of genes to test this

play23:43

way the amount of noise will be

play23:45

reduced and the signal will be higher

play23:48

okay yeah but at the end most people are

play23:51

most comfortable with usually a few

play23:53

hundred differentially expressed genes

play23:56

questions about FDR
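
The idea of dropping hopeless genes before testing can be sketched as a simple count filter. This is not the rule any particular package applies: the function name `keep_expressed` and the thresholds `min_count=10` and `min_samples=2` are hypothetical, chosen only to show the kind of heuristic described above.

```python
import numpy as np

def keep_expressed(counts, min_count=10, min_samples=2):
    """Mask of genes with at least `min_count` reads in at least `min_samples` samples.

    Both thresholds are illustrative assumptions, not values from a real tool.
    """
    counts = np.asarray(counts)
    return (counts >= min_count).sum(axis=1) >= min_samples

# Rows are genes, columns are 6 samples (made-up read counts).
counts = np.array([
    [120, 98, 130, 310, 295, 280],   # well expressed, worth testing
    [0, 0, 3, 0, 0, 0],              # one small count in a single sample: hopeless
    [15, 22, 9, 14, 30, 18],         # modest but consistently detected
    [0, 0, 0, 0, 0, 0],              # never detected
])

mask = keep_expressed(counts)
filtered = counts[mask]              # only these genes go on to the statistical test
print(f"testing {mask.sum()} of {len(mask)} genes")
```

Fewer tested genes means fewer hypotheses, so the multiple-testing adjustment is applied over a smaller set that contains proportionally more of the real signal.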

play23:59

yes because if you cut off at 5%

play24:11

p-value and if there are no

play24:15

differentially expressed genes do you

play24:17

know what your FDR is it's a hundred

play24:21

percent all of them are fake right

play24:26

but if out of that suppose you make

play24:29

calls and fifty percent are fake and

play24:31

fifty percent are true then you have

play24:34

50 percent FDR at that p-value you can

play24:36

see here in each of these bars these are all

play24:40

of your fake ones and these are your

play24:43

correct ones right you are trying to

play24:45

estimate this percent of fake with the

play24:48

total compared to all the others the

play24:52

bottom ones give you that p-value right

play24:56

and so FDR is always bigger than

play25:01

the p-value okay yeah so I would say the

play25:14

best thing statistics have done to teach

play25:18

biologists is about the multiple

play25:20

hypothesis testing with high-throughput

play25:23

sequencing I would say before there is

play25:25

any high throughput experiment people

play25:28

didn't really care as much about

play25:30

statistics

play25:31

you know biologists just feel like oh

play25:33

if I see a difference like anything

play25:35

different I'll report it I have a paper

play25:37

that's great but starting from gene

play25:41

expression microarrays when people

play25:43

started looking at thousands of genes or

play25:45

tens of thousands of genes

play25:47

they realize that you need to learn

play25:49

about multiple hypothesis testing and I

play25:53

think that's really something

play25:56

statistics helped biologists with

play26:02

and so basically after a microarray

play26:04

experiment you use the DESeq negative

play26:07

binomial distribution to call the

play26:09

differential genes then use the FDR to

play26:12

estimate how many genes you

play26:13

really want to call so that of whatever

play26:16

number of genes you call maybe only 5%

play26:18

of those are wrong and the remaining

play26:20

ones are going to be correct okay
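
The A divided by A plus B picture from the histogram discussion can be tried on simulated p-values. This is a rough sketch of that intuition only, not the Benjamini-Hochberg procedure itself. The function name `estimate_fdr`, the choice to treat p-values above 0.5 as the flat noise region, and the simulated gene counts are all assumptions made for illustration.

```python
import numpy as np

def estimate_fdr(pvalues, cutoff, flat_from=0.5):
    """Rough FDR at `cutoff`: expected noise calls (A) over all calls (A + B)."""
    p = np.asarray(pvalues, dtype=float)
    # Height of the flat right side, as genes per unit of p-value.
    noise_per_unit = np.sum(p >= flat_from) / (1.0 - flat_from)
    a = noise_per_unit * cutoff           # A: noise extrapolated to below the cutoff
    called = np.sum(p <= cutoff)          # A + B: every gene called at this cutoff
    if called == 0:
        return float("nan")
    return min(1.0, a / called)

rng = np.random.default_rng(1)
p_all = np.concatenate([
    rng.uniform(0.0, 1.0, 19_000),        # assumed unchanged genes (noise)
    rng.beta(0.5, 10.0, 1_000),           # assumed truly changing genes (signal)
])

for cutoff in (0.001, 0.01, 0.05):
    print(f"cutoff {cutoff:<6} estimated FDR = {estimate_fdr(p_all, cutoff):.2f}")
```

In this simulation the estimated FDR at each cutoff is larger than the cutoff itself and grows as the cutoff is relaxed, which is the behavior described in the lecture.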
