STAT115 Chapter 5.3 Multiple Hypotheses Testing and False Discovery Rate
Summary
TLDR: The transcript discusses the challenge of multiple hypothesis testing in high-throughput experiments, particularly when analyzing differential gene expression across 20,000 genes. It emphasizes the importance of controlling for false positives using techniques like the Bonferroni correction and False Discovery Rate (FDR). The speaker explains how FDR provides a less conservative approach than the family-wise error rate, making it useful for identifying genuinely differentially expressed genes. The discussion also covers strategies for reducing noise in data and improving the accuracy of differential gene expression analysis.
Takeaways
- High-throughput experiments can involve testing differential expression across thousands of genes, necessitating a method to determine which genes are significantly different.
- Relying on a fixed p-value cutoff across many tests can produce hundreds of false positives, even if no genes are truly differentially expressed.
- The Bonferroni correction is a conservative approach that controls the family-wise error rate by tightening the p-value threshold to avoid any false positives, but it may be too strict.
- False Discovery Rate (FDR) offers a less conservative approach than the family-wise error rate, allowing a small percentage of false positives while maintaining overall accuracy.
- FDR is the expected proportion of false positives among all the genes called significant, which helps in striking an acceptable balance between true and false positives.
- The FDR value is always at least as large as the corresponding p-value because it accounts for the false positives expected across many tests.
- Noisy data can produce a uniform p-value distribution, indicating that no genes are significantly differentially expressed; reducing the number of hypotheses tested can help mitigate this.
- In RNA-seq experiments, genes with low expression levels in too few samples can be filtered out to reduce noise and improve the signal-to-noise ratio.
- The key to successful differential expression analysis is balancing the stringency of statistical tests to avoid calling too many or too few significant genes.
- Multiple hypothesis testing has become increasingly important with the advent of high-throughput sequencing, emphasizing the need for statistical rigor in biological research.
Q & A
What is the primary concern when testing the expression of a large number of genes in a high-throughput experiment?
-The primary concern is controlling for multiple hypothesis testing, as testing a large number of genes (e.g., 20,000) can lead to a significant number of false positives due to random chance.
Why can't a simple p-value cutoff be used to identify differentially expressed genes in large-scale experiments?
-Using a simple p-value cutoff, such as 0.01, can lead to many false positives in large-scale experiments because, with a large number of tests, some genes will appear to be differentially expressed purely by chance.
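The scale of this problem is easy to check by simulation. Under the null hypothesis a p-value is uniformly distributed on [0, 1], so a minimal pure-Python sketch (using the lecture's 20,000-gene example) shows roughly 200 null tests passing a naive 0.01 cutoff purely by chance:

```python
import random

random.seed(42)

# Under the null hypothesis, a p-value is uniformly distributed on [0, 1],
# so we can simulate one null p-value per gene directly.
n_genes = 20_000
p_values = [random.random() for _ in range(n_genes)]

# Count "discoveries" at a naive 0.01 cutoff: expect about 20,000 * 0.01 = 200.
false_positives = sum(p < 0.01 for p in p_values)
print(false_positives)
```

Every one of these "discoveries" is a false positive, since nothing in the simulation is truly differentially expressed.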
What is the family-wise error rate (FWER) and how is it controlled?
-The family-wise error rate is the probability of making even one false positive call among all the hypotheses tested. It is controlled by using methods like the Bonferroni correction, which adjusts the p-value cutoff by dividing it by the number of tests performed.
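The Bonferroni arithmetic is simple enough to sketch directly (function names here are illustrative, not from the lecture):

```python
def bonferroni_cutoff(alpha, m):
    """Per-test p-value threshold that controls the FWER at alpha across m tests."""
    return alpha / m

def bonferroni_adjust(p, m):
    """Equivalent view: inflate each raw p-value by m (capped at 1), then compare to alpha."""
    return min(1.0, p * m)

# With alpha = 0.05 and 20,000 genes, each gene needs p < 2.5e-06 to be called.
print(bonferroni_cutoff(0.05, 20_000))  # 2.5e-06
```

The two views are equivalent: comparing a raw p-value to alpha/m is the same as comparing the inflated p-value to alpha.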
Why might the Bonferroni correction be too conservative in some cases?
-The Bonferroni correction can be too conservative, especially when the data is noisy or there are few replicates, making it difficult to detect any differentially expressed genes because the p-value cutoff becomes very stringent.
What is the false discovery rate (FDR) and why is it preferred over the family-wise error rate in some experiments?
-The false discovery rate (FDR) is the expected proportion of false positives among all the genes called differentially expressed. It is preferred over the family-wise error rate in some experiments because it allows for some false positives, which is often acceptable in large-scale studies where controlling the FWER would be too stringent.
How is the FDR estimated in practice?
-FDR is estimated by looking at the distribution of p-values across all tests. The Benjamini-Hochberg method is a commonly used approach that estimates FDR by extrapolating the level of noise (false positives) from the right side of the p-value distribution and comparing it to the signal (true positives).
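A minimal pure-Python sketch of the Benjamini-Hochberg step-up adjustment (this is the standard rank-based formula, not code from the lecture):

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR).

    Sort p-values ascending, scale each by m/rank, then enforce
    monotonicity sweeping from the largest p-value down.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the least to most significant
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min          # adjusted value in the original order
    return adjusted

# The three small p-values share an adjusted value of about 0.04.
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))
```

Note that every adjusted value is at least as large as its raw p-value, matching the observation elsewhere in the lecture that FDR is always bigger than the p-value for the same gene.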
What is the significance of the p-value distribution in determining the quality of a statistical test in gene expression analysis?
-The p-value distribution helps determine the quality of a statistical test. A uniform distribution suggests no true signal, while a skewed distribution toward smaller p-values indicates the presence of differentially expressed genes. If the distribution is flat on the right side, it suggests the noise level is low, making the statistical test more reliable.
What does it mean if the p-value distribution is uniform in an RNA-seq experiment?
-If the p-value distribution is uniform, it suggests that the data is too noisy and that there are no significant differentially expressed genes in the experiment.
How can the number of hypotheses tested in an RNA-seq experiment be reduced to improve the signal-to-noise ratio?
-The number of hypotheses tested can be reduced by filtering out genes with low expression levels in too few samples, as these genes are unlikely to reach statistical significance. This reduces the noise and increases the likelihood of detecting true signals.
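The filtering idea can be sketched as follows (the thresholds and function name are illustrative assumptions, not DESeq's actual rule):

```python
def filter_low_expression(counts, min_count=5, min_samples=2):
    """Keep genes whose count reaches min_count in at least min_samples samples."""
    kept = {}
    for gene, sample_counts in counts.items():
        if sum(c >= min_count for c in sample_counts) >= min_samples:
            kept[gene] = sample_counts
    return kept

counts = {
    "geneA": [0, 0, 0, 12, 0, 0],      # nonzero in only one sample: hopeless to test
    "geneB": [30, 25, 40, 35, 28, 33], # consistently expressed: worth testing
}
print(list(filter_low_expression(counts)))  # -> ['geneB']
```

Dropping geneA before testing shrinks the number of hypotheses, which directly lowers the expected number of false positives at any cutoff.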
What role do fold change and FDR play in reporting differentially expressed genes?
-Fold change and FDR are often used together in reporting differentially expressed genes. FDR helps control for false positives, while filtering by fold change ensures that only genes with a biologically meaningful change in expression are reported. This approach provides a more accurate and relevant list of differentially expressed genes.
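Combining the two filters can be sketched like this (thresholds and the fold-change convention are illustrative assumptions):

```python
import math

def report_genes(results, fdr_cutoff=0.05, min_fold_change=1.5):
    """results: list of (gene, fdr, fold_change) tuples.

    Keep genes that pass both the FDR cutoff and a fold-change filter;
    the abs(log2) form treats up- and down-regulation symmetrically.
    """
    return [
        gene
        for gene, fdr, fc in results
        if fdr <= fdr_cutoff and abs(math.log2(fc)) >= math.log2(min_fold_change)
    ]

results = [
    ("geneA", 0.01, 2.0),  # significant and large change: reported
    ("geneB", 0.01, 1.1),  # significant but tiny change: filtered out
    ("geneC", 0.30, 3.0),  # large change but not significant: filtered out
]
print(report_genes(results))  # -> ['geneA']
```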
Outlines
Multiple Hypothesis Testing in Genomic Studies
The paragraph discusses the challenge of identifying differentially expressed genes in high-throughput experiments, specifically in genome-wide studies with about 20,000 genes. The dilemma lies in determining how many genes are significantly different between two conditions, using p-values derived from statistical tests like the T-distribution or negative binomial distribution. The issue of false positives arises when applying a standard p-value cutoff across many tests, potentially leading to the erroneous identification of genes as differentially expressed.
Controlling for Family-Wise Error Rate
This paragraph introduces the concept of controlling the family-wise error rate (FWER) to avoid incorrectly identifying even a single gene as differentially expressed. The Bonferroni correction is presented as a method to adjust p-values by dividing the significance level (alpha) by the number of tests (M). This adjustment is highly conservative, often making it difficult to detect truly differentially expressed genes, especially in noisy datasets. To address this, the concept of the false discovery rate (FDR) is introduced as a less stringent alternative.
Understanding False Discovery Rate (FDR)
The paragraph explains the false discovery rate (FDR), which estimates the proportion of false positives among all the genes identified as differentially expressed. The description includes an explanation of type I and type II errors (false positives and false negatives) and how FDR aims to balance sensitivity and specificity in identifying differentially expressed genes. FDR allows for a controlled rate of false positives, making it a more practical approach than FWER in large-scale genomic studies.
P-Value Distributions and Statistical Testing
This paragraph details the interpretation of p-value distributions in the context of RNA-seq experiments. It discusses how the distribution of p-values can reveal the quality of the data and the presence of differentially expressed genes. The mixture of noise and true signals in large datasets is highlighted, with an emphasis on how a well-conducted statistical test should produce a characteristic distribution. The concept of FDR is further explained by showing how to estimate it using the distribution of p-values.
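The estimation idea above can be sketched in pure Python: treat the flat right side of the p-value histogram as the noise level, extrapolate it leftward, and take the ratio of expected noise to everything called at the cutoff. The 0.5 noise-region boundary and the function name are illustrative assumptions; this is the intuition from the lecture, not the exact Benjamini-Hochberg computation.

```python
def estimate_fdr(p_values, cutoff, null_region=0.5):
    """Estimate FDR at a p-value cutoff from the shape of the p-value distribution.

    P-values above `null_region` are assumed to be pure noise; extrapolating
    that flat level leftward gives the expected count of null p-values below
    the cutoff (the 'a' part), and FDR is roughly a / (a + b).
    """
    # Height of the flat right side, scaled to a count per unit of p-value.
    noise_rate = sum(p > null_region for p in p_values) / (1.0 - null_region)
    expected_false = noise_rate * cutoff           # a: extrapolated noise below cutoff
    called = sum(p <= cutoff for p in p_values)    # a + b: everything we call
    return min(1.0, expected_false / called) if called else 0.0

# On a perfectly uniform (all-null) p-value set, the estimate is 100% FDR.
uniform = [(i + 0.5) / 1000 for i in range(1000)]
print(estimate_fdr(uniform, 0.05))  # -> 1.0
```

Adding a block of genuinely small p-values (real signal) to the same set pulls the estimate down, because the noise extrapolated from the right side is now only a fraction of what is called.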
Applying FDR in Differential Expression Analysis
The focus here is on the practical application of FDR in analyzing differential gene expression. The paragraph explains how researchers choose FDR thresholds (e.g., 1%, 5%, 10%) to report significant genes and the importance of also considering fold change as an additional filter. It discusses the balance between detecting true positives and minimizing false positives, with the goal of obtaining a manageable and meaningful set of differentially expressed genes.
Reducing Hypotheses in Noisy Datasets
This paragraph addresses the challenge of handling noise in datasets with many genes. It suggests strategies for reducing the number of hypotheses tested, such as filtering out genes with low expression levels or low variability across samples. These approaches help to increase the signal-to-noise ratio, making it easier to identify truly differentially expressed genes. The importance of heuristics such as those used by DESeq to focus on more promising candidates is emphasized.
The Impact of High-Throughput Sequencing on Biology
The final paragraph reflects on the transformative impact of high-throughput sequencing technologies on biological research. It notes that before these technologies, biologists were less concerned with statistical rigor. The need for proper statistical methods, like FDR, became apparent with the advent of experiments that generate vast amounts of data, such as gene expression microarrays. The importance of multiple hypothesis testing in these contexts is underscored as a critical advancement in the field.
Keywords
Multiple Hypothesis Testing
P-Value
Bonferroni Correction
Family-Wise Error Rate (FWER)
False Discovery Rate (FDR)
Differential Expression
Type I Error (False Positive)
Type II Error (False Negative)
Benjamini-Hochberg Method
RNA-Seq Experiment
Highlights
Introduction to multiple hypothesis testing in high throughput experiments and its relevance in genome-wide studies.
Explanation of the challenge in determining the cutoff point for differentially expressed genes in a genome with approximately 20,000 genes.
Discussion on the limitations of using p-value as a cutoff in large datasets, leading to potential false positives.
Introduction of the family-wise error rate (FWER) and its role in ensuring that no single gene is incorrectly identified as differentially expressed.
Explanation of the Bonferroni correction as a method to control the family-wise error rate in multiple hypothesis testing.
Introduction to the concept of the false discovery rate (FDR) as a less conservative alternative to FWER, allowing for a certain percentage of false positives.
Comparison of FWER and FDR, highlighting the trade-off between the stringency of FWER and the flexibility of FDR.
Description of the Benjamini-Hochberg method for estimating FDR in multiple hypothesis testing.
Clarification on the interpretation of p-values and FDR in the context of RNA-seq experiments and their role in determining differential gene expression.
Explanation of how noise and signal are separated in statistical tests, and the importance of good experimental design to achieve reliable results.
Discussion on how the p-value distribution can indicate the quality of an experiment and the presence of true signals.
Emphasis on the importance of filtering genes with low expression levels in order to reduce noise and improve the accuracy of differential expression analysis.
Summary of the role of statistical methods in improving the accuracy of high-throughput experiments in biology, particularly in gene expression studies.
Acknowledgment of the significant contribution of statistics in helping biologists understand and apply multiple hypothesis testing.
Final thoughts on the practical implications of using FDR in biological research, ensuring that a manageable number of genes are identified as differentially expressed.
Transcripts
all right so the next question is
multiple hypothesis testing so this
really is relevant to the question we
have a high throughput experiment and in
the whole genome there are about 20,000
genes and we did this differential
expression analysis we tested the
expression of every gene between one
group of samples and another group of
samples and we are asking is it
different and at the end you know you
can rank the genes by their level of
differential expression but where do you
cut the line to say between the two
conditions I have 200 genes that are
different or 500 genes that are
different or a thousand genes that are
different can we use p-value for something like this because basically if you use the negative binomial distribution or if you convert the original data into a log normal you can also use the t-distribution to calculate differential expression each can give you a p-value right can you use the p-value to decide how
many genes to cut right yeah that's definitely the right answer so we mention here that when we try to test differential expression for every gene with a p-value supposedly for every gene you decide you want to use 0.01 as the cutoff and if you decide this gene is not differentially expressed that's the null hypothesis and if the gene is really differentially expressed that's the alternative hypothesis you reject the null and really call something differentially expressed when it meets the p-value cutoff the situation as the student just mentioned is that in the genome we have 20,000 genes if you use this 0.01 as the p-value cutoff there are gonna be about 200 genes that will be called because you have so many genes and by chance you will have 200 genes with a p-value less than 0.01 and so even if you have no genes differentially expressed you will still call about 200 genes and probably all of them are wrong this is similar to a
situation like this today in the classroom I'm trying to ask how do students decide whether they are gonna sit in the first half of the classroom or the second half I could try different things I say okay let's first see do undergrads all sit in the front no how about people with glasses all sit in the front no do all male students sit in the front no do people who are from the US sit in the front or do people who wear sweaters sit in the front you know if you try enough times maybe we will see that oh all of those who have a sibling and ate noodles yesterday sit in the front we just don't know why but if you try twenty thousand different things maybe one of them will be a hit or some of them will indeed seem to be able to separate the students in the front versus the students in the back but that's just because I've tried many different options it's the same situation with looking at 20,000 genes and asking whether they are different
and so we need to make sure we control for multiple hypothesis testing and um one way to do this is to make sure that I don't wrongly call even a single gene and this is called the family-wise error rate so this is the probability of falsely rejecting at least one true null hypothesis at a certain cutoff which means the probability of not calling even one of them wrong is gonna be one minus this number so supposedly um if I want to control the family-wise error rate at 0.05 that would mean that for whatever number of genes I call the probability that even one of them is wrong is lower than 0.05 and then I have to use on each individual gene a much more stringent p-value okay so the way to do this and we won't go through the statistical derivation is to use this Bonferroni correction to really control the family-wise error rate if you have M hypotheses to test and you want to control so that whatever number of genes you call not even a single gene is wrongly called at this level of confidence say 0.05 confidence then for each individual gene the p-value cutoff needs to be alpha divided by this M which means in an RNA-seq experiment if the alpha is 0.05 and we have 20,000 genes to make the prediction the p-value cutoff would be 0.05 divided by 20,000 which is a very small number and so sometimes when you don't have many replicates or the data is fairly noisy you may not be able to see any genes called differentially expressed at this cutoff and so this is too conservative in terms of multiple hypothesis testing correction
and so what people use is this other concept called false discovery rate and so if we look at a table like this so the rows are the truth okay the truth is H0 is your null hypothesis you say in truth the two groups of samples are actually similar there's no difference in gene expression between them and in the alternative hypothesis you are saying that actually the two groups are different and then if you look at the columns here this indicates whether your statistical test is going to make a call to say they are different or make a call to say they are not different which means whether you reject or do not reject the null hypothesis and so what is U in this case so U is this group where for a particular gene the two groups are quite similar and you also did not call it as different what is this it's a true negative right this is a true negative that's U what is V it's a false positive right because these two groups are similar but you called them different that's a false positive T is a false negative right these two groups are different but you didn't call it that's a false negative and then S is a true positive right and so yes as we mentioned the false positives are what we call type 1 errors in statistics and the T which is the false negatives are the type 2 errors false discovery rate is V divided by R where R is basically all the things we call so we are trying to ask out of all the calls we make how many are false positives or what percentage are false positives that's the false discovery rate okay so
basically when you do an RNA-seq experiment you want to see some genes are different and even if there are some mistakes on occasional genes you can tolerate that supposedly you say I want to call whatever number of genes to be different as long as the total wrong calls are no more than 20% that's okay right if I call 200 genes and 20% are false that's 40 genes that's okay so 160 genes are still correct that's controlling the false discovery rate at 20% okay and so you can see the family-wise error rate is to make sure that out of those 200 genes not even a single one of them is wrong at this probability whereas FDR just says out of the 200 genes or whatever number of genes you call X percentage could be wrong you don't necessarily know which ones are wrong you just know roughly X percentage of them could be wrong and so
so the way to calculate this I stole this from StatQuest by Josh Starmer by the way in the course schedule you will see the lecture slides and the videos and at the bottom are other videos I found on YouTube I would say sometimes and probably most of the time some of these other videos are really quite good and you should watch those as well I especially like this StatQuest by Josh Starmer and so I just did a screen capture from his YouTube video to show you so um this is also part of homework one where we ask you to try this you use a random
number generator from say a uniform or some other distribution and you just generate some groups of samples to look at their difference let's just say supposedly you have a random number generator to generate numbers let's make it simple say the distribution you generate these numbers from is a normal distribution and you randomly generate ten numbers in group one and randomly generate ten numbers in group two and you calculate their differential expression using a t-test each comparison will give you a p-value and when you look at their difference most of the time you will not see a significant p-value but occasionally you will see a small p-value and if you look at the whole p-value distribution it should be uniform that's what p-value means because it's the same distribution you just randomly generate numbers and check whether they are different it should be a uniform distribution here right so from 0 to 1 every test has the same probability of hitting a particular p-value and this is all
noise but if indeed there are some genes in this experiment that are different because of the treatment it would mean that underlying there are two different distributions maybe one is normal and the other is drug treatment or one is normal and the other is a disease stage then when you are randomly generating numbers from the first distribution versus the second distribution and then test them and look at the p-value you probably see something like this right the p-values are much more skewed towards smaller values because indeed they are coming from different distributions right so the reality of the data we are getting is a mixture of the two because you are dealing with 20,000 genes in the genome and for most of the genes in the experimental condition I think we can make the safe assumption that anytime you treat a real live cell with a drug or something most of the genes are gonna stay roughly the same the genes that are changing are only a subset and so most of the genes will give you this type of noisy uniform distribution but only a subset of the genes a tiny portion of those genes will give you a true signal that's skewed towards the left but when you mix them together this is what you get if
you take a look at a particular experiment in fact um anytime you test a lot of hypotheses you might want to draw the p-value distribution to see whether your statistical test is good because if you normalize your data well and you've done your statistical test well you should see a p-value distribution like this okay what happens if after you finish an experiment and you run the statistical test and then look at the p-values of all the genes in an RNA-seq experiment you only see this what does that mean if you see something like this in the p-value distribution it would mean that your data is too noisy and there are no differentially expressed genes that are significant in your experiment okay
whereas very often we see something like this by the way if there are no differentially expressed genes the uniform distribution the y-axis should be at 1 and it should be quite even to cover this area but if there are real signals that some genes are truly differentially expressed you have more area on the small side and that's gonna take area away from the rest and so these parts may not really reach one it might be 0.9 or 0.95 or whatever right so this is a slightly lower number than one remember we're saying that this is kind of an addition like the sum of the noise and the signal therefore if we say we want to do a p-value cutoff at this first bar how much is the signal and how much is the noise we'd say actually roughly the noise alone will already give you this level of p-values below that bar right just by random chance even if the two distributions have no difference so these genes are gonna be the random noise whereas anything above this dotted line these are the real signals
and so what is the FDR if you cut off at this level so FDR is the false positives divided by all the things you call it would mean that if you are gonna call differential genes at this particular p-value cutoff you're gonna call everything to the left of this as differentially expressed right but out of all of these genes anything below the dotted line is random garbage from the noise and everything above it that's the real signal so the FDR would be this little area divided by the whole bar and so then if you were to cut at another p-value let's say cut off here the noise would be everything below this dotted line and you divide by all the genes that are called below this p-value which includes this bar and this bar together right so basically the intuition is at every particular p-value cutoff you know that the A part is coming from random noise and the B part is the real signal and the false discovery rate is roughly A divided by A plus B so A plus B is all the genes you're calling below that p-value and the potential false positives are estimated based on this lower dotted line which you actually derived based on the right side so that's why if you design a good
statistical test and if you capture the correct distribution of your data and you do the right statistical test um after you finish testing all of your genes the p-value distribution should give you something where the right side is pretty flat and you can extrapolate that flat level to the left to estimate roughly how much is your noise and the remaining ones you called are the real signals okay and so as we mentioned in this situation you just know roughly at each p-value cutoff how many genes you're gonna call and roughly out of those how many are false positives you don't know which one is the false positive you just know a percentage of them are gonna be fake okay that's FDR there is a one-to-one correspondence basically you can imagine at every p-value cutoff you can estimate the A divided by the A plus B
okay and so um the FDR value is always higher than the p-value because the p-value is whatever cutoff you have um but below that there is always something that is fake okay and so FDR is basically a less conservative way to do multiple hypothesis correction than the family-wise error rate there is a very widely used method called the Benjamini-Hochberg method that estimates this FDR you know the rough idea is like this you estimate by extrapolation from the right side all the things on the right give you the level of noise and this tells you how much truth you are gonna have you might see FDR you might see depending on the algorithm basically adjusted p-value and sometimes you might see q-value they all mean the same thing adjusted p-value is the multiple hypothesis testing adjusted p-value that's the same as FDR and q-value is all the same and as we mentioned p-value and FDR are always monotonic so every p-value has its corresponding FDR and it's also bigger than the p-value for the same gene so at the end when you run differential expression analysis for each gene based on the negative binomial distribution you'll get a p-value and by looking at all of those p-values together you also estimate the FDR and usually people take those FDR values and
so in order to really report genes some people take one percent FDR some people take five percent some people take ten percent at least when you say I'm reporting FDR the readers will understand that you have already done multiple hypothesis testing correction and sometimes in addition to the FDR people also filter by fold change say the fold change needs to be 1.2 or 1.5 or two fold but in general with this FDR estimate you can really get a sense of the signal to noise of what you're calling most of the time in reality people are comfortable with calling between say 50 genes and 3000 genes as different if after all of this you get three genes that are different you're kind of wondering okay maybe my experiment is too noisy you know something is wrong but if you're calling too many genes as different like if you see 5,000 genes as different maybe you want to use a more stringent FDR cutoff or you want to also add a filter for fold change okay um
so you can see basically um the level of noise you have in the data versus the signal depends on the setup supposedly if I have only 300 genes that are differentially expressed but I'm testing 20,000 genes in total then the signal is 300 out of all the 20,000 but if instead of 20,000 genes I only look at 5,000 genes then the noise level would be much much lower and the signal would be relatively higher and so is there a way we can reduce the multiple hypotheses we're testing if you have a lot of noise like if you have to test 10,000 genes well of course in human we don't have just 10,000 genes but suppose you have 10,000 noise genes versus 300 signal genes it's gonna be a lot harder to call but if you have only 5,000 genes and 300 signal it's easier to call but how do we reduce the number of hypotheses in this
case so algorithms such as DESeq will try to ignore a gene if it has too low an expression level in too few samples this gene is hopeless right what's the point of testing differential expression you will never actually reach p-value significance so DESeq has some simple heuristics to say oh if this gene is expressed say you have 6 or 10 samples and you only see a nonzero count in one of the samples and the remaining samples are all zero it's hopeless this will never reach p-value significance and so DESeq will use some simple heuristics to remove those genes that have too low an expression level in too few samples to reduce the total number of genes to test this way the amount of noise will be reduced and the signal will be higher okay yeah but at the end most people are most comfortable with usually a few hundred differentially expressed genes
questions about FDR
I would say yes because if you cut off at a 5% p-value and there are no differentially expressed genes do you know what your FDR is it's a hundred percent all of them are fake right but if out of those calls fifty percent are real and fifty percent are fake then you have 50 percent FDR at that p-value you can see here these are all of your fake ones and these are your correct ones right you are trying to estimate this fraction of fakes compared to the total calls and so FDR is always bigger than
the p-value okay yeah so I would say the best thing statistics has done to teach biologists is about multiple hypothesis testing with high throughput sequencing I would say before there was any high throughput experiment people didn't really care as much about statistics you know biologists just feel like oh if I see anything different I'll report it I have a paper that's great starting from gene expression microarrays when people started looking at thousands of genes or tens of thousands of genes they realized that you need to learn about multiple hypothesis testing and I think that's really something statistics helped biologists with and so basically after an RNA-seq experiment you use the DESeq negative binomial distribution to call the differential genes then use the FDR to estimate how many genes you really want to call so that of whatever number of genes you call maybe only 5% of those are wrong and the remaining ones are going to be correct okay