Genome-Wide Association Study - An Explanation for Beginners
Summary
TLDR: In this video, Nuna Crevallo explains the concept of a Genome-Wide Association Study (GWAS). She describes how GWAS identifies associations between genetic variations, known as single nucleotide polymorphisms (SNPs), and specific traits like BMI. Crevallo outlines the process, which involves large-scale data collection, genotyping, regression analysis, and the use of statistical methods to determine significant genetic associations. She also explains the importance of correcting for false positives and how the findings can lead to a better understanding of genetic diseases and traits, potentially improving treatment and prevention.
Takeaways
- Genome-wide association studies (GWAS) investigate associations between genetic variations and specific physical traits.
- Single nucleotide polymorphisms (SNPs) are key genetic variations, where different nucleotides occur at the same position in the DNA.
- Humans share 99.9% of their genome; the remaining 0.1% accounts for the differences between individuals.
- GWAS requires large sample sizes, typically from thousands to hundreds of thousands of people, to detect genetic associations.
- The study focuses on a specific trait or phenotype, such as BMI, and analyzes genetic data across millions of SNPs.
- A regression analysis is performed for each SNP to see if the number of copies of a variant (allele) a person carries is linked to the physical trait being studied.
- The strength of association is measured by p-values, which indicate how likely an association at least that strong would be by chance alone if no real association existed.
- A Bonferroni correction is applied to adjust for false positives when testing millions of SNPs, using a stricter significance threshold.
- A Manhattan plot visually represents the results, with SNPs positioned along chromosomes and their negative log-transformed p-values shown.
- Identifying SNPs associated with traits can lead to discoveries of important biological mechanisms and improve disease treatment.
Q & A
What is a Genome-Wide Association Study (GWAS)?
-A Genome-Wide Association Study (GWAS) is a method used to discover associations between variations in our genetic code and certain physical traits.
What are single nucleotide polymorphisms (SNPs)?
-SNPs are variations at a single position in a DNA sequence among individuals. They are a key element in understanding the genetic causes for human traits.
Why is it important to understand the genetic makeup of humans?
-Understanding the genetic makeup is important to determine the extent to which certain traits are genetic and to identify biological mechanisms that affect those traits.
How does a GWAS help in understanding traits like BMI?
-A GWAS helps in understanding traits like BMI by looking for variants in DNA that might impact it, such as genes that affect metabolism.
What is the significance of having a large sample size in a GWAS?
-A large sample size, at least in the thousands and preferably in the hundreds of thousands, increases the statistical power of the study. Separately, recruiting a cohort of similar ethnicity helps to minimize confounding genetic variation.
What does genotyping involve in the context of a GWAS?
-In a GWAS, genotyping involves recording the nucleotides of each person at many known SNP locations, often over 2 million SNPs.
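To make the recorded genotypes usable in the later regression step, each person's two alleles at a SNP are usually turned into a count. The Python below is a minimal sketch of that coding, assuming a SNP with alleles A and T and counting T alleles as in the video's example; the function name `count_effect_alleles` is made up for illustration.

```python
def count_effect_alleles(genotype: str, effect_allele: str = "T") -> int:
    """Return how many copies of effect_allele a two-letter genotype carries."""
    if len(genotype) != 2:
        raise ValueError(f"expected a two-letter genotype, got {genotype!r}")
    return genotype.upper().count(effect_allele.upper())


# Hypothetical genotypes for four people at one SNP.
genotypes = ["AA", "AT", "TA", "TT"]
dosages = [count_effect_alleles(g) for g in genotypes]
print(dosages)  # [0, 1, 1, 2]
```

AA is coded as 0, AT (in either order) as 1, and TT as 2.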
How is the association between genetic data and a phenotype computed?
-The association between genetic data and a phenotype is computed using a genome-wide association analysis program, which performs regression analysis for every SNP in the dataset.
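As a minimal sketch of what one of these per-SNP regressions involves, the Python below fits a straight line to allele counts versus BMI for a single SNP and reports the slope (the effect size) and a p-value. The function name `snp_regression` and the toy data are made up for illustration, and the p-value uses a large-sample normal approximation, which is an assumption here and not necessarily the test that a program like PLINK applies.

```python
import math


def snp_regression(dosages, phenotypes):
    """Fit phenotype = intercept + slope * dosage; return (slope, p_value)."""
    n = len(dosages)
    if n < 3 or n != len(phenotypes):
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("every individual has the same allele count")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(dosages, phenotypes)
    )
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    if std_err == 0:
        return slope, 0.0  # perfect fit: no scatter around the line
    z = slope / std_err
    # Two-sided p-value under a normal approximation (assumption: large n).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return slope, p_value


# Toy data: number of T alleles (0, 1 or 2) and BMI for six people.
toy_dosages = [0, 1, 2, 0, 1, 2]
toy_bmi = [20.0, 22.0, 24.0, 22.0, 24.0, 26.0]
slope, p_value = snp_regression(toy_dosages, toy_bmi)
print(f"effect size = {slope:.2f}, p-value = {p_value:.4f}")
```

A GWAS program repeats a calculation of this kind once for every SNP in the dataset.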
What role does the software PLINK play in GWAS?
-PLINK is a commonly used software for GWAS that performs quality control filters on the genetic dataset and carries out the actual association analysis.
What is a p-value in the context of GWAS, and why is it important?
-A p-value measures the likelihood that an association at least as strong as the one found in the data would arise from random chance, given that there is no real association between the SNP and the phenotype. It is important for determining the statistical significance of the association between a SNP and a phenotype.
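One way to see what "due to random chance" means is a permutation test: shuffle the BMI values so that any real link to the SNP is destroyed, and count how often the shuffled data gives a slope as steep as the real one. The Python below is a sketch of that idea only; it is not described in the video and is not claimed to be how GWAS software computes p-values. The names `regression_slope` and `permutation_p_value` and the toy data are assumptions for illustration.

```python
import random


def regression_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("every individual has the same allele count")
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx


def permutation_p_value(dosages, phenotypes, n_permutations=2000, seed=0):
    """Share of shuffles whose slope is at least as steep as the observed one."""
    rng = random.Random(seed)
    observed = abs(regression_slope(dosages, phenotypes))
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real SNP-to-trait link
        if abs(regression_slope(dosages, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data: 30 people, BMI rising by about 2 per T allele.
toy_dosages = [i % 3 for i in range(30)]
toy_bmi = [20 + 2 * x + ((i * 7) % 5 - 2) * 0.2 for i, x in enumerate(toy_dosages)]
print(permutation_p_value(toy_dosages, toy_bmi))
```

When the real slope is far steeper than almost every shuffled slope, the p-value is small.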
What is a Bonferroni correction, and why is it used in GWAS?
-A Bonferroni correction is a statistical method used to adjust the threshold for significance when dealing with multiple comparisons, like in GWAS, to reduce the chance of false positives.
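The arithmetic behind the correction is short enough to show directly. The Python below is a small sketch: it divides the usual 0.05 threshold by the number of tests and also shows how many false positives would be expected without any correction. The figure of one million tests is an assumption chosen because it reproduces the 5 × 10^-8 threshold quoted in the video; the function name is made up for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Significance threshold after dividing alpha by the number of tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


alpha = 0.05
n_snps = 1_000_000  # assumed number of tests, for illustration only

threshold = bonferroni_threshold(alpha, n_snps)
expected_false_positives = alpha * n_snps  # if no SNP had any real effect

print(f"corrected threshold: {threshold:.1e}")
print(f"expected false positives at 0.05: {expected_false_positives:.0f}")
```

With these numbers the corrected threshold is 5 × 10^-8, and an uncorrected 0.05 threshold would let through roughly 50,000 SNPs by chance alone.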
How are the results of a GWAS typically visualized?
-The results of a GWAS are typically visualized using a Manhattan plot, which plots each SNP's position along the chromosome and its associated p-value after a negative log transformation.
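The coordinates of a Manhattan plot can be sketched without any plotting library. The Python below lays chromosomes end to end on the x-axis and applies the negative log (base 10) transformation on the y-axis. The function name `manhattan_points`, the chromosome lengths, and the SNP results are all invented for illustration.

```python
import math


def manhattan_points(results, chromosome_lengths):
    """Turn (chromosome, position, p_value) rows into (x, y) plot points.

    chromosome_lengths is a dict listed in plotting order; each chromosome
    starts where the previous one ends.
    """
    offsets = {}
    running_total = 0
    for chromosome, length in chromosome_lengths.items():
        offsets[chromosome] = running_total
        running_total += length
    return [
        (offsets[chromosome] + position, -math.log10(p_value))
        for chromosome, position, p_value in results
    ]


# Hypothetical, tiny "genome" and three SNP results.
lengths = {"1": 1000, "2": 800}
results = [("1", 100, 0.5), ("1", 900, 1e-9), ("2", 50, 0.01)]

significance_line = -math.log10(5e-8)  # about 7.3
for x, y in manhattan_points(results, lengths):
    label = "significant" if y > significance_line else "not significant"
    print(x, round(y, 2), label)
```

Smaller p-values become taller dots, so a SNP with a p-value below 5 × 10^-8 sits above the horizontal significance line.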
What is linkage disequilibrium, and how does it relate to GWAS?
-Linkage disequilibrium refers to the non-random association of alleles at different loci. In GWAS, significant SNPs often cluster together due to linkage disequilibrium, which can indicate the presence of important genes.
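The video leaves linkage disequilibrium out of scope, so the following is only a rough illustration of "non-random association". One simple way to quantify it, assumed here, is the squared correlation between allele counts at two SNPs across individuals: values near 1 mean the two SNPs almost always travel together. The function name `ld_r_squared` and the toy data are made up for illustration.

```python
def ld_r_squared(dosages_a, dosages_b):
    """Squared Pearson correlation between allele counts at two SNPs."""
    n = len(dosages_a)
    if n == 0 or n != len(dosages_b):
        raise ValueError("need two equally long, non-empty lists")
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a SNP with no variation has no defined correlation")
    return cov ** 2 / (var_a * var_b)


# Toy allele counts for six people at three SNPs.
snp_1 = [0, 1, 2, 0, 1, 2]
snp_2 = [0, 1, 2, 0, 1, 1]  # nearly always matches snp_1
snp_3 = [2, 0, 1, 1, 0, 2]  # unrelated pattern

print(round(ld_r_squared(snp_1, snp_2), 2))
print(round(ld_r_squared(snp_1, snp_3), 2))
```

If one SNP in a tightly correlated group is associated with the trait, its neighbours tend to show up as significant too, which is why hits appear in clusters.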
Outlines
Understanding Genome-Wide Association Studies
Nuna Crevallo introduces Genome-Wide Association Studies (GWAS), explaining that they are used to find links between genetic variations and physical traits. She starts by defining the human genome and its composition of nucleotides (adenine, thymine, guanine, and cytosine). She points out that only 0.1% of our DNA varies among individuals, which accounts for human diversity. Crevallo then describes Single Nucleotide Polymorphisms (SNPs) as locations in DNA where different nucleotides can occur in a significant portion of the population. These SNPs are crucial for understanding the genetic basis of traits. The process of GWAS involves collecting genetic data from a large group of people, genotyping them for millions of SNPs, and correlating this data with a trait of interest, such as BMI. Crevallo explains that association analysis is performed using software like PLINK, which conducts regression analysis for each SNP to determine if there's a relationship with the trait. How well the regression line predicts the data points determines the p-value, which is the likelihood of seeing such an association by random chance if the SNP had no real relationship with the trait. The analysis is computationally intensive but can be sped up with efficient processing techniques such as multi-threading.
Analyzing GWAS Results and Significance
In the second part, Crevallo discusses the analysis of GWAS results, focusing on p-values as a measure of statistical significance. She notes that a p-value threshold of 0.05 is commonly used to decide whether an association is significant and not due to chance. However, with millions of SNPs tested, thousands of false positives become likely, so a Bonferroni correction is applied, dividing the 0.05 threshold by the number of SNPs in the analysis. The field has adopted a threshold of 5 × 10^-8 to reduce false positives. Crevallo then explains how Manhattan plots are used to visualize the results of GWAS, plotting each SNP's chromosome position against its p-value after a negative log transformation. Significant SNPs are indicated by dots above the significance line on the plot. She mentions that clusters of significant SNPs may be due to linkage disequilibrium and can help identify gene locations associated with the phenotype of interest. Finally, Crevallo suggests that understanding these genetic associations can lead to improved prevention and treatment of genetic diseases, potentially saving many lives. She concludes by inviting suggestions for future topics and wishing the viewers a good day.
Keywords
Genome
Single Nucleotide Polymorphisms (SNPs)
Genome-Wide Association Study (GWAS)
Phenotype
P-value
Bonferroni Correction
Linkage Disequilibrium
Regression Analysis
Effect Size
Manhattan Plot
Highlights
Introduction to genome-wide association study (GWAS) and its importance in discovering associations between genetic variations and physical traits.
Explanation of the human genome consisting of 23 pairs of chromosomes and made up of nucleotides: adenine (A), thymine (T), guanine (G), and cytosine (C).
Humans share 99.9% of their genetic code, with only 0.1% responsible for individual diversity.
Introduction to single nucleotide polymorphisms (SNPs), which represent variations in a single nucleotide at specific positions in the genome.
SNPs play a key role in understanding the genetic basis of heritable traits and can reveal biological mechanisms affecting traits.
Large sample sizes, at least in the thousands and preferably in the hundreds of thousands, are needed for accurate GWAS results, especially for complex traits like BMI.
Ethnic homogeneity within the sample is critical to minimize confounding genetic variations.
Genotyping is the process of recording genetic information at millions of known SNPs for each individual in the study.
A genome-wide association analysis uses regression analysis to examine the relationship between SNP variants and traits such as BMI.
Regression analysis can account for the number of alleles (0, 1, or 2) for each SNP and their relationship to a phenotype like BMI.
P-values help determine the statistical significance of associations between SNPs and traits, with lower p-values indicating stronger associations.
A Bonferroni correction is used to adjust for multiple comparisons in GWAS, lowering the threshold for significance to reduce false positives.
The common threshold for GWAS significance is 5 × 10^-8, which helps minimize false positive associations.
Manhattan plots visually represent the results, with the x-axis showing SNP positions on chromosomes and the y-axis showing negative log-transformed p-values.
Clusters of significant SNPs often indicate regions of interest that may be linked to important genes affecting the trait being studied.
Transcripts
hello i am nuna crevallo
and today i'll be explaining what a
genome-wide association study
is the quick summary is that a
genome-wide association study or a
GWAS for short
is the discovery of associations between
certain variations in our genetic code
and a certain physical trait to explain
how all of this can be done
we must begin at the start with an
explanation of single nucleotide
polymorphisms
all of human's genetic code is called
the genome
in the nucleus of each cell the genome
is split between 23 pairs of chromosomes
our dna is made up of an extremely long
chain of connected single units called
nucleotides
which come in four forms adenine
thymine guanine and cytosine
these variations are often shortened to
a t g and c
respectively incredibly humans share
99.9 percent of their genetic makeup
meaning
only 0.1 percent of our entire dna is
responsible for the diversity we see
between individuals
so let's say we choose a spot along our
dna and find that 95 percent of people
have an a nucleotide here while the
remaining five percent have a t
nucleotide
each of these forms is called a variant
given that this location on human dna
can have multiple forms
it is dubbed a single nucleotide
polymorphism or a snip
for short
these snips are key in understanding the
genetic causes for human traits
while some traits such as what LA-based
NBA team you
prefer are almost entirely environmental
other traits like eye color
are extremely heritable snips can help
us understand to what extent certain
traits are genetic
as well as what biological mechanisms in
our body might be affecting those traits
to do this an association analysis can
be performed
let's pretend that we are trying to find
genetic associations with bmi
meaning we're trying to figure out what
variants in our dna might be impacting
bmi
such as genes that might increase or
decrease the efficiency of metabolism in
our body
we first need a large sample size of
volunteers at least in the thousands
or preferably closer to hundreds of
thousands
this cohort of people should ideally be
of identical ethnicities to minimize
confounding genetic variation
from other factors then each person
needs to be genotyped or have their
nucleotides recorded at many known snip
locations
often each person will have genetic
information recorded for over 2 million
snips
after that we need to record each
person's bmi once we have recorded the
genome and the trait or phenotype
of a large number of people an
association between the two can be
computed
the next steps involve using a
genome-wide association analysis program
a commonly used software for this is
called PLINK using the software
some quality control filters can be
performed on the genetic data set
removing any individuals or snips that
may have faulty information
next the actual association analysis is
done
for every snip in the data set a
regression analysis is performed with
each individual being a data point
for example let's say the program is
performing a regression analysis for
snip id number one
which has two possible alleles an a and
a t
each individual in the data set then has
the number of t alleles they have in
their genome for that snip plotted
against the physical trait of interest
in this case bmi keep in mind that human
dna contains a copy from our father
and a copy from our mother meaning that
a person's allele combinations for the
snip can be a
a a t or t t this would be coded as a 0
a 1 and a 2 respectively
once each individual is plotted the
program tries to draw a line to the data
that best predicts the relationship
between the number of alleles and the
phenotype
if there is no association between that
snip and bmi the regression line would
essentially be a horizontal line
however if an association is present you
can expect the regression line to have
some sort of slope
the effectiveness of the line at
predicting the data points determines
the p-value
the p-value is a measure of the
likelihood that the association found in
the distribution of data points was due
to random chance
given that there is no association
between the snip and bmi
this means the stronger the data points
cluster around a sloped regression line
the less likely it is that it is due to
random chance producing a small p-value
for each snip the p-value calculated is
recorded as well as the slope of the
regression line which is also known as
the effect size
this regression analysis repeats for
every single snip in the data set
when you are working with millions of
snips this will often take hours if not
days
for a computer to accomplish luckily
programs like PLINK can take advantage
of efficient processing such as
multi-threading
to finish the analysis quicker
furthermore these programs allow for the
addition of covariates which are other
traits that may affect the phenotype
that is of interest
so for example the amount of exercise
done weekly by a person
certainly has an effect on bmi so if
this information is present for the
people in the data set
the regression analysis can also account
for the effect of this covariate when it
computes the effect of the genotype
alone
once p-values have been calculated for
all snips how do we determine if an
association is considered significant or
not
once again this has to do with p-values
you might have heard before
of using five percent as a threshold for
statistical significance
this means that only results producing
p-values lower than .05
are considered to be due to a real
association present and not random
chance
however with a p-value of point zero
five you still have a five percent
chance of producing a false positive
meaning around five percent of your
significant results may still be because
of random chance
when you are working with millions of
snips you are likely to produce
thousands of false positives
which can muddle results and diminish
the statistical power of your study
so a bonferroni correction is performed
which transforms the threshold required
for achieving significance by taking the
typical .05 threshold
and dividing it by the number of snips
in the analysis
the quantitative genomics field has
adopted 5 times 10 to the negative 8 as
the default threshold for significance
this will sharply reduce how many snips
find a significant association with the
phenotype
but chances are that most of those snips
were false positives to begin with
to visualize these results a manhattan
plot is produced
a manhattan plot takes each snip and
plots its position along the chromosome
on the x-axis
and its associated p-value on the y-axis
the p-value undergoes a negative log
transformation to make it easier to read
essentially dots above this line
represent snips that are significantly
associated with bmi
often significant snips will be
clustered together due to linkage
disequilibrium
a topic beyond the scope of this video
but what is important is that these
clusters need to be analyzed using snip
databases that indicate what genes are
present for those regions
identifying gene locations associated
with the phenotype may uncover important
biological mechanisms that had not been
found before
with these findings the prevention and
treatment of certain genetic diseases
may improve drastically and hopefully
many lives can be saved in the future
thank you for watching this video if you
have any suggestions for a quantitative
genomics topic for me to explain
let me know in the comments i'm nuna
crevallo and i hope that you have a good
day