Genome-Wide Association Study - An Explanation for Beginners

Nuno Carvalho
13 Aug 2020 | 07:35

Summary

TLDR: In this video, Nuno Carvalho explains the concept of a genome-wide association study (GWAS), a method for identifying associations between genetic variations known as single nucleotide polymorphisms (SNPs) and specific traits such as BMI. The video outlines the process: large-scale data collection, genotyping, a regression analysis for each SNP, and statistical testing to decide which associations are significant. It also explains why the significance threshold must be corrected for multiple testing to limit false positives, and how the findings can improve the understanding of genetic diseases and traits, and potentially their treatment and prevention.

Takeaways

  • 🧬 Genome-wide association studies (GWAS) investigate associations between genetic variations and specific physical traits.
  • 🧪 Single nucleotide polymorphisms (SNPs) are key genetic variations, where different nucleotides occur at the same position in the DNA.
  • 👥 Humans share 99.9% of their genome; the remaining 0.1%, which includes SNPs, accounts for individual differences.
  • 🌐 GWAS requires large sample sizes, typically from thousands to hundreds of thousands of people, to detect genetic associations.
  • 🔄 The study focuses on a specific trait or phenotype, such as BMI, and analyzes genetic data across millions of SNPs.
  • 📊 A regression analysis is performed for each SNP to see whether the number of copies of a given allele (0, 1, or 2) a person carries is linked to the physical trait being studied.
  • 📉 Statistical significance is assessed with p-values: the probability of seeing an association at least this strong by random chance if the SNP had no real effect on the trait.
  • 🔢 A Bonferroni correction is applied to adjust for false positives when testing millions of SNPs, using a stricter significance threshold.
  • 🏙 A Manhattan plot visually represents the results, with SNPs positioned along the chromosomes on the x-axis and their negative-log-transformed p-values on the y-axis.
  • 🔍 Identifying SNPs associated with traits can lead to discoveries of important biological mechanisms and improve disease treatment.
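
The 0/1/2 allele coding mentioned in the takeaways can be illustrated with a minimal sketch. The genotypes, the function names `allele_dosage` and `allele_frequency`, and the choice of T as the counted allele are all assumptions made for illustration; they are not taken from the video or from any GWAS tool.

```python
def allele_dosage(genotype: str, counted_allele: str) -> int:
    """Return how many copies (0, 1, or 2) of counted_allele a genotype carries."""
    if len(genotype) != 2:
        raise ValueError("expected a two-letter genotype such as 'AT'")
    return genotype.count(counted_allele)


def allele_frequency(genotypes: list[str], allele: str) -> float:
    """Fraction of all chromosome copies in the sample that carry the allele."""
    total_copies = 2 * len(genotypes)  # each person has two copies of the SNP
    return sum(allele_dosage(g, allele) for g in genotypes) / total_copies


# Hypothetical genotypes for five people at one SNP with alleles A and T.
sample = ["AA", "AT", "TT", "AA", "TA"]
dosages = [allele_dosage(g, "T") for g in sample]

print(dosages)
print(allele_frequency(sample, "T"))
```

Each person contributes one number per SNP, and that number is the x-value used in the regression against the trait.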

Q & A

  • What is a Genome-Wide Association Study (GWAS)?

    -A Genome-Wide Association Study (GWAS) is a method used to discover associations between variations in our genetic code and certain physical traits.

  • What are single nucleotide polymorphisms (SNPs)?

    -SNPs are variations at a single position in a DNA sequence among individuals. They are a key element in understanding the genetic causes for human traits.

  • Why is it important to understand the genetic makeup of humans?

    -Understanding the genetic makeup is important to determine the extent to which certain traits are genetic and to identify biological mechanisms that affect those traits.

  • How does a GWAS help in understanding traits like BMI?

    -A GWAS helps in understanding traits like BMI by looking for variants in DNA that might affect it, such as variants in genes that increase or decrease the efficiency of metabolism.

  • What is the significance of having a large sample size in a GWAS?

    -A large sample size, at least in the thousands and preferably in the hundreds of thousands, increases the statistical power of the study. Separately, the video recommends a cohort of similar ancestry to minimize confounding genetic variation.

  • What does genotyping involve in the context of a GWAS?

    -In a GWAS, genotyping involves recording the nucleotides of each person at many known SNP locations, often over 2 million SNPs.

  • How is the association between genetic data and a phenotype computed?

    -The association between genetic data and a phenotype is computed using a genome-wide association analysis program, which performs regression analysis for every SNP in the dataset.

  • What role does the software PLINK play in GWAS?

    -PLINK is a commonly used software for GWAS that performs quality control filters on the genetic dataset and carries out the actual association analysis.

  • What is a p-value in the context of GWAS, and why is it important?

    -A p-value is the probability that an association at least as strong as the one observed would arise by random chance, assuming there is no real association between the SNP and the phenotype. It is used to decide whether the association between a SNP and a phenotype is statistically significant.

  • What is a Bonferroni correction, and why is it used in GWAS?

    -A Bonferroni correction is a statistical method used to adjust the threshold for significance when dealing with multiple comparisons, like in GWAS, to reduce the chance of false positives.

  • How are the results of a GWAS typically visualized?

    -The results of a GWAS are typically visualized using a Manhattan plot, which plots each SNP's position along the chromosome and its associated p-value after a negative log transformation.

  • What is linkage disequilibrium, and how does it relate to GWAS?

    -Linkage disequilibrium refers to the non-random association of alleles at different loci. In GWAS, significant SNPs often cluster together due to linkage disequilibrium, which can indicate the presence of important genes.

Outlines

00:00

🧬 Understanding Genome-Wide Association Studies

Nuno Carvalho introduces genome-wide association studies (GWAS), which are used to find links between genetic variations and physical traits. The video starts by defining the human genome and its composition of nucleotides (adenine, thymine, guanine, and cytosine), and points out that only 0.1% of our DNA varies among individuals, which accounts for human diversity. Single nucleotide polymorphisms (SNPs) are then described as locations in DNA where different nucleotides occur in a meaningful portion of the population (in the video's example, 95% of people carry an A and 5% a T). These SNPs are crucial for understanding the genetic basis of traits. A GWAS involves collecting genetic data from a large group of people, genotyping them at millions of SNPs, and relating this data to a trait of interest, such as BMI. The association analysis is performed with software such as PLINK, which runs a regression for each SNP to determine whether it is related to the trait. How well the regression line fits the data determines the p-value: the probability that an association at least this strong would appear by random chance if the SNP had no real effect on the trait. The analysis is computationally intensive but can be sped up with techniques such as multithreading.

05:02

📊 Analyzing GWAS Results and Significance

The second part of the video covers the analysis of GWAS results, focusing on p-values as a measure of statistical significance. A threshold of 0.05 is commonly used to call an association significant, but with millions of SNPs tested, that threshold would produce thousands of false positives. A Bonferroni correction addresses this by dividing the 0.05 threshold by the number of SNPs tested; the field has adopted 5 × 10^-8 as its default threshold. Manhattan plots are then introduced as the way to visualize GWAS results, plotting each SNP's chromosomal position against its p-value after a negative log transformation. Dots above the significance line mark significant SNPs. Clusters of significant SNPs often arise from linkage disequilibrium and can help identify gene locations associated with the phenotype of interest. Finally, the video suggests that understanding these genetic associations can improve the prevention and treatment of genetic diseases, potentially saving many lives, and closes by inviting suggestions for future topics.
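
The arithmetic behind the correction is short enough to sketch. The function names are hypothetical, and the 2 million figure is the video's "over 2 million SNPs"; the expected count assumes, for illustration only, that none of the SNPs is truly associated with the trait.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of chance hits if no test has a real effect."""
    return alpha * n_tests


n_snps = 2_000_000  # the video's "over 2 million SNPs"

print(f"expected false positives at 0.05: {expected_false_positives(0.05, n_snps):,.0f}")
print(f"Bonferroni threshold for {n_snps:,} SNPs: {bonferroni_threshold(0.05, n_snps):.1e}")
print(f"Bonferroni threshold for 1,000,000 tests: {bonferroni_threshold(0.05, 1_000_000):.1e}")
```

The last line shows that the conventional 5 × 10^-8 threshold is what 0.05 divided by one million tests gives.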

Keywords

💡Genome

A genome is the complete set of genetic material present in an organism. In the video, it refers to all of the DNA contained in human cells, which includes 23 pairs of chromosomes. Understanding the genome is fundamental to identifying variations in DNA that affect physical traits.

💡Single Nucleotide Polymorphisms (SNPs)

SNPs are variations at a single position in the DNA sequence among individuals. In the video, SNPs are highlighted as key markers in the study of genetic traits, such as BMI, as they help researchers pinpoint genetic differences that could influence specific traits.

💡Genome-Wide Association Study (GWAS)

A GWAS is a study that examines the entire genome of multiple individuals to find genetic variations associated with a particular trait. In the video, the term refers to the method used to discover connections between genetic differences (SNPs) and physical traits like BMI.

💡Phenotype

A phenotype is the observable physical trait or characteristic of an organism, such as eye color or BMI. The video discusses how phenotypes are analyzed in a GWAS to find genetic links to these traits by comparing them with SNP data.

💡P-value

A p-value is the probability of observing an association between a SNP and a trait at least as strong as the one found, assuming there is no real association. In the video, p-values are used to assess the significance of the GWAS results, with smaller p-values indicating stronger evidence for a true genetic association.

💡Bonferroni Correction

The Bonferroni correction is a statistical method used to adjust the threshold for significance in studies with multiple comparisons to reduce false positives. In the video, this correction is applied to handle the large number of SNPs analyzed in GWAS, ensuring more reliable results.

💡Linkage Disequilibrium

Linkage disequilibrium refers to the non-random association of alleles at different loci in a population. While not explained in depth in the video, it is mentioned as a factor influencing the clustering of significant SNPs in genetic studies, helping researchers identify which genes are associated with traits.

💡Regression Analysis

A regression analysis is a statistical process used to determine the relationship between variables. In GWAS, as explained in the video, it is used to analyze the relationship between SNPs (independent variables) and a trait like BMI (dependent variable), predicting how changes in SNPs affect the trait.

💡Effect Size

Effect size refers to the magnitude of the relationship between a SNP and a trait in a GWAS. In the video, effect size is discussed in relation to the slope of the regression line, indicating the strength of the association between SNP variants and traits like BMI.
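
Written out, the regression for a single SNP and its slope look as follows. The notation is an assumption chosen for this sketch and does not come from the video: y_i is person i's trait value, g_i is that person's allele count, and bars denote sample means.

```latex
y_i = \beta_0 + \beta_1 g_i + \varepsilon_i, \qquad g_i \in \{0, 1, 2\}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (g_i - \bar{g})(y_i - \bar{y})}{\sum_{i=1}^{n} (g_i - \bar{g})^2}

H_0 : \beta_1 = 0
```

The estimate of \beta_1 is the effect size (the change in the trait per additional allele copy), and the p-value tests the null hypothesis H_0 that the true slope is zero, which corresponds to the horizontal regression line described in the video.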

💡Manhattan Plot

A Manhattan plot is a graphical representation of GWAS results, where each SNP is plotted based on its chromosomal location and its significance (p-value). In the video, this plot helps visualize which SNPs are significantly associated with a trait, with higher points indicating stronger associations.
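
The y-axis transformation and the significance line can be sketched as below. The SNP records are hypothetical, the function name is an assumption, and the sketch assumes the common base-10 logarithm, since the video only says "negative log transformation". No plotting library is used; the code computes the heights that would be plotted.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # the default threshold quoted in the video


def manhattan_height(p_value: float) -> float:
    """Negative base-10 log of a p-value; smaller p-values plot higher."""
    return -math.log10(p_value)


# Hypothetical results: (SNP id, chromosome, base-pair position, p-value).
results = [
    ("snp1", 1, 1_200_000, 0.3),
    ("snp2", 1, 5_400_000, 2e-9),
    ("snp3", 2, 800_000, 4e-5),
]

line_height = manhattan_height(GENOME_WIDE_THRESHOLD)
significant = [
    snp_id
    for snp_id, _chromosome, _position, p_value in results
    if manhattan_height(p_value) > line_height
]

print(f"significance line at y = {line_height:.2f}")
print("SNPs above the line:", significant)
```

Because of the transformation, a p-value of 5 × 10^-8 sits at a height of about 7.3, and every tenfold drop in the p-value raises the dot by one unit.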

Highlights

Introduction to genome-wide association study (GWAS) and its importance in discovering associations between genetic variations and physical traits.

Explanation of the human genome consisting of 23 pairs of chromosomes and made up of nucleotides: adenine (A), thymine (T), guanine (G), and cytosine (C).

Humans share 99.9% of their genetic code, with only 0.1% responsible for individual diversity.

Introduction to single nucleotide polymorphisms (SNPs), which represent variations in a single nucleotide at specific positions in the genome.

SNPs play a key role in understanding the genetic basis of heritable traits and can reveal biological mechanisms affecting traits.

Large sample sizes, at least in the thousands and preferably in the hundreds of thousands, are needed to detect associations with traits like BMI.

Ideally, the cohort shares a similar ancestry, which minimizes confounding genetic variation.

Genotyping is the process of recording genetic information at millions of known SNPs for each individual in the study.

A genome-wide association analysis uses regression analysis to examine the relationship between SNP variants and traits such as BMI.

Regression analysis can account for the number of alleles (0, 1, or 2) for each SNP and their relationship to a phenotype like BMI.

P-values help determine the statistical significance of associations between SNPs and traits, with lower p-values indicating stronger associations.

A Bonferroni correction is used to adjust for multiple comparisons in GWAS, lowering the threshold for significance to reduce false positives.

The common threshold for GWAS significance is 5 × 10^-8, which helps minimize false positive associations.

Manhattan plots visualize the results for every SNP, with the x-axis showing SNP positions on chromosomes and the y-axis showing negative-log-transformed p-values; dots above the threshold line are significant.

Clusters of significant SNPs often indicate regions of interest that may be linked to important genes affecting the trait being studied.

Transcripts

play00:01

hello i am nuno carvalho

play00:03

and today i'll be explaining what a

play00:05

genome-wide association study

play00:07

is the quick summary is that a

play00:12

genome-wide association study or a

play00:14

gwas for short

play00:15

is the discovery of associations between

play00:17

certain variations in our genetic code

play00:19

and a certain physical trait to explain

play00:21

how all of this can be done

play00:23

we must begin at the start with an

play00:25

explanation of single nucleotide

play00:27

polymorphisms

play00:29

all of human's genetic code is called

play00:31

the genome

play00:33

in the nucleus of each cell the genome

play00:35

is split between 23 pairs of chromosomes

play00:38

our dna is made up of an extremely long

play00:40

chain of connected single units called

play00:42

nucleotides

play00:43

which come in four forms adenine

play00:46

thymine guanine and cytosine

play00:49

these variations are often shortened to

play00:51

a t g and c

play00:53

respectively incredibly humans share

play00:56

99.9 percent of their genetic makeup

play00:59

meaning

play00:59

only 0.1 percent of our entire dna is

play01:02

responsible for the diversity we see

play01:04

between individuals

play01:06

so let's say we choose a spot along our

play01:08

dna and find that 95 percent of people

play01:10

have an a nucleotide here while the

play01:13

remaining five percent have a t

play01:14

nucleotide

play01:16

each of these forms is called a variant

play01:18

given that this location on human dna

play01:21

can have multiple forms

play01:22

it is dubbed a single nucleotide

play01:24

polymorphism or a snp

play01:26

for short

play01:30

these snps are key in understanding the

play01:32

genetic causes for human traits

play01:34

while some traits such as what la-based

play01:36

nba team

play01:37

prefer are almost entirely environmental

play01:40

other traits like eye color

play01:41

are extremely heritable snps can help

play01:44

us understand to what extent certain

play01:46

traits are genetic

play01:47

as well as what biological mechanisms in

play01:49

our body might be affecting those traits

play01:52

to do this an association analysis can

play01:54

be performed

play01:57

let's pretend that we are trying to find

play01:59

genetic associations with bmi

play02:01

meaning we're trying to figure out what

play02:02

variants in our dna might be impacting

play02:04

bmi

play02:05

such as genes that might increase or

play02:07

decrease the efficiency of metabolism in

play02:09

our body

play02:10

we first need a large sample size of

play02:11

volunteers at least in the thousands

play02:13

or preferably closer to hundreds of

play02:15

thousands

play02:16

this cohort of people should ideally be

play02:18

of identical ethnicities to minimize

play02:20

confounding genetic variation

play02:22

from other factors then each person

play02:25

needs to be genotyped or have their

play02:26

nucleotides recorded at many known snp

play02:29

locations

play02:30

often each person will have genetic

play02:32

information recorded for over 2 million

play02:33

snps

play02:34

after that we need to record each

play02:36

person's bmi once we have recorded the

play02:38

genome and the trait or phenotype

play02:41

of a large number of people an

play02:42

association between the two can be

play02:44

computed

play02:46

the next steps involve using a

play02:47

genome-wide association analysis program

play02:50

a commonly used software for this is

play02:52

called plink using the software

play02:54

some quality control filters can be

play02:56

performed on the genetic data set

play02:58

removing any individuals or snps that

play03:00

may have faulty information

play03:02

next the actual association analysis is

play03:05

done

play03:06

for every snp in the data set a

play03:07

regression analysis is performed with

play03:09

each individual being a data point

play03:11

for example let's say the program is

play03:13

performing a regression analysis for

play03:15

snp id number one

play03:16

which has two possible alleles an a and

play03:18

a t

play03:19

each individual in the data set then has

play03:21

the number of t alleles they have in

play03:23

their genome for that snp plotted

play03:24

against the physical trait of interest

play03:26

in this case bmi keep in mind that human

play03:29

dna contains a copy from our father

play03:31

and a copy from our mother meaning that

play03:33

a person's allele combinations for the

play03:35

snp can be a

play03:36

a a t or t t this would be coded as a 0

play03:40

a 1 and a 2 respectively

play03:45

once each individual is plotted the

play03:47

program tries to draw a line to the data

play03:49

that best predicts the relationship

play03:50

between the number of alleles and the

play03:52

phenotype

play03:53

if there is no association between that

play03:55

snp and bmi the regression line would

play03:57

essentially be a horizontal line

play03:59

however if an association is present you

play04:01

can expect the regression line to have

play04:03

some sort of slope

play04:04

the effectiveness of the line at

play04:06

predicting the data points determines

play04:07

the p-value

play04:08

the p-value is a measure of the

play04:10

likelihood that the association found in

play04:12

the distribution of data points was due

play04:14

to random chance

play04:15

given that there is no association

play04:16

between the snp and bmi

play04:18

this means the stronger the data points

play04:20

cluster around a sloped regression line

play04:22

the less likely it is that it is due to

play04:24

random chance producing a small p-value

play04:27

for each snp the p-value calculated is

play04:29

recorded as well as the slope of the

play04:31

regression line which is also known as

play04:33

the effect size

play04:36

this regression analysis repeats for

play04:38

every single snp in the data set

play04:40

when you are working with millions of

play04:42

snps this will often take hours if not

play04:44

days

play04:44

for a computer to accomplish luckily

play04:46

programs like plink can take advantage

play04:48

of efficient processing such as

play04:50

multi-threading

play04:50

to finish the analysis quicker

play04:52

furthermore these programs allow for the

play04:54

addition of covariates which are other

play04:56

traits that may affect the phenotype

play04:58

that is of interest

play04:59

so for example the amount of exercise

play05:01

done weekly by a person

play05:03

certainly has an effect on bmi so if

play05:06

this information is present for the

play05:07

people in the data set

play05:08

the regression analysis can also account

play05:10

for the effect of this covariate when it

play05:12

computes the effect of the genotype

play05:13

alone

play05:16

once p-values have been calculated for

play05:18

all snps how do we determine if an

play05:20

association is considered significant or

play05:22

not

play05:23

once again this has to do with p-values

play05:25

you might have heard before

play05:26

of using five percent as a threshold for

play05:28

statistical significance

play05:30

this means that only results producing

play05:32

p-values lower than .05

play05:34

are considered to be due to a real

play05:36

association present and not random

play05:38

chance

play05:39

however with a p-value of point zero

play05:41

five you still have a five percent

play05:42

chance of producing a false positive

play05:44

meaning around five percent of your

play05:46

significant results may still be because

play05:48

of random chance

play05:49

when you are working with millions of

play05:51

snps you are likely to produce

play05:52

thousands of false positives

play05:54

which can muddle results and diminish

play05:56

the statistical power of your study

play05:58

so a bonferroni correction is performed

play06:01

which transforms the threshold required

play06:02

for achieving significance by taking the

play06:04

typical .05 threshold

play06:06

and dividing it by the number of snps

play06:08

in the analysis

play06:10

the quantitative genomics field has

play06:12

adopted 5 times 10 to the negative 8 as

play06:14

the default threshold for significance

play06:16

this will sharply reduce how many snps

play06:18

find a significant association with the

play06:20

phenotype

play06:21

but chances are that most of those snps

play06:23

were false positives to begin with

play06:26

to visualize these results a manhattan

play06:29

plot is produced

play06:30

a manhattan plot takes each snp and

play06:32

plots its position along the chromosome

play06:34

on the x-axis

play06:35

and its associated p-value on the y-axis

play06:38

the p-value undergoes a negative log

play06:40

transformation to make it easier to read

play06:43

essentially dots above this line

play06:45

represent snps that are significantly

play06:47

associated with bmi

play06:49

often significant snps will be

play06:51

clustered together due to linkage

play06:53

disequilibrium

play06:54

a topic beyond the scope of this video

play06:58

but what is important is that these

play07:00

clusters need to be analyzed using snp

play07:02

databases that indicate what genes are

play07:04

present for those regions

play07:05

identifying gene locations associated

play07:07

with the phenotype may uncover important

play07:09

biological mechanisms that had not been

play07:11

found before

play07:13

with these findings the prevention and

play07:14

treatment of certain genetic diseases

play07:16

may improve drastically and hopefully

play07:18

many lives can be saved in the future

play07:23

thank you for watching this video if you

play07:25

have any suggestions for a quantitative

play07:27

genomics topic for me to explain

play07:29

let me know in the comments i'm nuno

play07:31

carvalho and i hope that you have a good

play07:33

day
