
Unraveling GWAS: A Researcher's 15-Minute Guide

Colleagues and clients have often told me how frustrated they are with existing reviews of genome-wide association studies (GWAS): too long, too complicated, and too technical. Recognizing the need for a concise, accessible introduction, I set out to write an article that covers everything essential for 98% of biological and biomedical researchers. The goal is a thorough yet easily digestible 2,000-word overview of GWAS that can be read in about 15 minutes.

Understanding GWAS: Origins and Current Status

GWAS is a research approach that identifies statistical associations between diseases or phenotypes and genetic variants spanning the entire genome. The first noteworthy GWAS methodologies emerged in the mid-1990s, and in 2002 Ozaki et al. published the first GWAS. Since then, GWAS research has grown substantially and has kept its momentum well into the 2020s, with no sign of waning popularity.

Comparison between Linkage Analysis and Association Analysis

GWAS grew out of two complementary approaches for identifying disease-associated genetic variants: linkage analysis and association analysis. Linkage analysis examines the co-segregation of a disease or phenotype with a genetic variant among family members. It either assesses existing pedigrees for naturally occurring disease-variant co-occurrences or, in animal and plant studies, crosses lines with controlled genetic backgrounds and analyzes the offspring for such associations. The statistical methods used in linkage analysis tend to be intricate and often require custom likelihood modeling.

In contrast, association analysis evaluates the relationship between a disease or phenotype and genetic variants based on the strength of linkage disequilibrium (LD) between the measured variant and a functional variant within a natural population. LD is the non-random association between alleles at two or more loci in a population. It tends to decay with genomic distance and so serves as a rough proxy for it, although recombination rate, mutation rate, and genetic drift also influence its strength. Association analysis relies on the assumption that the measured variant lies close to the functional variant responsible for the disease or phenotype. Unlike linkage analysis, association analysis uses a population of presumably unrelated individuals, which makes it easier to conduct. Its statistical methods are often elegant and more accessible than those used in linkage analysis.
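To make the idea of LD concrete, here is a minimal sketch that computes two common LD measures, D and r², for a pair of biallelic loci. The function name, the 0/1 allele encoding, and the toy haplotypes are all assumptions made for illustration; real analyses rely on dedicated tools and much larger samples.

```python
def ld_r2(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of (a, b) pairs, one per phased haplotype,
    where a and b are 0 or 1 (hypothetical encoding).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom

# Toy data (invented): alleles always travel together vs. only loosely
tight = [(1, 1)] * 5 + [(0, 0)] * 5
loose = [(1, 1)] * 3 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 3

print("tight r2:", round(ld_r2(tight)[1], 3))
print("loose r2:", round(ld_r2(loose)[1], 3))
```

An r² near 1 means the measured variant is a good stand-in for the other locus; an r² near 0 means it carries almost no information about it.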

During the late 1990s and early 2000s, candidate gene association studies gained popularity. These studies assessed the association between the condition of interest and individual genes hypothesized to underlie it. Subsequently, GWAS gained significant traction, driven primarily by high-throughput SNP genotyping arrays and, later, next-generation sequencing (NGS) technology. The resulting drop in the cost of genome-wide genotyping has propelled the widespread adoption of GWAS since the mid-2000s.

The Limitations of the Naïve Model

The naïve modeling approach for GWAS uses a simple t-test to test the null hypothesis that the mean phenotype is the same in two groups of individuals carrying different alleles. If the resulting p-value is less than 0.05, the null hypothesis is rejected and the variant is deemed significantly associated with the phenotype. However, this naïve method has two significant problems.
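The naïve test for a single variant can be sketched as follows. This is an illustration only: it uses Welch's form of the t statistic and a normal approximation for the p-value so that it runs with the Python standard library alone, and the function name, 0/1 allele encoding, and phenotype values are invented for the example. In practice a statistics package would supply an exact t-test.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def naive_allele_test(phenotype, genotype):
    """Welch t statistic and approximate two-sided p-value for one variant.

    genotype: 0/1 allele label per individual (hypothetical encoding).
    """
    g0 = [y for y, g in zip(phenotype, genotype) if g == 0]
    g1 = [y for y, g in zip(phenotype, genotype) if g == 1]
    se = sqrt(variance(g0) / len(g0) + variance(g1) / len(g1))
    t = (mean(g1) - mean(g0)) / se
    # Normal approximation to the t distribution; a real analysis with
    # samples this small would use the exact t distribution instead.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Invented toy data: five individuals per allele group
phenotype = [5.1, 4.9, 5.3, 5.0, 5.2, 6.0, 6.2, 5.9, 6.1, 6.3]
genotype = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

t, p = naive_allele_test(phenotype, genotype)
print(f"t = {t:.2f}, p = {p:.3g}, significant at 0.05: {p < 0.05}")
```

Running this once per variant, with 0.05 as the cutoff each time, is exactly the procedure whose weaknesses are described next.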

First, it fails to account for the multiple testing problem. A p-value cutoff of 0.05 may be acceptable when testing a single variant, but when hundreds of thousands or even millions of variants are tested simultaneously, the accumulation of false positives becomes unmanageable. Proper strategies to address the multiple testing problem will be discussed later in this article.
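The scale of the problem is easy to work out. The sketch below assumes that every tested variant is truly null and that the tests are independent, which is a simplification because variants in LD are correlated. The Bonferroni threshold is included only to give a sense of scale as one common correction; it is not necessarily the strategy recommended later in this series.

```python
def multiple_testing_summary(n_tests, alpha=0.05):
    """Illustrative arithmetic under the assumptions stated above."""
    expected_fp = n_tests * alpha          # expected number of false positives
    fwer = 1 - (1 - alpha) ** n_tests      # chance of at least one false positive
    bonferroni = alpha / n_tests           # per-test cutoff that keeps FWER <= alpha
    return expected_fp, fwer, bonferroni

for n_tests in (1, 100_000, 1_000_000):
    expected_fp, fwer, bonferroni = multiple_testing_summary(n_tests)
    print(f"{n_tests:>9} tests: ~{expected_fp:,.0f} expected false positives, "
          f"FWER = {fwer:.3f}, Bonferroni cutoff = {bonferroni:.1e}")
```

With one million tests, a 0.05 cutoff is expected to produce roughly 50,000 false positives, which is why a far stricter per-variant threshold is needed.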

The second issue with the naïve method is that it ignores relatedness among the individuals in the study. This is a harder problem to solve than multiple testing. Two types of relatedness must be considered: population structure and familial relationships, or kinship. Population structure refers to systematic differences in allele frequencies between subpopulations caused by geographic, climatic, or other factors. Uncontrolled population structure can lead to spurious (false positive) associations, which plagued the early stages of GWAS history. Similarly, uncontrolled kinship, which is represented by a kinship matrix, can undermine a GWAS. Methods to measure and address population structure and kinship will be discussed in later sections of this article.
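A small, fully invented example shows how population structure alone can create a spurious association. Two hypothetical subpopulations ("north" and "south") differ both in allele frequency and in average phenotype, but within each subpopulation the allele has no effect at all. All names and numbers here are assumptions chosen for illustration.

```python
from statistics import mean

def make(pop, allele, values):
    """Build (subpopulation, allele, phenotype) records."""
    return [(pop, allele, v) for v in values]

# Allele 1 is common in the north and rare in the south; the north also
# has a higher phenotype for reasons unrelated to this allele.
individuals = (
    make("north", 1, [9.0, 11.0] * 4) + make("north", 0, [9.0, 11.0])
    + make("south", 1, [5.0, 7.0]) + make("south", 0, [5.0, 7.0] * 4)
)

def allele_mean_difference(rows):
    """Mean phenotype of allele-1 carriers minus that of allele-0 carriers."""
    carriers = [y for _, a, y in rows if a == 1]
    non_carriers = [y for _, a, y in rows if a == 0]
    return mean(carriers) - mean(non_carriers)

pooled = allele_mean_difference(individuals)
within = {
    pop: allele_mean_difference([r for r in individuals if r[0] == pop])
    for pop in ("north", "south")
}

print("pooled difference:", round(pooled, 2))
print("within-subpopulation differences:", within)
```

The pooled comparison shows a clear difference between allele groups, while the comparison inside each subpopulation shows none: the apparent effect comes entirely from the mixture of subpopulations.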

The Mixed Linear Model Framework

The mixed linear model framework has become the most widely adopted analysis framework for conducting GWAS. It defines a linear model in which the variant and population structure are treated as fixed effects, while kinship enters through a random effect. By explicitly including population structure and kinship as terms in the model, this approach controls for relatedness between individuals while assessing the association between the phenotype and the variant. The publication of the mixed linear model framework by Yu et al. (2006) marked a significant milestone in GWAS methodology development. The vast majority of modern GWAS use this framework or an improved version of it.
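In symbols, the framework described above can be sketched roughly as follows. The notation is an assumption chosen for this illustration and is simplified; published formulations differ in details such as how the kinship matrix is scaled.

```latex
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{s}\,\alpha + \mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad
\mathbf{u} \sim N\!\left(\mathbf{0},\, \sigma_g^{2}\mathbf{K}\right),
\qquad
\mathbf{e} \sim N\!\left(\mathbf{0},\, \sigma_e^{2}\mathbf{I}\right)
\]
```

Here y is the vector of phenotypes, Xβ collects any other fixed covariates, s holds the genotypes of the variant being tested with fixed effect α, Q describes population structure with fixed effects v, u is a vector of random polygenic effects whose covariance is proportional to the kinship matrix K, Z maps individuals to those effects, and e is the residual. Testing a variant amounts to testing the null hypothesis α = 0 with the structure and kinship terms in place.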

Read next: Unraveling GWAS: A Researcher's 15-Minute Guide, Part 2



