Unraveling GWAS: A Researcher's 15-Minute Guide, Part 2

Read the previous part: Unraveling GWAS: A Researcher's 15-Minute Guide, Part 1

Genotype Formats and Conversion Tools

Several genotype formats are commonly used in GWAS, including HapMap, VCF/BCF, PED/BED, and numerical formats. Software tools such as TASSEL, VCFtools, and GTOOL convert between these formats, and custom scripts can be written when a particular conversion is needed. Bringing every dataset into one consistent format is an essential part of data preparation and harmonization in GWAS analysis.
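
As an illustration of a custom conversion script, the sketch below turns VCF genotype calls into a numerical format (0, 1, or 2 copies of the alternate allele). It assumes biallelic variants and a tab-delimited VCF with a GT entry in the FORMAT column; the function names and the tiny in-memory example are hypothetical and are not taken from any of the tools named above.

```python
def gt_to_dosage(gt):
    """Convert a VCF GT string such as '0/1' or '1|1' to an alternate-allele count.

    Returns None for missing calls ('./.') and for any allele other than 0 or 1.
    """
    alleles = gt.replace("|", "/").split("/")
    if any(a not in ("0", "1") for a in alleles):
        return None
    return sum(int(a) for a in alleles)


def vcf_to_numeric(lines):
    """Yield (variant_name, [dosage per sample]) for each data line of a VCF."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, variant_id = fields[0], fields[1], fields[2]
        gt_index = fields[8].split(":").index("GT")
        name = variant_id if variant_id != "." else f"{chrom}:{pos}"
        dosages = [gt_to_dosage(s.split(":")[gt_index]) for s in fields[9:]]
        yield name, dosages


# Hypothetical three-sample VCF used only to exercise the functions.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "1\t2000\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0|1:12\t./.:0\t1|1:9",
]
numeric = dict(vcf_to_numeric(example_vcf))
```

For real datasets, the dedicated tools are usually the safer choice, since they handle multiallelic sites, compressed files, and format edge cases that this sketch ignores.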

Data Pre-processing

Before conducting a GWAS, the data must be pre-processed. Key steps include:

1. Call rate filtering: removing variants with too many missing genotypes across individuals, and removing individuals with too many missing genotypes across variants.
2. Heterozygosity filtering: removing variants or individuals whose genotype distributions deviate from those expected under Hardy-Weinberg equilibrium (HWE).
3. Imputation of untyped data: when reference haplotypes and LD structure resources are available, untyped genotypes can be imputed. In human GWAS, for example, HapMap and 1000 Genomes Project data can serve as reference panels for tools such as IMPUTE2, MACH, and BEAGLE.
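
The first two filters can be sketched for a single variant as follows. This is a minimal illustration, assuming genotypes coded as 0, 1, or 2 with None for missing calls, and using a one-degree-of-freedom chi-square test for HWE (exact tests are also widely used). The thresholds of 0.95 and 1e-6 are illustrative assumptions, not recommendations from this guide, and the function names are hypothetical.

```python
import math


def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype at this variant."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def hwe_p_value(genotypes):
    """Chi-square (1 df) P-value for deviation from Hardy-Weinberg proportions."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return 1.0
    observed = [called.count(k) for k in (0, 1, 2)]
    q = (observed[1] + 2 * observed[2]) / (2 * n)  # alternate allele frequency
    p = 1.0 - q
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(genotypes, min_call_rate=0.95, min_hwe_p=1e-6):
    """Apply the call rate filter first, then the HWE filter."""
    if call_rate(genotypes) < min_call_rate:
        return False
    return hwe_p_value(genotypes) >= min_hwe_p


# Hypothetical variants: one in HWE, one with no heterozygotes, one sparse.
balanced = [0] * 25 + [1] * 50 + [2] * 25
no_hets = [0] * 50 + [2] * 50
sparse = [0] * 30 + [1] * 40 + [2] * 10 + [None] * 20

results = {
    "balanced": passes_qc(balanced),
    "no_hets": passes_qc(no_hets),
    "sparse": passes_qc(sparse),
}
```

The same call rate function can be applied per individual by passing that individual's genotypes across all variants.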

Addressing Relatedness Among Individuals

As discussed earlier, two types of relatedness must be considered in GWAS: population structure and kinship. There are two common ways to control for population structure. The first is to infer subpopulation membership with software such as STRUCTURE or fastSTRUCTURE. The second is to run principal component analysis (PCA) on the genotype data and include the first few principal components (PCs), which are taken to represent population structure, as covariates in the mixed linear model.
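
The PCA approach can be sketched with NumPy as shown below. This is a simplified illustration, assuming a complete (already imputed) matrix of 0/1/2 genotypes with individuals in rows and variants in columns; it only centers each variant and does not apply the variance scaling or LD pruning that dedicated tools often use. The function name and the simulated two-population data are hypothetical.

```python
import numpy as np


def genotype_pcs(dosages, n_components=2):
    """Return PC scores per individual and the variance fraction per PC."""
    matrix = np.asarray(dosages, dtype=float)
    centered = matrix - matrix.mean(axis=0)  # center each variant
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Simulate two subpopulations with different allele frequencies.
rng = np.random.default_rng(0)
pop_a = rng.binomial(2, 0.2, size=(20, 200))
pop_b = rng.binomial(2, 0.8, size=(20, 200))
pcs, explained = genotype_pcs(np.vstack([pop_a, pop_b]))

# pcs[:, 0] separates the two simulated groups; its columns are the
# values that would be added to the association model as covariates.
```

How many PCs to include is a judgment call that depends on the dataset; the choice of two here is arbitrary.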

To control for kinship in a GWAS, a kinship matrix can be calculated from Identity by Descent (IBD) when pedigree information among individuals is available, as is common in animal studies. When no pedigree is available, kinship can instead be estimated from the marker data using Identity by State (IBS), for example with the methods described in VanRaden (2008).
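
One widely used marker-based formulation from VanRaden (2008), often referred to as the first method, computes the relationship matrix as G = ZZ' / (2 Σ p(1 - p)), where Z holds the genotypes centered by twice the allele frequency. The sketch below assumes a complete 0/1/2 genotype matrix with individuals in rows and estimates allele frequencies from the sample itself; the function name and the small example matrix are hypothetical.

```python
import numpy as np


def vanraden_kinship(dosages):
    """Genomic relationship matrix G = ZZ' / (2 * sum(p * (1 - p)))."""
    m = np.asarray(dosages, dtype=float)   # individuals x variants
    p = m.mean(axis=0) / 2.0               # allele frequency per variant
    z = m - 2.0 * p                        # center each variant by 2p
    scale = 2.0 * np.sum(p * (1.0 - p))
    return z @ z.T / scale


# Hypothetical data: individuals 0 and 1 carry identical genotypes.
example = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [1, 0, 1, 2],
]
kinship = vanraden_kinship(example)
```

In practice the matrix is built from thousands of variants, and monomorphic variants should be removed first so that the scaling term is not distorted.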

Software Tools for Conducting GWAS

Over the past 15 years, more than 120 tools have been developed for conducting GWAS. A partial list of these tools can be found at this page. GWAS tools fall into three groups:

1. Online tools: examples include easyGWAS, CyVerse DE, and GWAPP.
2. Local tools with graphical user interfaces (GUIs): TASSEL is one example.
3. Local tools that run from the command line: popular examples include PLINK, GEMMA, and GAPIT (an R package).

Output of a GWAS

Most GWAS tools produce tables listing the positions of variants significantly associated with the disease or phenotype, along with P-values, minor allele frequencies (MAF), and allele effects. Many also generate two plots for visualizing the results. The first is the Manhattan plot, which shows genomic variant positions on the horizontal axis and -log10(P) values on the vertical axis. Significant variants appear as peaks above a horizontal line marking the predetermined P-value threshold. The second is the quantile-quantile (Q-Q) plot, a scatter plot of the -log10(P) values expected under the null hypothesis of no association (horizontal axis) against the observed -log10(P) values (vertical axis). Points that rise above the diagonal across most of the range suggest inflated test statistics, for example from uncorrected population structure, so the Q-Q plot is useful for judging whether the chosen GWAS model is appropriate and where it could be improved.
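
The numbers behind both plots are simple to compute. The sketch below derives a significance line and the Q-Q coordinates from a list of P-values. It assumes a Bonferroni-corrected threshold, which is one common way to set the predetermined threshold but is not prescribed by this guide, and it uses (rank - 0.5) / n as the expected P-value, which is one of several plotting conventions. The function names and P-values are hypothetical.

```python
import math


def bonferroni_line(n_tests, alpha=0.05):
    """Height of the significance line on the -log10(P) axis."""
    return -math.log10(alpha / n_tests)


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs, sorted ascending."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((rank - 0.5) / n) for rank in range(1, n + 1))
    return list(zip(expected, observed))


# Hypothetical association results: variant name -> P-value.
p_by_variant = {"rs1": 0.5, "rs2": 0.01, "rs3": 1e-9, "rs4": 0.2}

line = bonferroni_line(len(p_by_variant))
significant = [v for v, p in p_by_variant.items() if -math.log10(p) > line]
points = qq_points(list(p_by_variant.values()))
```

Plotting `points` as a scatter plot against the diagonal gives the Q-Q plot, and plotting -log10(P) against genomic position with a horizontal line at `line` gives the Manhattan plot.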

Interpreting Significant Variants

A significant variant often indicates that a functional variant lies in the neighboring genomic region, typically within the same LD block. Pinpointing the exact functional variant or gene can be challenging: for a substantial proportion of the significant variants reported in GWAS, the functional variant has not been definitively identified. Functional variants are not exclusively protein-coding variants; they can also be regulatory variants that influence expression levels. For instance, some important functional variants discovered through GWAS lie in enhancer regions that affect gene expression.
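
A rough way to list the neighbors that share an LD block with a significant variant is to compute the squared correlation (r²) between genotype vectors. The sketch below assumes complete 0/1/2 genotypes for the same individuals at each variant and uses the squared Pearson correlation of unphased genotypes as an approximation of LD. The 0.8 cutoff, the function name, and the variant names are illustrative assumptions.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return sxy ** 2 / (sxx * syy)


# Hypothetical lead variant and three neighboring variants.
lead = [0, 1, 2, 1, 0, 2]
neighbors = {
    "snp_a": [0, 1, 2, 1, 0, 2],  # identical to the lead variant
    "snp_b": [2, 1, 0, 1, 2, 0],  # same signal, opposite allele coding
    "snp_c": [1, 1, 0, 2, 1, 1],  # weakly correlated
}

r2 = {name: genotype_r2(lead, g) for name, g in neighbors.items()}
in_ld = [name for name, value in r2.items() if value >= 0.8]
```

Every variant in `in_ld` is statistically hard to distinguish from the lead variant, which is why association alone rarely identifies the functional variant.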

Replication and Meta-analysis

As with other types of studies, GWAS results can be influenced by chance findings or artifacts. Replicating a finding in an independent GWAS substantially strengthens its credibility. For many important diseases and phenotypes, independent research groups have conducted multiple GWAS. Meta-analysis, a method of combining data from multiple GWAS datasets, can increase confidence in the findings and occasionally leads to novel discoveries.
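
One common way to combine studies is an inverse-variance weighted fixed-effect meta-analysis of per-variant summary statistics. The sketch below assumes each study reports an effect estimate and standard error for the same variant, coded for the same allele, and that the studies are independent; it is one approach among several and not necessarily the one a given consortium would use. The function name and the numbers are hypothetical.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Combine per-study effects with inverse-variance weights.

    Returns the pooled effect, its standard error, the Z score,
    and a two-sided P-value from the normal distribution.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p_value


# Hypothetical results for one variant from two independent studies.
study_betas = [0.10, 0.14]
study_ses = [0.04, 0.04]

pooled_beta, pooled_se, pooled_z, pooled_p = fixed_effect_meta(study_betas, study_ses)

# P-value of each study on its own, for comparison with the pooled result.
single_p = [math.erfc(abs(b / s) / math.sqrt(2.0)) for b, s in zip(study_betas, study_ses)]
```

Because the pooled standard error is smaller than that of either study, a consistent effect across studies yields a smaller P-value than any single study, which is how meta-analysis can surface associations that no individual GWAS detected.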

Read next: Unraveling GWAS: A Researcher's 15-Minute Guide, Part 3
