Introduction

This web tutorial is derived from 'A guide to genome-wide association analysis and post-analytic interrogation' (Statistics in Medicine, in review). The tutorial presents fundamental concepts and specific software tools for implementing a complete genome-wide association (GWA) analysis, as well as post-analytic visualization and interrogation of potentially novel findings. In this tutorial we use complete GWA data on 1401 individuals from the PennCATH study of coronary artery disease (CAD).

In the steps that follow, we begin by demonstrating how to download the necessary R packages and set global parameters, which also provide a means of saving progress while working through a GWA analysis. Next, we cover quality control steps for both SNP-level and sample-level filtering. The third section is split into principal component calculation, used to account for population stratification in statistical modeling, and imputation of non-typed SNPs using 1000 Genomes reference genotype data. We then demonstrate strategies for carrying out the GWA analysis on the typed data using basic linear modeling functionality in R, and on the imputed data using functionality contained within the snpStats package. Finally, we demonstrate means for post-analytic interrogation, including methods for evaluating the performance of statistical models and visualization of the global and subsetted GWAS output.

Configuring global parameters

We first isolate most of the variable parameters used in the data processing and analysis. Of particular note, users should set the location to which the GWA data sets needed to run the tutorial will be downloaded. The other variables here specify input and output file names.

To save progress, we enter save(..., file = working.data.fname(X)), listing the R objects we wish to save in place of ... and a number in place of X. This creates the file working.X.Rdata in out.dir. To save the whole workspace instead, we can enter save.image(file = working.data.fname(X)).
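The checkpointing pattern described above can be sketched as follows. This is a minimal illustration rather than part of the original tutorial code: out.dir is pointed at a temporary directory, the helper is restated so the block runs on its own (the tutorial defines it in the configuration code), and sample.ids and call.rates are hypothetical stand-ins for real analysis objects.

```r
# Stand-in for the tutorial's out.dir; a temporary directory keeps the sketch self-contained
out.dir <- tempdir()
working.data.fname <- function(num) { sprintf("%s/working.%s.Rdata", out.dir, num) }

# Hypothetical objects standing in for real analysis results
sample.ids <- c("10002", "10004", "10005")
call.rates <- c(0.99, 0.97, 0.98)

# Save only the named objects to working.1.Rdata
save(sample.ids, call.rates, file = working.data.fname(1))

# Later (for example, in a fresh session), restore them by name
rm(sample.ids, call.rates)
restored <- load(working.data.fname(1))   # load() returns the names of the restored objects

# save.image(file = working.data.fname(2))  # would instead write every object in the global workspace
```

Numbering the files by step makes it possible to resume from any completed step by loading the corresponding working.X.Rdata file.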

# Customize as needed for file locations

# Modify data.dir to indicate the location of the GWAStutorial files
# Intermediate data files will also be stored in this same location unless you set out.dir
data.dir <- '/Users/ericreed/Desktop/FoulkesLab/SIMFiles'

out.dir <- data.dir                     # may want to write to a separate dir to avoid clutter

# Download files
urlSupport <- "https://www.mtholyoke.edu/courses/afoulkes/Data/GWAStutorial/GWASTutorial_Files.zip"
zipSupport.fn <- sprintf("%s/GWAStutorial_Files.zip", data.dir) 

# Input files
gwas.fn <- lapply(c(bed='bed',bim='bim',fam='fam',gds='gds'), function(n) sprintf("%s/GWAStutorial.%s", data.dir, n))
clinical.fn <- sprintf("%s/GWAStutorial_clinical.csv", data.dir) 
onethou.fn <- lapply(c(info='info',ped='ped'), function(n) sprintf("%s/chr16_1000g_CEU.%s", data.dir, n))
protein.coding.coords.fname <- sprintf("%s/ProCodgene_coords.csv", data.dir)

# Output files
gwaa.fname <- sprintf("%s/GWAStutorialout.txt", out.dir)
gwaa.unadj.fname <- sprintf("%s/GWAStutorialoutUnadj.txt", out.dir)
impute.out.fname <- sprintf("%s/GWAStutorial_imputationOut.csv", out.dir)
CETP.fname <- sprintf("%s/CETP_GWASout.csv", out.dir)

# Working data saved between each code snippet so each can run independently.
# Use save(data, file=working.data.fname(num))
working.data.fname <- function(num) { sprintf("%s/working.%s.Rdata", out.dir, num) }
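
The lapply() calls above build named lists of file paths, so an individual file can be looked up by its extension. A minimal sketch of how such a list might be used, with a hypothetical data.dir value and a hypothetical missing.files check that is not part of the tutorial:

```r
# Hypothetical location; substitute the data.dir set above
data.dir <- "/path/to/GWAStutorial"

# Same construction as gwas.fn above: one path per file extension, named by extension
gwas.fn <- lapply(c(bed = 'bed', bim = 'bim', fam = 'fam', gds = 'gds'),
                  function(n) sprintf("%s/GWAStutorial.%s", data.dir, n))

# Elements are accessed by name
gwas.fn$bed
gwas.fn[["fam"]]

# Listing the inputs that are not on disk can catch a wrong data.dir early
missing.files <- names(gwas.fn)[!file.exists(unlist(gwas.fn))]
print(missing.files)
```

Keeping the paths in a named list means later steps can refer to gwas.fn$bed or gwas.fn$bim without repeating the directory.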

Installing necessary packages

This tutorial uses several packages available from Bioconductor, an open-source bioinformatics software repository. Of these, we make the most use of snpStats, which includes functions to read in various formats of genotype data and to carry out quality control, imputation and association analysis. SNPRelate is also used heavily and includes functions for sample-level quality control and computationally efficient principal component calculation. Other packages provide functionality for data visualization (LDheatmap, postgwas), data manipulation (plyr), statistical calculation (GenABEL), and parallel processing (doParallel).

# Run this once interactively to download and install Bioconductor packages and other packages.
# BiocManager replaces the retired biocLite() installer.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("snpStats", "SNPRelate", "rtracklayer", "biomaRt"))
install.packages(c('plyr', 'GenABEL', 'LDheatmap','doParallel', 'ggplot2', 'coin', 'igraph', 'devtools', 'downloader'))

library(devtools)
install_url("http://cran.r-project.org/src/contrib/Archive/postgwas/postgwas_1.11.tar.gz")
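
Because installation only needs to happen once, it can help to check which packages are already available before reinstalling. A minimal sketch using requireNamespace() from base R; the helper name missing.packages is hypothetical and not part of the tutorial, and note that requireNamespace() loads each package's namespace as a side effect:

```r
# Hypothetical helper: returns the subset of pkgs whose namespace cannot be loaded
missing.packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages named in this tutorial's installation step
tutorial.pkgs <- c('snpStats', 'SNPRelate', 'rtracklayer', 'biomaRt', 'plyr', 'GenABEL',
                   'LDheatmap', 'doParallel', 'ggplot2', 'coin', 'igraph', 'postgwas')

to.install <- missing.packages(tutorial.pkgs)
print(to.install)   # packages still to be installed; empty if everything is present
```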

Downloading support files

Next, we will download and unzip compressed data files that are necessary to run this tutorial.

# Download and unzip data needed for this tutorial
library(downloader)

download(urlSupport, zipSupport.fn)
unzip(zipSupport.fn, exdir = data.dir)
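
Since the archive only needs to be fetched once, re-running this step can be made cheaper by skipping the download when the zip file is already on disk. The following is a minimal sketch under stated assumptions: fetch.once is a hypothetical helper, and it uses base R's download.file() in place of the downloader package's download() used above.

```r
# Hypothetical helper: download 'url' to 'dest' only if 'dest' does not exist yet.
# 'fetch' defaults to base R's download.file; any function(url, destfile, mode) can be substituted.
# Returns (invisibly) TRUE if a download was performed and FALSE if it was skipped.
fetch.once <- function(url, dest, fetch = download.file) {
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (file.exists(dest)) return(invisible(FALSE))   # already downloaded; nothing to do
  fetch(url, dest, mode = "wb")                     # binary mode keeps the zip intact on Windows
  invisible(TRUE)
}

# Usage with the variables defined in the configuration section:
# fetch.once(urlSupport, zipSupport.fn)
# unzip(zipSupport.fn, exdir = data.dir)
```

One limitation of this design is that a partially downloaded file would also cause the download to be skipped, so an interrupted transfer should be deleted by hand before retrying.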