Genome-wide association analysis - Association analysis of typed SNPs using parallel processing

(This lab was adapted for statsTeachR by Sara Nunez, Nicholas Reich and Andrea Foulkes from our GWAS tutorial.)

Now that our data is loaded, filtered, and additional SNP genotypes have been imputed, we are ready to perform the association analysis on the genotyped SNPs. This involves regressing each SNP separately on a given trait, adjusted for clinical, environmental, and demographic factors. Due to the large number of SNPs and the generally uncharacterized relationships to the outcome, a simple single additive model will be employed.

Getting started

Here, we utilize a function we’ve written (GWAA()) to conduct the analysis. To prepare for it, we need to load the data saved from last module and install necessary packages like GenABEL and doParallel.

load("m2_lab2_save.RData")  
library(snpStats)
library(plyr)
library(GenABEL)
library(doParallel)

Data preparation

The GWAA function requires two arguments. The genodata argument should specify the entire genotype data object in SnpMatrix format. The phenodata argument should be a data frame with a column of sample IDs, corresponding to the row names of genodata, followed by columns for the outcome variable and covariates to be included in the model. In the following code, we assign the genodata variable and write out a table called GWAA.txt for the results to later be written to.

genodata <- genotype
#Print the number of SNPs to be checked
cat(paste(ncol(genodata), " SNPs included in analysis.\n"))
## 381257  SNPs included in analysis.
#create text file for GWAA output to be written to
columns<-c("SNP", "Estimate", "Std.Error", "t-value", "p-value")
write.table(t(columns), "GWAA.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)

Next we focus on the phenodata argument. First we assign phenoSub to be phenodata, which is created from merging the clinical and pcs data. We then create the phenotype value (outcome variable) by using a normal rank-transformation of the hdl variable, which we show is necessary later on. Next we rename the FamID column, remove anyone that doesn’t have HDL data available, and subset the genotype data for the remaining individuals. Lastly, we want to remove variables that won’t be used in analysis, e.g. the remaining outcome variables and the untransformed version of HDL.

phenoSub <- merge(clinical,pcs)     
phenoSub$phenotype <- rntransform(as.numeric(phenoSub$hdl), family="gaussian")
phenoSub <- rename(phenoSub, replace=c(FamID="id"))
phenoSub<-phenoSub[!is.na(phenoSub$phenotype),]
phenodata <- phenoSub
genodata <- genodata[as.character(phenodata$id),]
cat(nrow(genodata), "samples included in analysis.\n")
## 1308 samples included in analysis.
phenodata$hdl <- NULL
phenodata$ldl <- NULL
phenodata$tg <- NULL
phenodata$CAD <- NULL
print(head(phenodata))
##      id sex age        pc1          pc2          pc3          pc4
## 2 10004   2  50 0.01263937  0.007136085 -0.009394939 -0.029605670
## 3 10005   1  55 0.01656767 -0.007822581  0.034279450 -0.008364165
## 4 10007   1  52 0.01179249 -0.001340544  0.014388948  0.002448683
## 5 10008   1  58 0.01587414  0.003683702  0.013407115 -0.002069099
## 6 10009   1  59 0.01342526  0.002620955  0.005132100 -0.044515366
## 7 10010   1  54 0.01627605  0.011860496  0.018480130 -0.014176319
##            pc5          pc6          pc7          pc8          pc9
## 2  0.017640584  0.062301372 -0.005341143  0.004102591  0.040899908
## 3  0.043822928 -0.002068763  0.015205957 -0.011775232  0.005064646
## 4  0.008521085  0.017414424  0.015268401  0.037495537  0.016946863
## 5  0.009863999  0.049090388  0.007638991  0.010654394 -0.013350077
## 6  0.043797908  0.012536302  0.045885623 -0.033450390 -0.045956099
## 7 -0.003986151 -0.094898367  0.004855911 -0.014761558  0.018721010
##           pc10  phenotype
## 2  0.003655463 -2.2874211
## 3  0.003447803 -0.4742508
## 4  0.010266512  0.8850182
## 5  0.007254919 -0.1636196
## 6  0.009781019  0.3952471
## 7 -0.019222291  1.7105947

Exercise

How many samples are left after removing samples with missing HDL data?

To show that the normally transformed version of HDL should be used for analysis, the following code can be used to plot the variable before and after transformation.

par(mfrow=c(1,2))
hist(as.numeric(phenoSub$hdl), main=NULL, xlab="HDL")
hist(phenodata$phenotype, main=NULL, xlab="Transformed HDL")

The GWAA function

The function itself has four major elements. Given the data described above, it (1) determines how many cores to use/which method to employ for parallel processing (depending on specified cores and operating system), (2) converts the genotype data into the necessary format for an additive model, (3) fits a linear model for each SNP and covariates (one group at a time) and then (4) writes the SNP coefficient details (the estimate, standard error, test statistic and p value) from each model out to the GWAA.txt file. Read through the function to get a general idea of the format in regards to these four steps. Further details regarding the function elements are provided below.

GWAA <- function(genodata, phenodata, filename = NULL, append = FALSE, workers = getOption("mc.cores", 
    2L), flip = TRUE, select.snps = NULL, hosts = NULL) {
    
    if (is.null(hosts)) {
        cl <- makeCluster(workers)
    } else {
        cl <- makeCluster(hosts, "PSOCK")
    }
    show(cl)
    registerDoParallel(cl)
    
    # Function that will change which allele is counted (major or minor)
    flip.matrix <- function(x) {
        zero2 <- which(x == 0)
        two0 <- which(x == 2)
        x[zero2] <- 2
        x[two0] <- 0
        return(x)
    }
    
    foreach(part = 1:nSplits) %do% {
        genoNum <- as(genodata[, snp.start[part]:snp.stop[part]], "numeric")
        # flip.matrix function employed
        if (isTRUE(flip)) 
            genoNum <- flip.matrix(genoNum)
        rsVec <- colnames(genoNum)
        res <- foreach(snp.name = rsVec, .combine = "rbind") %dopar% {
            a <- summary(glm(phenotype ~ . - id, family = gaussian, data = cbind(phenodata, 
                snp = genoNum[, snp.name])))
            a$coefficients["snp", ]
        }
        
        write.table(cbind(rsVec, res), filename, append = TRUE, quote = FALSE, 
            col.names = FALSE, row.names = FALSE)
        cat(sprintf("GWAS SNPs %s-%s (%s%% finished)\n", snp.start[part], snp.stop[part], 
            100 * part/nSplits))
    }
    
    stopCluster(cl)
    return(print("Done."))
}

Parallel processing

We have mentioned that this function determines how many cores to use for parallel processing, but have not yet emphasized why we would want to do this. Due to the large number of models that require fitting, the GWA analysis can be deployed in parallel across multiple processors or machines to reduce the running time. In this function, we demonstrate two basic methods for performing parallel processing using the doParallel package. This will be carried out differently depending on whether or not the analysis is run on a UNIX based system, though the arguments are the same. The user can specify the number of parallel processes using the worker argument (set to two by default). Increasing this number will simply speed up the analysis by utilizing more cores for jobs to be assigned to.

Genotype format

Next, you will notice the flip.matrix function, which is utilized in the foreach loop. Before using it, the genotype data is first converted to numeric format using the ‘as’ function from snpStats. The genotypes of each SNP are then coded as continuous, thereby taking on the value of 0, 1, and 2. For this example, we wish for the value of the genotype to reflect the number of minor alleles. However, following conversion our values will reflect the opposite. To fix this, the flip.matrix procedure is called on. This can be easily turned on or off using the flip argument.

Fitting the model

Now that the genotype data is counting the number of minor alleles present at each SNP, we can fit a linear model with the normalized HDL variable as the outcome. This is done using the glm function provided by R. For each model (i.e. each SNP), we are interested in the coefficient estimate of the SNP predictor, and thus return its model summary elements. The function runs this analysis on smaller subsets of the data one at a time. Below, we will split the data for analysis into 10 subsets (details to come).

Exercise

How would you interpret the coefficient estimate of a given SNP? Hint: Keep in mind the type of genetic model we are assuming.

Analysis

We have now walked through the different aspects of analysis Another way to speed up analysis would be to subset the data for SNPs of interest. A smaller subset would intuitively imply a shorter run time. The additional argument that’s helpful in this context is the select.snps parameter, set to NULL by default. This allows the user to subset the analysis via a vector of SNP names rather than run analysis on all available SNPs. If your interested in, say, SNPs that fall in a specified region of the genome, you could subset your data for these SNPs and run the GWAA function on these alone. Here, for example, running analysis on all the genotyped and imputed SNPs we have would take approximately six hours with only two cores employed. For demonstrative purposes, we will only run analysis in this lab on SNPs that fall on chromosomes 15-17. Run the following code to subset the data:

SNPs_sub <- genoBim$SNP[genoBim$chr==15 | genoBim$chr==16 
                        | genoBim$chr==17]
genodata_sub <- genodata[,colnames(genodata)%in%SNPs_sub]

Now that the data is ready, we want to split up the genotype data into smaller subsets so that the GWAA function can run analysis on each subset one at a time (as mentioned above). Here, we choose to split the data into 10 subsets. Using the ceiling() function, we round up the number of SNPs in each subset, which can be found by dividing number of SNPs by 10. Then, by using seq() and pmin(), we can find the index of the first and the last SNP in each group. This will allow the GWAA function to run analysis on each subset separately by choosing SNPs within the range of the first and last within a group. The function will print its progress after analysis on each subset is finished.

nSNPs <- ncol(genodata_sub)
nSplits <- 10
genosplit <- ceiling(nSNPs/nSplits) # number of SNPs in each subset
snp.start <- seq(1, nSNPs, genosplit) # index of first SNP in group
snp.stop <- pmin(snp.start+genosplit-1, nSNPs) # index of last SNP in group

Exercise

Approximately how many SNPs are in each subset?

We are now ready to perform the analysis and write the results out to the GWAA.txt file. Keep in mind, that while this analysis is only running on 3 chromosomes, it will still take a few minutes to complete. This process can be faster if more cores are chosen. The detectCores() function from the parallel package can be used to determine how many cores your machine has available.

start <- Sys.time()
GWAA(genodata_sub, phenodata, filename="GWAA.txt")
## socket cluster with 2 nodes on host 'localhost'
## GWAS SNPs 1-3237 (10% finished)
## GWAS SNPs 3238-6474 (20% finished)
## GWAS SNPs 6475-9711 (30% finished)
## GWAS SNPs 9712-12948 (40% finished)
## GWAS SNPs 12949-16185 (50% finished)
## GWAS SNPs 16186-19422 (60% finished)
## GWAS SNPs 19423-22659 (70% finished)
## GWAS SNPs 22660-25896 (80% finished)
## GWAS SNPs 25897-29133 (90% finished)
## GWAS SNPs 29134-32368 (100% finished)
## [1] "Done."
end <- Sys.time()
print(end-start)
## Time difference of 2.71885 mins

Saving work for following labs…

You can save the necessary data from this lab by running the following commands:

save(genoBim, imputed, target, rules, phenodata, genodata_sub, support, GWAA, 
    file = "m3_lab1_save.RData")

On your own

Use the read.table() and head() commands to investigate the results of the above analysis. Can you guess how we will determine if a given SNP is a significant predictor of HDL? What analytic challenges might we have to consider?