Genome-wide association analysis - Association analysis of non-typed SNPs

(This lab was adapted for statsTeachR by Sara Nunez, Nicholas Reich and Andrea Foulkes from our GWAS tutorial.)

Now that we have conducted the association analysis for typed SNPs, we now carry out analysis for the imputed data. Several stand-alone packages can be applied to conduct this analysis like MACH2qtl/dat and ProbABEL, which utilize the posterior probabilities calculated in the imputation step. However, in this lab, we use the R package snpStats to perform the analysis based on the imputation rules we calculated previously. Recall that in the GWAS2 module, we only imputed SNPs on chromosome 16.

Getting started

To prepare, we first need to load the data saved from last lab and install necessary packages.

load("m3_lab1_save.RData")
library(snpStats)
library(plyr)

Model fitting of imputed SNPs

Here we use the snp.rhs.tests function from snpStats to fit a generalized linear model similarly to how we did in the previous lab. The difference between this function and the glm function used for typed SNPs, however, is that snp.rhs.tests can handle imputed SNPs with the optional parameter ‘rules’. This allows the function to take into account the uncertainty that comes with imputed SNPs. In the following code, we fit this model by specifying the variables from phenodata that we want to include in the model. We then optain the SNP p values and write out a results file named ‘impute.csv’.

rownames(phenodata) <- as.character(phenodata$id)

imp <- snp.rhs.tests(phenotype ~ sex + age + pc1 + pc2 + pc3 + pc4 + pc5 + pc6 + 
    pc7 + pc8 + pc9 + pc10, family = "Gaussian", data = phenodata, snp.data = target, 
    rules = rules)

results <- data.frame(SNP = imp@snp.names, p.value = p.value(imp), stringsAsFactors = FALSE)
results <- results[!is.na(results$p.value), ]

write.csv(results, "impute.csv", row.names = FALSE)

The results for SNPs that were successfully fit are then combined with the chromosome and position information by merging the results table with the support information, and adding a column for chromosome. We also add a column indicating that these SNPs were imputed to help distinguish them in the post-analytic interrogation. We also want to take \(−log_{10}(\mbox{p values})\) for plotting purposes in the next module. Run the following code and see what the first few observations in the final data frame look like.

imputeOut <- merge(results, support[, c("SNP", "position")])
imputeOut$chr <- 16

imputeOut$type <- "imputed"

imputeOut$Neg_logP <- -log10(imputeOut$p.value)

imputeOut <- arrange(imputeOut, p.value)
print(head(imputeOut))
##          SNP      p.value position chr    type Neg_logP
## 1  rs1800775 3.970058e-08 56995236  16 imputed 7.401203
## 2  rs3816117 3.970058e-08 56996158  16 imputed 7.401203
## 3  rs1532624 4.763363e-08 57005479  16 imputed 7.322086
## 4  rs7205804 4.763363e-08 57004889  16 imputed 7.322086
## 5 rs12933833 2.112374e-05 56697684  16 imputed 4.675229
## 6 rs11076159 2.400306e-05 56670827  16 imputed 4.619733

Exercise

What do you notice about SNPs that lie near each other on chromosome 16? Is there anything that makes you concerned? If so, why?

Mapping associated SNPs to genes

Using a separate data file containing the chromosome and coordinate locations of each protein-coding gene, we can map all of the imputed SNPs to relative genes. This will be informative when trying to make sense of significant findings.

We use the following function to extract the subset of SNPs that are in or near a gene of interest. Be sure to run the following code chunk to save the map2gene() function to your workspace.

map2gene <- function(gene, coords, SNPs, extend.boundary = 5000) {
  coordsSub <- coords[coords$gene == gene,] 

  coordsSub$start <- coordsSub$start - extend.boundary 
  coordsSub$stop <- coordsSub$stop + extend.boundary

  SNPsub <- SNPs[SNPs$position >= coordsSub$start & SNPs$position <= coordsSub$stop &
                 SNPs$chr == coordsSub$chr,] 

  return(data.frame(SNPsub, gene = gene, stringsAsFactors = FALSE))
}

The SNP with the lowest p value in both the typed and imputed SNP analysis lies within the boundaries of the cholesteryl ester transfer protein gene, CETP. We can call the map2gene() function for CETP to filter the imputed data for SNPs that lie within or \(\pm 5kb\) of it. Once we know which SNPs are in or near CETP, we then subset the genotype data stored in impute for the same list of SNPs. This data will be used for post-analytic interrogation to follow.

# Read in file containing protein coding genes coords
library(downloader)
ProdCod <- download("https://www.mtholyoke.edu/courses/afoulkes/Data/statsTeachR/ProCodgene_coords.csv", 
    destfile = "ProCodgene_coords.csv")
genes <- read.csv("ProCodgene_coords.csv", stringsAsFactors = FALSE)

# Subset for CETP SNPs
impCETP <- map2gene("CETP", coords = genes, SNPs = imputeOut)

# Filter for CETP SNP genotypes
impCETPgeno <- imputed[, impCETP$SNP]

Exercise

How many genes are in the ‘genes’ data frame? What information does this file provide and why will we need it? How many genes are located on chromosome 16?

Exercise

Explain step-by-step what the map2gene() function does? What does each statement in the function do?

Saving work for following labs…

You can save the necessary data from this lab by running the following commands:

save(genoBim, imputed, target, rules, phenodata, support, genes, impCETP, impCETPgeno, 
    imputeOut, GWAA, map2gene, file = "m3_lab2_save.RData")

On your own

  • Choose a different gene on chromosome 16 to subset the imputed SNPs for. Run the map2gene() function on it.
  • Is this gene bigger or smaller than the CETP gene in terms of number of imputed SNPs? How many SNPs were in CETP and how many are in the gene you chose?
  • If you chose a gene that didn’t have any SNPs within or near it, why do you think this happened?